  • Best way to build consensus of short reads spanning viral gene

    I have a collection of Illumina HiSeq 2000 reads that should span a specific coding region in a viral genome. The region these reads cover is 2625bp. What I want to do is generate a consensus of that region from all my reads.

    The only thing I've tried so far is IDBA_UD. I downsampled to ~100x and ran it, but the total length of the assembled contigs was much larger than the region I know these reads should span. I also tried using all the data, but that was even further off.

    I have excessive coverage (~77000x), but the reads come from a population of quasispecies and show some variation. What would be the best tool for generating a consensus?

  • #2
    BBMap's Tadpole (which I wrote) seems to do a good job of viral assembly for any coverage, both in my experience, and from what I've seen from others, so I suggest you give that a try. In some cases normalizing or subsampling the data can also improve assemblies, so that's worth trying as well. You already tried subsampling, but it's possible that a different tool would give different results. The BBMap package also includes BBNorm (which can normalize data) and Reformat (which can subsample the data); some assemblers simply cannot handle super-high coverage, so those operations can often make assemblers produce good assemblies from data that violates their heuristics.
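For anyone working out how hard to subsample, here is a minimal sketch of the arithmetic. It is not part of BBMap; the function names are hypothetical, and the 100 bp read length is an assumption (the thread does not state it). The resulting fraction is the kind of value you would hand to whichever subsampling tool you use.

```python
def subsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep so that expected coverage drops to target_coverage."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverages must be positive")
    # Never ask for more reads than exist.
    return min(1.0, target_coverage / current_coverage)


def expected_reads(region_length: int, coverage: float, read_length: int) -> int:
    """Approximate number of reads needed to cover region_length at the given depth."""
    return round(region_length * coverage / read_length)


# Figures from the original post: ~77000x coverage over a 2625 bp region.
fraction = subsample_fraction(77000, 100)
print(f"keep about {fraction:.5f} of the reads to reach ~100x")

# ASSUMPTION: 100 bp reads; adjust to the real read length.
print(expected_reads(2625, 100, 100), "reads of 100 bp give ~100x over 2625 bp")
```

At these numbers only about one read in 770 is kept, so it may be worth repeating the subsample with a few different random seeds to see how stable the assembly is.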

    Also - you did not mention anything about preprocessing. That can be very useful prior to assembly - adapter-trimming, contaminant-filtering, quality-trimming, reagent DNA removal, human DNA removal, etc. It's possible that much of your assembly is contaminant rather than genomic content of the virus in question.
    Last edited by Brian Bushnell; 05-17-2017, 06:21 PM.


    • #3
      Thanks for the reply! I tried Tadpole and I'm experimenting with various settings to fine-tune the assembly. One thing I'm wondering is whether there is a way to do a reference-guided assembly in Tadpole.

      Also, are there parameters you would suggest tweaking to make it a bit more forgiving of polymorphism in my input reads?


      • #4
        Tadpole cannot do reference-guided assemblies - it is purely de novo. It is also rather unforgiving of polymorphisms, intentionally, to prevent misassemblies and assembly errors. However, you can often substantially increase the contiguity of viral assemblies by adjusting the branch multiplier flags - those tell it when to stop extending a contig because there is a branch in the graph, typically caused by a repeat or polymorphism. For example:

        bm1=8 bm2=2.5

        ...will often substantially increase contiguity. You can reduce them even further from the defaults (20 and 3, respectively) to find the optimum (setting them both to 1 will not yield an optimal result). I developed the default cutoffs for bacteria so they're not really ideal for viruses, and in fact, I don't know if it's possible in general to find good defaults for viruses because they tend to be very different and mutate rapidly.
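To give a feel for why lowering a branch multiplier lets contigs run through polymorphic sites, here is a toy sketch. It assumes the multiplier acts as a minimum depth ratio between the best-supported extension and the runner-up; that is a simplification for illustration, not Tadpole's actual code, and `keeps_extending` is a hypothetical name.

```python
def keeps_extending(next_kmer_depths, branch_mult):
    """Return True if the deepest extension dominates the runner-up by branch_mult.

    next_kmer_depths: depths of the candidate next kmers at a graph node.
    ASSUMPTION: extension continues only when best >= branch_mult * second_best.
    """
    depths = sorted(next_kmer_depths, reverse=True)
    if not depths or depths[0] == 0:
        return False  # dead end: nothing to extend into
    if len(depths) == 1 or depths[1] == 0:
        return True  # no competing branch
    return depths[0] >= branch_mult * depths[1]


# A site where a minor variant carries 10% of the depth.
site = [900, 100]
for mult in (20, 8):
    print(mult, keeps_extending(site, mult))
```

Under this assumption a 9:1 variant stops extension at a multiplier of 20 but not at 8, which matches the intuition that lower values tolerate more polymorphism at the cost of a higher risk of misassembly.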

        It's also worth trying different kmer lengths. You can do this automatically with tadwrapper.sh. For example:

        tadwrapper.sh in=reads.fq out=contigs%.fa k=31,62,93,124 expand bisect

        That will try various kmer lengths and try to give you the optimal one for contiguity. It's not perfect, but you can just fire it off and ignore it until it finishes, which makes things easier. I developed it for bacterial isolates and metagenomes so I'm not entirely sure what it will do for viruses, but it's worth trying, and at least I expect it to produce a better value for K than the default of 31. 31 was chosen as default simply because it is the fastest and uses the least memory, not because it's the best. Normally, a larger value is better.
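If you want to compare the assemblies from different kmer lengths yourself, one common contiguity summary is N50. The sketch below is a hedged illustration: the post does not say which metric tadwrapper.sh optimizes, so N50 is an assumption here, and the contig lengths are made-up example values.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def best_k(lengths_by_k):
    """Pick the k with the highest N50; ties go to the smaller k."""
    return max(sorted(lengths_by_k), key=lambda k: n50(lengths_by_k[k]))


# HYPOTHETICAL contig lengths per kmer length, for illustration only.
lengths_by_k = {
    31: [900, 700, 500, 300],
    62: [1800, 600],
    93: [2400, 200],
    124: [1200, 800, 400],
}
for k in sorted(lengths_by_k):
    print(k, n50(lengths_by_k[k]))
print("best k by N50:", best_k(lengths_by_k))
```

N50 alone can reward misjoined contigs, so for a 2625 bp target it is also worth checking that the total assembly length stays close to the expected region size.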

        You will often also get better contiguity if you first error-correct the reads with Tadpole. For example:

        tadpole.sh in=reads.fq out=corrected.fq ecc k=62
        Last edited by Brian Bushnell; 05-18-2017, 05:59 PM.


        • #5
          Thanks Brian, I'll try playing a bit more. I'll try using tadpole's error correction too in case it deals with cases that I haven't already corrected.


          • #6
            OK! Please let me know what settings you find to be optimal in your situation, and also whether Tadpole was better or worse than other assemblers.


            • #7
              It looks like the variation between quasispecies is making it difficult for Tadpole to accomplish what I need, which is a sort of 'central' consensus among all these quasispecies that can serve as an anchor reference for mapping between samples. Tadpole ends up building a number of overlapping contigs, as well as leaving some gaps in coverage where the input data may be too confusing (too many 'haplotypes' of varying abundances?).

              I think Tadpole would be pretty nice as an assembler if I were working with homogeneous samples, but for my use case it may not be the right tool. I don't think it's doing anything wrong, since most people would probably want to keep the strains separate. I just have an unusual task.
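To make the idea of a 'central' consensus concrete, here is a minimal sketch of a per-position plurality vote. It assumes the reads have already been placed on a shared coordinate system without gaps, which is a strong simplification (real data needs an aligner and indel handling); the function name and the example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, region_length, min_depth=1):
    """Build a per-position plurality consensus from gapless aligned reads.

    aligned_reads: iterable of (start, sequence) with 0-based start positions.
    Positions with depth below min_depth (or no coverage) are reported as 'N'.
    """
    columns = [Counter() for _ in range(region_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < region_length:
                columns[pos][base] += 1

    consensus = []
    for counts in columns:
        if not counts or sum(counts.values()) < min_depth:
            consensus.append("N")
        else:
            # Sort first so that ties break alphabetically and the result is deterministic.
            base, _ = max(sorted(counts.items()), key=lambda item: item[1])
            consensus.append(base)
    return "".join(consensus)


# HYPOTHETICAL reads: the third one carries a minor variant at position 4.
reads = [(0, "ACGTAC"), (0, "ACGTAC"), (2, "GTTCGG"), (4, "ACGGTA"), (4, "ACGGTA")]
print(majority_consensus(reads, 10))
```

The minor variant is outvoted at its position, so the output follows the most abundant base everywhere, which is the 'anchor' behaviour described above rather than a strain-resolved assembly.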


              • #8
                Originally posted by jmartin
                It looks like the variation between quasispecies is making it difficult for Tadpole to accomplish what I need, which is a sort of 'central' consensus among all these quasispecies that can serve as an anchor reference for mapping between samples. Tadpole ends up building a number of overlapping contigs, as well as leaving some gaps in coverage where the input data may be too confusing (too many 'haplotypes' of varying abundances?).

                I think Tadpole would be pretty nice as an assembler if I were working with homogeneous samples, but for my use case it may not be the right tool. I don't think it's doing anything wrong, since most people would probably want to keep the strains separate. I just have an unusual task.
                Hi Martin,

                Sorry for reviving this old thread, but I am in a similar situation and would like to know whether you have any new ideas after 2 years.

                If I understand correctly, your sequencing data is not "pure": it has fairly high diversity/polymorphism, although the sequences share a similar backbone. In this situation, most assemblers break contigs at the ambiguous sites, which leads to many contigs instead of a single "consensus" contig.

                So, may I ask whether you have found any tools that can do this more forgiving kind of assembly?

                Also, in my case I do have a reference sequence from a related species, but it differs from the sequenced genome by some insertions and deletions. I therefore think most reference-mapping tools (e.g., BWA) cannot be used to build the consensus, because they do not take indel information into account when generating a consensus. I think my case is similar to your question about "reference-guided assembly". If so, could you suggest any tools that would help me get the consensus sequence?

                Thanks.
