SEQanswers

Old 05-17-2017, 05:46 PM   #1
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 61
Best way to build consensus of short reads spanning viral gene

I have a collection of Illumina HiSeq 2000 reads that should span a specific coding region in a viral genome. The region these reads cover is 2625bp. What I want to do is generate a consensus of that region from all my reads.

The only thing I've tried so far is IDBA_UD. I downsampled to ~100x and ran it, but the assembly contigs summed up much larger than the region I know these reads should span. I also tried using all the data, but that was even further off base.

I have excessive coverage (~77000x), but the reads are from a population of quasispecies and have some variation. What would be the best tool to use to generate a consensus?
Old 05-17-2017, 06:14 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669

BBMap's Tadpole (which I wrote) seems to do a good job of viral assembly at any coverage, both in my experience and from what I've seen from others, so I suggest you give that a try. In some cases normalizing or subsampling the data can also improve assemblies, so that's worth trying as well. You already tried subsampling, but a different tool may give different results. The BBMap package also includes BBNorm (which can normalize data) and Reformat (which can subsample the data); some assemblers simply cannot handle super-high coverage, so those operations can often make assemblers produce good assemblies from data that violates their heuristics.
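As a rough illustration of the subsampling arithmetic, here is a minimal Python sketch that works out what fraction of reads to keep to go from the ~77000x mentioned in the question down to ~100x. The function name `subsample_fraction` is hypothetical and not part of BBMap; the result is only the kind of fraction one would hand to a subsampling tool.

```python
def subsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep so that expected coverage drops to the target."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverages must be positive")
    # Cap at 1.0: a sample cannot contain more reads than the input.
    return min(1.0, target_coverage / current_coverage)


# Coverage figures taken from the thread: ~77000x down to ~100x.
fraction = subsample_fraction(77000, 100)
print(f"keep roughly {fraction:.6f} of the reads")
```

Random subsampling keeps the relative abundance of variants about the same, so it lowers depth without removing the quasispecies variation itself.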

Also - you did not mention anything about preprocessing. That can be very useful prior to assembly - adapter-trimming, contaminant-filtering, quality-trimming, reagent DNA removal, human DNA removal, etc. It's possible that much of your assembly is contaminant rather than genomic content of the virus in question.

Last edited by Brian Bushnell; 05-17-2017 at 06:21 PM.
Old 05-18-2017, 05:27 PM   #3
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 61

Thanks for the reply! I went and tried Tadpole and I'm trying various things to fine tune the assembly. One thing I'm wondering is if there is a way to do a reference guided assembly in Tadpole?

Also, are there parameters you can suggest tweaking to try and be a bit more forgiving with regards to polymorphism in my input reads?
Old 05-18-2017, 05:57 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669

Tadpole cannot do reference-guided assemblies - it is purely de novo. It's also rather unforgiving of polymorphisms, intentionally, to prevent misassemblies and assembly errors. However, you can often substantially increase the contiguity of viral assemblies by adjusting the branch multiplier flags - those tell it when to stop extending a contig because there is a branch in the graph, typically caused by a repeat or polymorphism. For example:

bm1=8 bm2=2.5

...will often substantially increase contiguity. You can reduce them even more from the defaults (20 and 3, respectively) to find the optimum (setting them both to 1 will not yield an optimal result). I developed the default cutoffs for bacteria so they're not really ideal for viruses, and in fact, I don't know if it's possible in general to find good defaults for viruses because they tend to be very different and mutate rapidly.
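To make the idea of a branch cutoff concrete, here is a toy Python sketch. It is not Tadpole's code: it assumes the multiplier acts as a minimum ratio between the depth of the best path and the depth of the second-best path at a branch, and the function name `extend_through_branch` is made up for illustration.

```python
def extend_through_branch(depths, branch_mult):
    """Return True if the dominant path is strong enough to keep extending.

    depths: k-mer depths of the competing paths at a branch point.
    branch_mult: assumed minimum ratio of best depth to second-best depth.
    """
    ranked = sorted(depths, reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return True  # no real competing path, so nothing to stop for
    return ranked[0] / ranked[1] >= branch_mult


# A minor variant present at one tenth of the major path's depth.
branch = [900, 90]
for mult in (20, 8):
    print(mult, extend_through_branch(branch, mult))
```

Under this assumed model, a 10:1 branch stops extension at a multiplier of 20 but is walked through at 8, which is the sense in which lowering the multiplier can lengthen contigs in a mixed population.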

It's also worth trying different kmer lengths. You can do this automatically with tadwrapper.sh. For example:

tadwrapper.sh in=reads.fq out=contigs%.fa k=31,62,93,124 expand bisect

That will try various kmer lengths and try to give you the optimal one for contiguity. It's not perfect, but you can just fire it off and ignore it until it finishes, which makes things easier. I developed it for bacterial isolates and metagenomes so I'm not entirely sure what it will do for viruses, but it's worth trying, and at least I expect it to produce a better value for K than the default of 31. 31 was chosen as default simply because it is the fastest and uses the least memory, not because it's the best. Normally, a larger value is better.
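If you want to compare the per-k outputs yourself, one common contiguity metric is N50. The Python sketch below computes it and picks the k with the highest value. The contig lengths are invented purely for illustration, and the thread does not say which metric tadwrapper.sh uses internally, so treat N50 here as an assumption.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths per k-mer size; not real results.
assemblies = {
    31: [900, 700, 500, 400],
    62: [1800, 600, 300],
    93: [2600],
}

best_k = max(assemblies, key=lambda k: n50(assemblies[k]))
print("k with highest N50:", best_k)
```

For a single 2625bp target region, simply checking whether one contig spans most of the region may be just as informative as N50.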

You will often also get better contiguity if you first error-correct the reads with Tadpole. For example:

tadpole.sh in=reads.fq out=corrected.fq ecc k=62

Last edited by Brian Bushnell; 05-18-2017 at 05:59 PM.
Old 05-19-2017, 08:40 AM   #5
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 61

Thanks Brian, I'll try playing a bit more. I'll try using tadpole's error correction too in case it deals with cases that I haven't already corrected.
Old 05-19-2017, 09:45 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669

OK! Please let me know what settings you find to be optimal in your situation, and also whether Tadpole was better or worse than other assemblers.
Old 05-25-2017, 10:55 AM   #7
jmartin
Member
 
Location: St. Louis

Join Date: Dec 2009
Posts: 61

It looks like the variation between quasispecies is making it difficult for tadpole to accomplish what I need, which is a sort of 'central' consensus amongst all these quasispecies which can serve as an anchor reference for mapping between samples. Tadpole ends up building a number of overlapping contigs, as well as leaving some gaps in coverage where maybe the input data is too confusing (too many 'haplotypes' of varying abundances?).

I think tadpole would be pretty nice as an assembler if I was working with homogeneous samples, but for my use case it may not be the right tool. I don't think it's doing anything wrong, since most people would probably want to keep the strains separate. I just have an unusual task.
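For what a 'central' consensus could mean in practice, here is a minimal Python sketch of a per-position majority vote. It assumes the reads have already been placed on a shared coordinate system (for example by mapping to some draft of the region) with no indels, and it ignores base qualities. The function `majority_consensus` and the toy reads are hypothetical and are not something Tadpole provides.

```python
from collections import Counter


def majority_consensus(placed_reads, region_length, gap_char="N"):
    """Most frequent base per position from reads placed on shared coordinates.

    placed_reads: iterable of (start, sequence) pairs, start is 0-based.
    Positions with no coverage are filled with gap_char.
    """
    columns = [Counter() for _ in range(region_length)]
    for start, seq in placed_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < region_length:
                columns[pos][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else gap_char for col in columns
    )


# Toy reads; the third one carries a minority variant at position 3.
reads = [(0, "ACGTAC"), (2, "GTACGG"), (2, "GAACGG"), (4, "ACGGTT"), (1, "CGTAC")]
print(majority_consensus(reads, 10))
```

A vote like this collapses minority haplotypes into the dominant base at each position, which is the opposite of what an assembler that stops at branches is trying to do.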
All times are GMT -8. The time now is 10:48 AM.

