SEQanswers

Old 10-14-2015, 08:54 PM   #21
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 799

I expect that this particular method could be replaced by an 'if' statement:

Code:
if( ((c >= 'a') && (c <= 'z')) || ((c >= 'A') && (c <= 'Z')) || ((c >= '0') && (c <= '9')) )
Old 10-14-2015, 09:08 PM   #22
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

True... but I still want to evaluate the difference in speed between that and a lookup-array - "if(array[char])", which would only require 128 bytes (assuming negative values were prefiltered, which they are, and that bytewise operations are faster than bitwise operations, which they also are) and a single L1 cache lookup.

I am assuming Java's native methods use a clever bitwise-and to determine whether the character is alphabetic in a single cycle* without a memory access, but if not, there's no reason to depend on library operations.

*Note - I'm not sure whether this is actually possible, it just seems likely.
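As a rough sketch (not BBTools' actual code), the lookup-array idea could look like this in Java; `AlnumLookup` and `isAlnum` are hypothetical names, and, as noted above, it assumes negative byte values have already been prefiltered:

```java
public class AlnumLookup {
    // 128-entry table: one boolean per ASCII code point, built once.
    private static final boolean[] ALNUM = new boolean[128];
    static {
        for (char c = 'a'; c <= 'z'; c++) ALNUM[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) ALNUM[c] = true;
        for (char c = '0'; c <= '9'; c++) ALNUM[c] = true;
    }

    // Single array load instead of up to six comparisons.
    // Assumes b has been prefiltered to the range 0..127.
    public static boolean isAlnum(byte b) {
        return ALNUM[b];
    }

    public static void main(String[] args) {
        System.out.println(isAlnum((byte) 'G')); // true
        System.out.println(isAlnum((byte) '-')); // false
    }
}
```

The table fits easily in L1 cache; whether it actually beats the branchy version is exactly the kind of thing worth microbenchmarking, as discussed above.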

Last edited by Brian Bushnell; 10-14-2015 at 09:11 PM.
Old 10-15-2015, 10:06 PM   #23
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 421

Quote:
Originally Posted by gringer
Oh, okay. I guess I missed the "merge" step of the assembly then. I just looked at the first sentence and didn't realise Tadpole was only an error corrector / extender:
I'm getting a little confused by this conversation. Tadpole does act as an assembler when it isn't in read-extend mode, right? I throw reads at it and it generates long contigs (like entire mitochondria). Are gringer's results as they are because in and in2 were used, priming Tadpole to be in a different mode like extend-only?

I was earlier confused about extend mode, thinking that it applied to contigs, but extending my reads and then assembling with longer kmers gave a much better assembly with longer contigs.

The rinse option mentioned I am not sure about, though. How does that affect the output?

I also noticed the newer version of tadpole reports on scaffolds, but I thought it wasn't paired-end aware during assembly? How does the scaffolding come into play?

Sorry for the questions but I feel like there are all sorts of aspects to the tools that I am underutilizing.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 10-15-2015, 11:02 PM   #24
Brian Bushnell

In default mode, Tadpole assembles reads and produces contigs. In "extend" or "correct" mode, it will extend or correct input sequences - which can be reads or contigs, but it's designed for reads. When I use Tadpole for assembly, I often first correct the reads, then assemble them, which takes two passes. Tadpole will build contigs unless you explicitly add the flag "mode=extend" or "mode=correct", regardless of whether you have 1 or 2 inputs. In extend or correct mode, it will modify the input reads, and not make contigs.
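For example, the two-pass correct-then-assemble workflow described above might look like this (file names are placeholders; mode=correct is the flag named in this post):

```shell
# Pass 1: error-correct the reads; no contigs are produced in this mode
tadpole.sh in=reads.fq out=corrected.fq mode=correct

# Pass 2: assemble the corrected reads into contigs (default mode)
tadpole.sh in=corrected.fq out=contigs.fa
```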

I'm glad to hear that you've achieved more contiguous assemblies after extending the reads and assembling them with longer kmers - that was my goal in designing that mode (and particularly, to allow merging of non-overlapping reads), but unfortunately I've been too busy to test it thoroughly. You've given me a second data point about it being beneficial, though, so thanks!

"shave" and "rinse" are what some assemblers call "tip removal" and "bubble removal". But, they are implemented a bit differently and occur before the graph is built, rather than as graph simplification routines. As such, they pose virtually no risk of causing misassemblies, and reduce the risk of misassemblies due to single chimeric reads. But unfortunately, in my experience, they also only provide very minor improvements in continuity or error-correction. Sometimes they make subsequent operations faster, though. By default, adding the flag "shave" will remove dead-end kmer paths of depth no more than 1 and length no more than 150 that branch out of a path with substantially greater depth. "rinse", similarly, only removes short paths of depth no more than 1 in which each end terminates in a branch node of substantially greater depth. Because these operations are so conservative, they seem to have little impact. Assemblers like Velvet and AllPaths-LG can collapse bubbles with a 50-50 split as to the path depth, which greatly increases the continuity (particularly with diploid organisms), but poses the risk of misassemblies when there are repeat elements. Tadpole always errs on the side of caution, preferring lower continuity to possible misassemblies.
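As a sketch, both can be enabled at assembly time just by adding the flags (file names are placeholders):

```shell
# Conservative tip ("shave") and bubble ("rinse") removal during contig building
tadpole.sh in=reads.fq out=contigs.fa shave rinse
```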

Tadpole is still not pair-aware and does not perform scaffolding, though that's certainly my next goal, when I get a chance. When you generate contigs, Tadpole automatically runs AssemblyStats (which you can run as standalone using stats.sh). This mentions scaffolds in various places, because it's designed for assemblies that are potentially scaffolded, but you'll note that for Tadpole the scaffold statistics and contig statistics are identical.

Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary as the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
Old 10-16-2015, 02:23 AM   #25
gringer

Quote:
Originally Posted by SNPsaurus
I'm getting a little confused by this conversation. Tadpole does act as an assembler when it isn't in read-extend mode, right? [...]
Thanks SNPsaurus. I looked at the examples that were provided in the first post and assumed that they were the most appropriate way to run the program. It seems a little odd that an assembler would stop assembling when you give it more parameters. Your comment has cleared this up a bit for me, and now I see that running without any arguments at all gives reasonable results.

I've now got a 3.6kb sequence that has crazy-high coverage (140564.6) which I assume is probably the viral sequence of interest. Here's the command I ran:

Code:
tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz  out=extended_assembled.fq
Presumably there are a few little tweaks I can add to that to get it working even better.

Last edited by gringer; 10-16-2015 at 02:31 AM.
Old 10-16-2015, 04:19 AM   #26
Brian Bushnell

Quote:
Originally Posted by gringer
I've now got a 3.6kb sequence that has crazy-high coverage (140564.6) which I assume is probably the viral sequence of interest. [...]
If you have super-high coverage like that, at a minimum, increasing K above the default (31) is usually helpful.
Old 10-16-2015, 10:01 AM   #27
SNPsaurus

Thanks Brian for the notes. I'm actually using it for assembling data that resembles ChIP-Seq data, with high coverage of random shotgun fragments at discrete loci. The mtDNA is just along for the ride!

I do have paired-end data, and your comments about merging make me wonder if I should add a step to the process.

Right now I take my fastq reads and extend with ecc=true, so extend with error correction. I then assemble at a higher kmer (71).

But would it be better to extend with error correction and then merge the paired reads, and then assemble? The fragments sequenced are longer than the read lengths, so I couldn't merge on the raw sequence files. But after extension they might reach each other. I was thinking of using a tool like http://www.ncbi.nlm.nih.gov/pubmed/26399504 but perhaps read extension and merging would achieve the same results.

edit: now I see BBMerge includes Tadpole extension as an option!

Last edited by SNPsaurus; 10-16-2015 at 11:18 AM.
Old 10-17-2015, 04:48 AM   #28
gringer

Quote:
Originally Posted by Brian Bushnell
If you have super-high coverage like that, at a minimum, increasing K above the default (31) is usually helpful.
Increasing k didn't help. If anything, it made the assembly worse. I've been able to generate 110 contigs with a coverage >500 and average length around 1000bp. I've attached a visual representation of the resulting FASTA file where sequences between homopolymers are coloured based on their GC%,AC%,TC%. Unfortunately, I can't see any common subsequences that would work for further merging.

Code to generate this type of image can be found here:

https://github.com/gringer/bioinfscr...r/fasta2svg.pl
Attached Images
File Type: png long_ecc_extended_assembled.png (122.4 KB, 20 views)
Attached Files
File Type: gz long_ecc_extended_assembled.svg.gz (71.5 KB, 0 views)
Old 10-17-2015, 12:45 PM   #29
Brian Bushnell

With extremely high coverage, it is often beneficial to normalize first, for any assembler. For example -

Code:
bbnorm.sh in=reads.fq out=normalized.fq min=2 target=100

You don't need that for 500-fold coverage, but it's worth trying for 140,000x coverage.
Old 10-20-2015, 04:16 PM   #30
gringer

I've recently discovered that these are transcript sequences, not genomic sequences (despite the initial assurance against that). So the 140,000x coverage and 500x coverage could possibly be for adjacent sequences, and it is [possibly] reasonable to expect target genes with coverage below 500.

Last edited by gringer; 10-20-2015 at 04:19 PM.
Old 10-28-2015, 02:17 PM   #31
RobMaher
Junior Member
 
Location: Cambridge, MA

Join Date: Oct 2015
Posts: 2

Hi, Brian,

Thanks for building this amazing assembler. I have been using it from within the Geneious software package for a few weeks now with great success. Quick question: is it possible to enter a command line under 'custom Tadpole options' that will cause the assembler to use only a specific percentage of the data? Also, is it possible to specify a maximum number of reads to use when performing the assembly (as opposed to a percentage)?

I ask because I am trying to assemble a large number of plasmids from MiSeq FASTQ data, and I've found that the results are much better when using only 10% of the data. However, there is a bug wherein this option is missing from the workflow version of Tadpole, and another wherein it defaults to 100% when attempting to assemble each sequence list separately. They know about the first bug and are fixing it, but I thought there might be a quick workaround via the command line.

Thanks!

Rob
Old 10-28-2015, 04:59 PM   #32
Brian Bushnell

Hi Rob,

Tadpole does not do subsampling; you'd have to first sample 10% of the reads with another tool (such as Reformat, with the "samplerate=0.1" flag). However, you CAN restrict the input to a certain number of reads. "reads=400k", for example, will only use the first 400,000 reads (or if the reads are paired, the first 400,000 pairs).

Tadpole also has "mincr" and "maxcr" flags which allow you to ignore kmers with counts outside of some band; this is sometimes useful for ignoring low-depth junk that can occur when you have too much data:

Quote:
mincountretain=0 (mincr) Discard kmers with count below this.
maxcountretain=INF (maxcr) Discard kmers with count above this.
Also, I recommend you try normalization (targeting maybe 100x coverage) - there are many situations where normalization gives a better assembly than subsampling, particularly when the coverage of features is highly variable.
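Putting those suggestions together, a sketch of the three options (file names and the 10% rate are placeholders; flags as discussed in this thread):

```shell
# Option 1: cap the input at the first 400,000 reads (or pairs)
tadpole.sh in1=r1.fq in2=r2.fq out=contigs.fa reads=400k

# Option 2: random ~10% subsample first, with Reformat
reformat.sh in1=r1.fq in2=r2.fq out1=s1.fq out2=s2.fq samplerate=0.1

# Option 3: normalize to ~100x, often better when coverage is uneven
bbnorm.sh in1=r1.fq in2=r2.fq out=normalized.fq target=100 min=2
```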

Good luck!
Old 10-28-2015, 09:07 PM   #33
RobMaher

Hi, Brian,

Thanks very much for your quick feedback - that does help. I was able to specify reads=1000, and that approximated the 10% for some test cases. I took a quick shot at normalizing, and I'm having some issues with it, but they may have more to do with Geneious than anything.

Thanks again!

Rob
Old 11-06-2015, 05:45 AM   #34
nepossiver
Junior Member
 
Location: São Paulo

Join Date: Oct 2012
Posts: 8

Quote:
Originally Posted by Brian Bushnell
I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). [...]
How do you identify the mitochondrial band? Could you share the script?
Old 11-06-2015, 10:52 AM   #35
Brian Bushnell

Sure:

Code:
#First link reads as reads.fq.gz

kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))
lower=$((mitopeak * 3 / 7))
mcs=$((mitopeak * 3 / 4))
mincov=$((mitopeak * 2 / 3))

tadpole.sh in=highpass_gc.fq.gz out=contigs78.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=78
tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100
tadpole.sh in=highpass_gc.fq.gz out=contigs120.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=120

bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

ln -s contigs_bbd.fa contigs.fa
This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.
Old 11-06-2015, 11:10 AM   #36
nepossiver

Quote:
Originally Posted by Brian Bushnell
This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. [...] I'm not sure how well it works on Illumina reads, though.
ok, noted. I will test on Illumina reads and report back later. Many thanks.
Old 04-18-2016, 06:09 AM   #37
JeanLove
Junior Member
 
Location: croatia

Join Date: Jan 2016
Posts: 5

Hi! Although this is a very helpful post, there is still a lot about Tadpole that is unclear to me.
As input, I have a fasta file with ~11 million reads, and I would like Tadpole to take a random 50% of the reads from this file.
I see that there is a reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50%? I cannot find any info on this parameter, so I would appreciate any help!
Old 04-18-2016, 06:29 AM   #38
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,550

Quote:
Originally Posted by JeanLove
As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file. [...]
Provide the number of reads that you want to use, e.g. reads=5500000. I don't know if that randomly subsamples, though; it may just take the first 5.5M reads. @Brian will have to clarify.

If you want to independently subsample the reads you could also use reformat.sh or seqtk (sample) from Heng Li.
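A sketch of both suggestions (file names are placeholders; samplerate is Reformat's random-sampling flag, and seqtk's -s fixes the seed for reproducibility):

```shell
# ~50% random subsample with Reformat (BBTools)
reformat.sh in=reads.fa out=sampled.fa samplerate=0.5

# The same with seqtk
seqtk sample -s100 reads.fa 0.5 > sampled.fa
```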
Old 06-30-2016, 01:37 AM   #39
ycl6
Junior Member
 
Location: Taiwan

Join Date: Feb 2013
Posts: 2

Hi, I am using tadpole from BBMap v36.11 and received some java error messages. The command I used is:

Code:
tadpole.sh -Xmx1000g in1=read1.corrected.fq.gz in2=read2.corrected.fq.gz out=contigs.fa.gz

Code:
java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
        at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
        at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
open=true

java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
        at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
        at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
        at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
open=true

Exception in thread "Thread-4" java.lang.AssertionError:
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
        at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:480)
        at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:345)
        at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:189)
        at java.lang.Thread.run(Thread.java:745)
Old 06-30-2016, 03:06 AM   #40
gringer

From the error messages, your input read files are corrupted:

Quote:
...Unexpected end of ZLIB input stream
...Unexpected end of ZLIB input stream
...The pairing may have been corrupted by an upstream process.
Do you have any other program that can read the files to check for errors? At the very least, testing the gzip integrity and comparing the uncompressed line counts would be useful:

Code:
gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz
zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l
[the integrity tests should produce no errors, and the line counts should be identical]
Tags: assembler, bbmap, bbmerge, bbnorm, bbtools, error correction, tadpole