SEQanswers > Bioinformatics
Old 06-30-2016, 10:15 AM   #41
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by gringer View Post
From the error messages, your input read files are corrupted:



Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

Code:
wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
[numbers should be identical, and this should produce no errors]
Gringer is correct; the files appear to be corrupt. But one small correction: wc can't meaningfully count lines in gzipped files (it would count lines of the compressed binary data), so I'd suggest either:

Code:
zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l
...I'm not sure how that behaves on a corrupt file, but it will produce a correct answer for an intact one. Or this:
Code:
gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz
...which will test the integrity of the files. It seems to print nothing if the file is OK; if the file is truncated, it will print:
Code:
gzip: x.gz: unexpected end of file
Most likely, the process generating the corrected files (presumably Tadpole) was killed or timed out before it finished.
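To make the failure mode concrete, here is a small self-contained sketch (file names invented for illustration) that builds a valid gzipped FASTQ record plus a truncated copy, then runs the integrity check on both:

```shell
# Build a tiny valid gzipped FASTQ record.
printf '@r1\nACGT\n+\nFFFF\n' | gzip > ok.fq.gz
# Simulate truncation, e.g. from a job that was killed or timed out.
head -c 20 ok.fq.gz > truncated.fq.gz

gzip -t ok.fq.gz && echo "ok.fq.gz: intact"        # gzip -t is silent on success
gzip -t truncated.fq.gz 2>/dev/null || echo "truncated.fq.gz: corrupt"
```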
Old 06-30-2016, 05:00 PM   #42
ycl6
Junior Member
 
Location: Taiwan

Join Date: Feb 2013
Posts: 2
Default

Hi gringer, thanks - I didn't realize the multiple fastq files weren't concatenated properly before being fed to Tadpole. I'll fix it and try again.
Old 12-01-2016, 07:22 AM   #43
bavci
Junior Member
 
Location: Bremen, Germany

Join Date: Dec 2016
Posts: 1
Default

I just saw this comment from Brian Bushnell:

P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm is a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

I recently used Tadpole for contig extension, but after reading this comment I am not sure it is a reliable tool for that. I have full-length 16S rRNA gene sequences from clone libraries. I took the variable region of the 16S rRNA gene sequences (the first 300 nt) and mapped it onto metagenome contigs that are taxonomically classified as taxa of interest. These contigs were assembled from randomly-sheared metagenomic fragment libraries (sequencing platform: HiSeq). The 300 nt of 16S rRNA gene sequence mapped perfectly onto one contig. I then ran Tadpole to extend this contig using all paired-end reads in the metagenome, and could extend it to about 6500 nt.

Is Tadpole reliable for such usages?
Old 12-01-2016, 10:24 AM   #44
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Since you are pulling the kmer information from randomly-sheared metagenomic reads, and using the variable region as a seed, it should be safe. However, there is a possibility of running into a tandem repeat region during extension when you give it a very high extension limit, so I recommend also adding the flag "ibb=f". It probably won't make any difference though.
Old 12-16-2016, 04:04 AM   #45
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Quote:
Originally Posted by Brian Bushnell View Post
BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

tadpole.sh mode=correct in=reads.fq out=corrected.fq

tadpole.sh mode=correct in1=read1.fq in2=read2.fq oute1=corrected1.fq oute2=corrected2.fq
For output, should this be out1/out2 or oute1/oute2? (I don't see "oute" in the manual.)
Old 12-16-2016, 08:43 AM   #46
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Thanks for noting that; I've corrected it. Originally, for error-correction, you had to say "oute" for the output but I changed that a while ago so the current syntax is "out1" and "out2".
Old 01-17-2017, 12:52 PM   #47
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Hi Brian,

I'm getting an error while trying to use tadpole. Tadpole.sh brings up the expected list of options, so I know that I've directed the terminal correctly.

However, adding the in=<readsfile.fastq> out=<output directory> doesn't work - I get the following error:


java -ea -Xmx15000m -Xms15000m -cp /Users/DJV/programs/bbmap/current/ assemble.Tadpole in=/Users/DJV/Desktop/IN/150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq out=/Users/DJV/Desktop/Out
Executing assemble.Tadpole1 [in=/Users/DJV/Desktop/IN/150-P6-C9, assembled, to, HXB2, Nested, Amplified, Region.fastq, out=/Users/DJV/Desktop/Out]

Exception in thread "main" java.lang.RuntimeException: Unknown parameter assembled
at assemble.Tadpole.<init>(Tadpole.java:447)
at assemble.Tadpole1.<init>(Tadpole1.java:66)
at assemble.Tadpole.makeTadpole(Tadpole.java:77)
at assemble.Tadpole.main(Tadpole.java:64)

Here's my Java version information:
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)


Any idea why I'm getting these errors?

Thanks
Jake
Old 01-17-2017, 01:07 PM   #48
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Jake,

The problem is the presence of spaces in your filepath. Try adding quotes:

Code:
java -ea -Xmx15000m -Xms15000m -cp /Users/DJV/programs/bbmap/current/ assemble.Tadpole in="/Users/DJV/Desktop/IN/150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq" out=/Users/DJV/Desktop/Out/contigs.fa
Also, you need to add a filename to the output (e.g. /Users/DJV/Desktop/Out/contigs.fa). If using quotes does not work, try renaming the file with underscores rather than spaces. Do you know what your read length and approximate depth are? Tadpole's default kmer length is 31, but with sufficient depth and read length, you will get a better assembly with a longer kmer.
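For reference, a one-line bash rename that swaps spaces for underscores (using the filename from the error message as the example; `${var// /_}` is bash pattern substitution, replacing every space with an underscore):

```shell
f="150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq"
touch "$f"                 # stand-in here for the real file
mv "$f" "${f// /_}"        # bash parameter expansion: spaces -> underscores
ls "150-P6-C9_assembled_to_HXB2_Nested_Amplified_Region.fastq"
```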
Old 01-17-2017, 01:58 PM   #49
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Quote:
Originally Posted by Brian Bushnell View Post
Do you know what your read length and approximate depth are? Tadpole's default kmer length is 31, but with sufficient depth and read length, you will get a better assembly with a longer kmer.
Hi Brian,

Reads are 150x2, and I generally have >100x coverage. I tried increasing to k=60, but that splits the input into many more contigs.

My goal here is to extract a consensus sequence for the purpose of annotating ORFs. Prior to this, I mapped the reads to a reference using BBMap, and I'm running into problems with multiple contigs being generated by Tadpole. For instance, in BBMap one of my samples has reads mapped across the entirety of the reference. If I extract the reads that mapped to the reference and use them for de novo assembly in Tadpole, the output is two contigs (the sum of which equals the length of the reference).

Upon inspection of the BBMap output, I find one ambiguity within the reads at the position where Tadpole has split the assembly into two contigs. About 50% of the reads have an "A" at one nucleotide position, while the other half have a "G". My guess is that this 'SNP' was introduced during my PCR amplification (prior to sequencing) or library-prep PCR. It doesn't suggest the presence of two viral genomes, because everything else is too homogeneous. In my mind, since there is good overlap on both sides of this position, I'd rather assemble a single contig and call an ambiguous base for this position: R.

Any idea on how to accomplish this, and do you agree with my thought? I tried adding "shave=f" as a flag, but still no luck. By the way, what does "f" stand for?


Old 01-17-2017, 03:34 PM   #50
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I don't really know where the high mutation rate comes from in supposedly isolate libraries. Unfortunately, there's no mechanism in Tadpole to fix them - it always halts at a branch and breaks into two contigs, even for a single SNP. You can adjust this with the "branchmult1" and "branchmult2" flags, though; dropping them to "bm1=6 bm2=2" can often substantially increase continuity. "bm1=6" means Tadpole will continue through a branch when the allele ratio of the next base is at least 6:1, rather than the default minimum of 20:1. With a 1:1 allele ratio like yours it will never continue, though. Instead, you might try a scaffolding or gap-filling program that uses paired-end reads to glue the two contigs back together. I have not written or tested one, but I know several are available.

"rinse" and "shave" can, in some cases, improve continuity by removing very low-depth errors, but that's not the case here. "shave=f" will not do anything: "t" and "f" stand for "true" and "false", and the default for "shave" is already "false". You can enable these with "rinse shave", "rinse=t shave=t", or "rinse=true shave=true", which are all equivalent.

Anyway - believe me, I would also prefer in your case for Tadpole to assemble this as a single contig, but it generally won't with such evenly-split alleles. Though maybe with "bm1=1.1 bm2=1.05" it would; I'm not really sure what would happen in that case.
Old 01-17-2017, 03:46 PM   #51
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Quote:
Originally Posted by Brian Bushnell View Post
I don't really know where the high mutation rate comes from in supposedly isolate libraries. Unfortunately, there's no mechanism in Tadpole to fix them [...]
Thank you for the input, Brian. I tried bm1=8 bm2=2 earlier, per the info on JGI's website. Unfortunately it still splits, though I haven't tried bm1=6 yet. Of course, this is an instance where we want the contigs to be merged, but in other cases such lenient branching parameters wouldn't be ideal - for instance, if in the above example there were more 'SNPs' on either side of this single mixed call. In that case, we may have a mixed reaction.

It's almost as if there needs to be an "if" rule that takes into account other nearby basecalls. One mixed basecall per, say, 1000 bp isn't so worrisome, but 1 per 10 certainly is. Perhaps this is what scaffolding does?

For now, I'm using CLC to de novo assemble. I'm on a trial license right now...my PI will have to eat the bill if this ends up being the best option. I have a feeling it's gonna cost a pretty penny x_x.

Jake
Old 01-27-2017, 01:12 AM   #52
seq4franken
Junior Member
 
Location: Germany

Join Date: Oct 2012
Posts: 5
Default multiple fastq.gz for Tadpole

Hello,
is it possible to specify multiple fastq.gz files (paired-end) as input for Tadpole?
I try to find a way to circumvent decompression and concatenation.

Thanks for any suggestion!
Alfons
Old 01-27-2017, 08:16 AM   #53
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Yes, the input can be comma-delimited. For error-correction:

Code:
tadpole.sh in=a.fq.gz,b.fq.gz out=a_ecc.fq.gz,b_ecc.fq.gz ecc
For assembly:

Code:
tadpole.sh in=a.fq.gz,b.fq.gz out=contigs.fa
Note, incidentally, that gzipped files can be concatenated without decompressing them first, and the result is still valid.
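A quick self-contained illustration of that property (file names invented): two gzip streams concatenated with plain cat decompress as one continuous stream:

```shell
printf 'read1\n' | gzip > a.fq.gz
printf 'read2\n' | gzip > b.fq.gz
cat a.fq.gz b.fq.gz > both.fq.gz   # no decompression or recompression needed
gzip -dc both.fq.gz                # prints read1, then read2
```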
Old 02-01-2017, 11:45 AM   #54
mcmc
Junior Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 8
Default

Hi Brian,
I have a very basic question about how merging works with the tadpole extend option. I have 2x250 HiSeq metagenome data, and about 75-80% of the reads merge by overlap after adapter trimming (median insert = 292, so I wasted a lot of sequencing, but that's another story). I've tried loose and vloose, which gained a couple % over default.

I'm wondering if the tadpole extend option might help merge a bit more, if we're just missing overlaps by a little bit. But I'm confused about how it works. Is the idea that it uses the rest of the dataset to find joins between my FW and RV read? Are those filled in with N's in a merged fragment (to give an insert size estimate) or do they actually get filled in with sequence from other reads (which I would not want in this case, since I have a complex metagenome)?

Thanks very much,
MC
Old 02-01-2017, 03:51 PM   #55
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

So, I would recommend a command like this:

Code:
bbmerge-auto.sh in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq prefilter=1 extend2=50 k=62 rem adapter=default
This operates in 3 phases.

1) Reads are processed and kmers are counted approximately, for the prefilter flag.
2) Reads are processed again and kmers are counted exactly, ignoring kmers that only occur once (to save memory).
3) Reads are processed again, and merging occurs:
3a) For each pair, merging is attempted.
3b) If a good overlap is not discovered, each read is extended by up to 50bp on the right end only, and merging is attempted again. If they still don't overlap, the extension is undone. Otherwise, they will be merged.

This means that with the flag "extend2=50" you could get up to ~100 bp in the middle that is created from the kmers in other reads. This is not really different from normal assembly; assuming you will assemble this data at some point, this process is going to occur eventually. It's true that this process could result in the formation of chimeric sequence, but even with a complex metagenome, Tadpole has a very low rate of chimeric sequence generation. It will basically only happen if you have two strains of a microbe, one that is over 20x as abundant as the other (you can adjust that '20' with the 'branchmult' flag); in that case, the middle of a pair of reads from the less-abundant strain might get filled with sequence from the more-abundant strain. But in practice this is very rare.

With a median insert of 292, it seems unlikely to me that 25% of the reads would be too far apart to overlap (>~490bp insert). Can you post the insert size distribution (you can get it with the flag ihist=ihist.txt)? It's possible there's another reason for the low merging rate, such as a high error rate, in which case error-correction would be more effective than extension. Of course, error-correction also poses the risk of creation of chimeric sequence, but again, at a very low rate.

Last edited by Brian Bushnell; 02-01-2017 at 03:54 PM.
Old 02-02-2017, 10:57 AM   #56
mcmc
Junior Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 8
Default

I tried the command you suggested; after first running out of memory (with -Xmx54g) and retrying with prealloc=t, I now get this error:

Code:
Initial size set to 19219231
Initial:
Ways=71, initialSize=19219231, prefilter=t, prealloc=1.0
Memory: max=55566m, free=54116m, used=1450m

Made prefilter:         hashes = 2       mem = 8.47 GB          cells = 36.38B          used = 11.568%
Estimated valid kmers:          1095698442
Prefilter time: 213.467 seconds.
After prefilter:
Memory: max=55566m, free=37242m, used=18324m

Estimated kmer capacity:        1228108886
After table allocation:
Memory: max=55613m, free=17712m, used=37901m

bbmerge-auto.sh: line 60: 35987 Killed
Is this still a memory issue?

Thanks for your help,
MC
Old 02-02-2017, 11:37 AM   #57
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

This appears to be getting killed by the scheduler because it's using more memory than it's allowed to, although usually that happens much faster so I can't be sure. I suggest you reduce the -Xmx argument. However, I don't know what you should reduce it to, since I don't know the memory limit for that node... but you could just try, say, -Xmx50g and see if that works. Otherwise, I suggest you talk to your sysadmin and ask how much memory you are allowed to use on that node, or why the job got killed.
Old 02-03-2017, 09:03 AM   #58
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Hi Brian,

I was wondering if there was a way to save the reads that are used during contig assembly, so that they can be mapped back to the generated consensus in a downstream step. I noticed the 'outd=<file>' flag will list discarded reads - anything similar for used reads?

I'm quite happy with Tadpole's ability to assemble PCR-amplified viral genomes post-Illumina sequencing; I've compared it to SPAdes and CLC. The one thing I'd like is to see which reads are getting used, discarded, etc.

Thanks!
Old 02-03-2017, 09:24 AM   #59
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Tadpole is a purely kmer-based assembler, so by the time assembly is being done, there is no knowledge of which reads were used; Tadpole can't provide that output directly. There are two ways to get it:

1) Map the reads to the assembly with BBMap and use the "outm" and "outu" flags to capture mapped and unmapped reads.
2) Use BBDuk to capture reads that share or do not share kmers with the assembly:

Code:
bbduk.sh in=reads.fq ref=contigs.fa k=31 mm=f out=unmatched.fq outm=matched.fq
Old 02-05-2017, 12:15 AM   #60
mcmc
Junior Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
This appears to be getting killed by the scheduler because it's using more memory than it's allowed to, although usually that happens much faster so I can't be sure. I suggest you reduce the -Xmx argument. However, I don't know what you should reduce it to, since I don't know the memory limit for that node... but you could just try, say, -Xmx50g and see if that works. Otherwise, I suggest you talk to your sysadmin and ask how much memory you are allowed to use on that node, or why the job got killed.
Indeed, reducing it to -Xmx50g solved the problem (on a 64g exclusive node).

I've tried to improve my merging rate above ~75% and as you suspected I think poor quality is limiting the merging. Here are the gory details:

First I adapter trimmed with bbduk ref=truseq.fa.gz ktrim=r mink=11 hdist=2 k=21 tpe tbo

Then bbmerge:
1. default: 76.661% join, median=292
2. adapter=default, qtrim2=r > 76.672% join, median=292, adapter expected=11900, adapter found=3188
3. -Xmx50g prealloc=t prefilter=1 extend2=50 k=62 rem adapter=default > 76.564%, median=293, adapter expected=11511, adapter found=2805
4. Focusing on the unmerged reads from (1):
  • 4a. qtrim2=r trimq=10,15,20 adapter=default > 34.5% (of unmerged) now join, median=322
  • 4b. qtrim2=r trimq=10,15,20,25 adapter=default forcetrimright2=25 > 41.15% (of unmerged) now join, median=321
  • 4c. ecct qtrim2=r adapter=default > 10.40% (of unmerged) now join, median=339
  • 4d. xloose > 33.6% (of unmerged) to join
5. Try sequential merging:
default bbmerge; feed unmerged to qtrim2=r, adapter=default, trimq=10,15,20; feed unmerged to forcetrimright2=50.
Result: total merge rate of 86.8%, leaving 3.57M/27M unmerged
6. Combining in one command did worse (order of operations different?): qtrim2=r, trimq=10,15,20, adapter=default, forcetrimright2=50, ecct > 73.8% total join

So it seems like quality trimming and even force trimming have the greatest effect?

Questions:
- what is the order of operations (if specified in a single command): qtrim2, forcetrimright2, ecct, adapter. Is it better to run iteratively on the unmerged reads?
- are trim operations performed even if the merge fails (i.e. are the unmerged output reads trimmed)?
- can you force trim from just the R2 read?

Any other suggestions for dealing with poor quality, primarily the last 50 bases of the R2 read (in a 2x250 HiSeq run)?

Thanks very much,
MC
Attached Images
insert_size_hist_default.png - insert size distribution (19.0 KB)
Tags: assembler, bbmap, bbmerge, bbnorm, bbtools, error correction, tadpole