![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Introducing Reformat, a fast read format converter | Brian Bushnell | Bioinformatics | 39 | 02-03-2021 11:35 AM |
Introducing BBNorm, a read normalization and error-correction tool | Brian Bushnell | Bioinformatics | 53 | 08-12-2020 12:51 PM |
Introducing BBMerge: A paired-end read merger | Brian Bushnell | Bioinformatics | 132 | 06-19-2020 03:15 AM |
Introducing BBMap, a new short-read aligner for DNA and RNA | Brian Bushnell | Bioinformatics | 24 | 07-07-2014 09:37 AM |
![]() |
|
Thread Tools |
![]() |
#41 | |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]() Quote:
Code:
zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l Code:
gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz Code:
gzip: x.gz: unexpected end of file |
|
![]() |
![]() |
![]() |
#42 |
Junior Member
Location: Taiwan Join Date: Feb 2013
Posts: 2
|
![]()
Hi gringer, thanks, I didn't realize the multiple fastq files wasn't concatenated properly before inputting them to tadpole. I'll fix it and try again.
|
![]() |
![]() |
![]() |
#43 |
Junior Member
Location: Bremen, Germany Join Date: Dec 2016
Posts: 1
|
![]()
I just saw this comment from Brian Bushnell:
P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm are a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K. I recently used Tadpole for contig extension. After reading this comment, I am not sure Tadpole is a reliable tool for this. I have full-length 16S rRNA gene sequences from clone libraries. I took the variable region of 16S rRNA gene sequences (first 300 nt nucleotides) and mapped on metagenome contigs which are taxonomically classified as taxa of interest. These metagenome contigs were assembled from randomly-sheared metagenomic fragment libraries (sequencing platfrom: HiSeq). 300 nt nucleotide of 16S rRNA gene sequence perfectly mapped on a contig. I then run Tadpole to extend this contig using all pair-end reads in this metagenome. I could extend the contig up to 6500 nucleotide. Is Tadpole reliable for such usages? |
![]() |
![]() |
![]() |
#44 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Since you are pulling the kmer information from randomly-sheared metagenomic reads, and using the variable region as a seed, it should be safe. However, there is a possibility of running into a tandem repeat region during extension when you give it a very high extension limit, so I recommend also adding the flag "ibb=f". It probably won't make any difference though.
|
![]() |
![]() |
![]() |
#45 | |
Member
Location: Germany Join Date: Feb 2016
Posts: 40
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#46 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Thanks for noting that; I've corrected it. Originally, for error-correction, you had to say "oute" for the output but I changed that a while ago so the current syntax is "out1" and "out2".
|
![]() |
![]() |
![]() |
#47 |
Member
Location: East Coast Join Date: Jul 2016
Posts: 39
|
![]()
Hi Brian,
I'm getting an error while trying to use tadpole. Tadpole.sh brings up the expected list of options, so I know that I've directed the terminal correctly. However, adding the in=<readsfile.fastq> out=<output directory> doesn't work - I get the following error: java -ea -Xmx15000m -Xms15000m -cp /Users/DJV/programs/bbmap/current/ assemble.Tadpole in=/Users/DJV/Desktop/IN/150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq out=/Users/DJV/Desktop/Out Executing assemble.Tadpole1 [in=/Users/DJV/Desktop/IN/150-P6-C9, assembled, to, HXB2, Nested, Amplified, Region.fastq, out=/Users/DJV/Desktop/Out] Exception in thread "main" java.lang.RuntimeException: Unknown parameter assembled at assemble.Tadpole.<init>(Tadpole.java:447) at assemble.Tadpole1.<init>(Tadpole1.java:66) at assemble.Tadpole.makeTadpole(Tadpole.java:77) at assemble.Tadpole.main(Tadpole.java:64) Here's my Java version information: java version "1.8.0_111" Java(TM) SE Runtime Environment (build 1.8.0_111-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode) Any idea why I'm getting these errors? Thanks Jake |
![]() |
![]() |
![]() |
#48 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Hi Jake,
The problem is the presence of spaces in your filepath. Try adding quotes: Code:
java -ea -Xmx15000m -Xms15000m -cp /Users/DJV/programs/bbmap/current/ assemble.Tadpole in="/Users/DJV/Desktop/IN/150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq" out=/Users/DJV/Desktop/Out/contigs.fa |
![]() |
![]() |
![]() |
#49 | |
Member
Location: East Coast Join Date: Jul 2016
Posts: 39
|
![]() Quote:
Reads are 150x2, and I generally have >100x coverage. I tried increasing to k=60, but that splits the input into many more contigs. My goal here is to extract a consensus sequence for the purpose of annotating ORFs. Previous to this, I mapped the reads to a reference using BBMap, and I'm running into problems with multiple contigs being generated by Tadpole. For instance, in BBMap one of my samples has reads mapped across the entirety of the reference. If I extract the reads that mapped to the reference, and use them to de novo assemble in Tadpole, the output is two contigs (the sum of which equal the length of the reference). Upon inspection of the BBMap file, I find one ambiguity within the reads at the region that Tadpole has split the assembly into two contigs. About 50% of the reads have an "A" in one nucleotide position, while the other half have a "G". My guess is that this 'SNP' was introduced during my PCR amplification (prior to sequencing) or library prep PCR. It doesn't suggest the presence of two viral genomes, because everything else is too homogenous. In my mind, since there is great overlap on both sides of this nucleotide call, I'd rather assemble a single contig, and call an ambiguous base for this position: R. Any idea on how to accomplish this, and do you agree with my thought? I tried adding "shave=f" as a flag, but still no luck. By the way, what does "f" stand for? ![]() |
|
![]() |
![]() |
![]() |
#50 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
I don't really know where the high mutation rate comes from in supposedly isolate libraries. Unfortunately, there's no mechanism in Tadpole to fix them - it always halts at a branch and breaks into two contigs, even if it is a single SNP. You can use the "branchmult1" and "branchmult2" flags to adjust this, though. Dropping them to "bm1=6 bm2=2" can often substantially increase continuity. "bm1=6" means it will continue rather than stopping at a branch where the allele ratio of the next base is at least 6:1 rather than at least 20:1 which is the default. With an allele ratio of 1:1 like you have it will never continue, though. Instead, you might try a scaffolding program and gap-filling program that will use paired-end reads to glue the two contigs back together. I have not written one or tested any, but I know several are available.
"rinse" and "shave" can, in some cases, improve continuity by remove very low-depth errors, but that's not the case here. "shave=f" will not do anything, though. "t" and "f" stand for "true" and "false". The default for "shave" is already "false". So, you can try enabling these with "rinse shave" or "rinse=t shave=t" or "rinse=true shave=true", which are all equivalent. Anyway - believe me, I would also prefer in your case for Tadpole to assemble this as a single contig, but it generally won't do that in the case of such evenly-split alleles. Though maybe with "bm1=1.1 bm2=1.05" it would; I'm not really sure what would happen in that case. |
![]() |
![]() |
![]() |
#51 | |
Member
Location: East Coast Join Date: Jul 2016
Posts: 39
|
![]() Quote:
It's almost as if there need to be an "if" rule added that takes into account other nearby basecalls. One mixed basecall per, say 1000 bp, isn't so worrisome, but 1 per 10 certainly is. Perhaps this is what scaffolding does? For now, I'm using CLC to de novo assemble. I'm on a trial license right now...my PI will have to eat the bill if this ends up being the best option. I have a feeling it's gonna cost a pretty penny x_x. Jake |
|
![]() |
![]() |
![]() |
#52 |
Junior Member
Location: Germany Join Date: Oct 2012
Posts: 5
|
![]()
Hello,
is it possible to specify multiple fastq.gz files (paired-end) as input for Tadpole? I try to find a way to circumvent decompression and concatenation. Thanks for any suggestion! Alfons |
![]() |
![]() |
![]() |
#53 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Yes, the input can be comma-delimited. For error-correction:
Code:
tadpole.sh in=a.fq.gz,b.fq.gz out=a_ecc.fq.gz,b_ecc.fq.gz ecc Code:
tadpole.sh in=a.fq.gz,b.fq.gz out=contigs.fa |
![]() |
![]() |
![]() |
#54 |
Member
Location: Midwest, USA Join Date: Jan 2016
Posts: 14
|
![]()
Hi Brian,
I have a very basic question about how merging works with the tadpole extend option. I have 2x250 HiSeq metagenome data, and about 75-80% of the reads merge by overlap after adapter trimming (median insert = 292, so I wasted a lot of sequencing, but that's another story). I've tried loose and vloose, which gained a couple % over default. I'm wondering if the tadpole extend option might help merge a bit more, if we're just missing overlaps by a little bit. But I'm confused about how it works. Is the idea that it uses the rest of the dataset to find joins between my FW and RV read? Are those filled in with N's in a merged fragment (to give an insert size estimate) or do they actually get filled in with sequence from other reads (which I would not want in this case, since I have a complex metagenome)? Thanks very much, MC |
![]() |
![]() |
![]() |
#55 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
So, I would recommend a command would like this:
Code:
bbmerge-auto.sh in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq prefilter=1 extend2=50 k=62 rem adapter=default 1) Reads are processed and kmers are counted approximately, for the prefilter flag. 2) Reads are processed again and kmers are counted exactly, ignoring kmers that only occur once (to save memory). 3) Reads are processed again, and merging occurs: 3a) For each pair, merging is attempted. 3b) If a good overlap is not discovered, each read is extended by up to 50bp on the right end only, and merging is attempted again. If they still don't overlap, the extension is undone. Otherwise, they will be merged. This means that with the flag "extend2=50" you could get up to ~100 bp in the middle that is created from the kmers in other reads. This is not really different from normal assembly; assuming you will assemble this data at some point, this process is going to occur eventually. It's true that this process could result in the formation of chimeric sequence, but even with a complex metagenome, Tadpole has a very low rate of chimeric sequence generation. It will basically only happen if you have two strains of a microbe, one that is over 20x as abundant as the other (you can adjust that '20' with the 'branchmult' flag); in that case, the middle of a pair of reads from the less-abundant strain might get filled with sequence from the more-abundant strain. But in practice this is very rare. With a median insert of 292, it seems unlikely to me that 25% of the reads would be too far apart to overlap (>~490bp insert). Can you post the insert size distribution (you can get it with the flag ihist=ihist.txt)? It's possible there's another reason for the low merging rate, such as a high error rate, in which case error-correction would be more effective than extension. Of course, error-correction also poses the risk of creation of chimeric sequence, but again, at a very low rate. Last edited by Brian Bushnell; 02-01-2017 at 02:54 PM. |
![]() |
![]() |
![]() |
#56 |
Member
Location: Midwest, USA Join Date: Jan 2016
Posts: 14
|
![]()
I tried the command you suggested, and after first running out of memory (with -Xmx54g) and retrying with prealloc=t, now I get this error:
Code:
Initial size set to 19219231 Initial: Ways=71, initialSize=19219231, prefilter=t, prealloc=1.0 Memory: max=55566m, free=54116m, used=1450m Made prefilter: hashes = 2 mem = 8.47 GB cells = 36.38B used = 11.568% Estimated valid kmers: 1095698442 Prefilter time: 213.467 seconds. After prefilter: Memory: max=55566m, free=37242m, used=18324m Estimated kmer capacity: 1228108886 After table allocation: Memory: max=55613m, free=17712m, used=37901m bbmerge-auto.sh: line 60: 35987 Killed Thanks for your help, MC |
![]() |
![]() |
![]() |
#57 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
This appears to be getting killed by the scheduler because it's using more memory than it's allowed to, although usually that happens much faster so I can't be sure. I suggest you reduce the -Xmx argument. However, I don't know what you should reduce it to, since I don't know the memory limit for that node... but you could just try, say, -Xmx50g and see if that works. Otherwise, I suggest you talk to your sysadmin and ask how much memory you are allowed to use on that node, or why the job got killed.
|
![]() |
![]() |
![]() |
#58 |
Member
Location: East Coast Join Date: Jul 2016
Posts: 39
|
![]()
Hi Brian,
I was wondering if there was a way to save the reads that are used during contig assembly, so that they can be mapped back to the generated consensus in a downstream step. I noticed the 'outd=<file>' flag will list discarded reads - anything similar for used reads? I'm quite happy with Tadpole's ability to assemble PCR-amplified viral genomes post-illumina sequencing. I've compared it to Spades and CLC. The one thing I'd like to see what reads are getting used, discarded, etc. Thanks! |
![]() |
![]() |
![]() |
#59 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Tadpole is a purely kmer-based assembler, so by the time assembly is being done, there is no knowledge of which reads were used. Therefore, Tadpole can't provide that output. So, there are two ways to get it.
1) Map the reads to the assembly with BBMap and use the "outm" and "outu" flags to capture mapped and unmapped reads. 2) Use BBDuk to capture reads that share or do not share kmers with the assembly: Code:
bbduk.sh in=reads.fq ref=contigs.fa k=31 mm=f out=unmatched.fq outm=matched.fq |
![]() |
![]() |
![]() |
#60 | |
Member
Location: Midwest, USA Join Date: Jan 2016
Posts: 14
|
![]() Quote:
I've tried to improve my merging rate above ~75% and as you suspected I think poor quality is limiting the merging. Here are the gory details: First I adapter trimmed with bbduk ref=truseq.fa.gz ktrim=r mink=11 hdist=2 k=21 tpe tbo Then bbmerge: 1. default: 76.661% join, median=292 2. adapter=default, qtrim2=r —> 76.672% join, median=292, adapter expected=11900, adapter found=3188 3. -Xmx50g prealloc=t prefilter=1 extend2=50 k=62 rem adapter=default —> 76.564%, median=293, adapter expected=11511, adapter found=2805 4. Focusing on the unmerged reads from (1):
default bbmerge; feed unmerged to qtrim2=r, adapter=default, trimq=10,15,20; feed unmerged to forcetrimright2=50. Result: total merge rate of 86.8%, leaving 3.57M/27M unmerged 6. Combining in one command did worse (order of operations different?): qtrim2=r, trimq=10,15,20, adapter=default, forcetrimright2=50, ecct —> 73.8% total join So it seems like quality trimming and even force trimming have the greatest effect? Questions: - what is the order of operations (if specified in a single command): qtrim2, forcetrimright2, ecct, adapter. Is it better to run iteratively on the unmerged reads? - are trim operations performed even if the merge fails (i.e. are the unmerged output reads trimmed)? - can you force trim from just the R2 read? Any other suggestions for dealing with poor quality, primarily the last 50 bases of the R2 read (in a 2x250 HiSeq run)? Thanks very much, MC |
|
![]() |
![]() |
![]() |
Tags |
assembler, bbmap, bbmerge, bbnorm, bbtools, error correction, tadpole |
Thread Tools | |
|
|