#61
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,706
BBMerge can't force-trim just one read of the pair, sorry. You'd have to have the paired reads in two files, and run BBDuk on the r2 file only, with the flags "ftr2=50 ordered".
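For example, a minimal sketch of that r2-only trim (file names are placeholders):

Code:
bbduk.sh in=r2.fq out=r2_ftrimmed.fq ftr2=50 ordered

The "ordered" flag keeps the output reads in the same order as the input, so the trimmed file stays in sync with the untouched r1 file.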
"ecct" and extension use kmer data from all reads to do error-correction, so it's best to do that initially, using all reads: Code:
1) [adapter-trim as you have described]
2) bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged#.fq ecct extend2=50 k=62 rem qtrim2=r trimq=10,15,20,25
3) bbduk.sh in=unmerged2.fq out=ftrimmed2.fq ftr=50 minlen=0
   (make sure the out count is the same as the in count)
4) bbmerge.sh in1=unmerged1.fq in2=ftrimmed2.fq out=mergedB.fq outu=unmergedB#.fq ecct extend2=50 k=62 rem qtrim2=r trimq=10,15,20,25 extra=merged.fq

But! It's important to ask yourself whether you want to go to great effort to salvage the lowest-quality fraction of your data. I don't generally use "xloose" because it has a higher false-positive merge rate; in practice, I never go beyond "vloose". In this case, since it appears that sequencing of read 2 largely failed toward the end, it's worth some effort to salvage the data - otherwise you'll lose the longest-insert reads - but it's also worth considering having the data rerun, since it's clearly a sequencing failure.

I'd be interested in seeing the error rates of the reads from mapping (after adapter-trimming):

Code:
bbmap.sh in1=r1.fq in2=r2.fq ref=ref.fa mhist=mhist.txt qhist=qhist.txt bhist=bhist.txt
#62
Member
Location: Midwest, USA
Join Date: Jan 2016
Posts: 14
Thanks. The data are from a metagenome with no good reference, so mapping to get error rates is not an option. My impression was that most(?) 2x250 HiSeq runs suffered from problems on the reverse read (including the strange GC-divergence problem, which we also see, though maybe that's a separate issue from quality alone). Others have recommended hard-trimming the last 50 bp, which inspired my harsh trimming.
I'd be curious to know if JGI or others are having better success with 2x250. We may end up just using the forward read for the poor-quality/unmerged pairs, but given the insert-size distribution, it seemed worth trying to push the merging a bit further. I find it interesting that extension and xloose did not help as much as simple trimming, which also seems to me to be the safer/more conservative approach.

Thanks for your help, and for making these tools so modular and customizable.

MC
#63
Junior Member
Location: United States
Join Date: May 2011
Posts: 5
I'm new to de novo assembly, so sorry for the naive questions. I used Tadpole for a viral de novo assembly that had some novel genes inserted, and it worked very well. I was able to get ~13 contigs that represented my viral, viral/insert, and insert sequences. In order to get a single final unitig, I ended up using Minimus.

Should I experiment with k-mer size in Tadpole to see if I can get a single contig? Does a de novo assembly ever return a single unitig?
#64
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,706
De novo assembly can return single-contig assemblies, but it's data-dependent. Normally, if I pull the PhiX reads from a dataset, Tadpole will assemble them into a single, perfect contig. But a lot of viruses seem to mutate rapidly, and when different strains are sequenced together, Tadpole will break up the assembly at points where they differ. Other assemblers with more sophisticated bubble-collapsing may be able to create a more contiguous assembly in those cases; it's worth trying.
For Tadpole, it's beneficial to error-correct the reads (with Tadpole) prior to assembly. For maximizing contiguity of viral assemblies, more aggressive branch settings of "bm1=8 bm2=2" tend to be useful. Increasing the kmer length will also increase contiguity; I recommend around half the read length, but with high coverage (>200x), you can go up to 2/3 or even 3/4 of the read length. Also, if you have paired reads, you can merge them with BBMerge first and then use really long kmers. For example:

Code:
bbduk.sh in1=r1.fq in2=r2.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe
bbmerge.sh in1=trimmed.fq out=ecco.fq ecco vstrict
clumpify.sh in=ecco.fq out=clumped.fq unpair repair ecc passes=4
tadpole.sh in=clumped.fq out=ecct.fq ecc
bbmerge-auto.sh in=ecct.fq out=merged.fq outu=unmerged.fq rem extend2=100 k=62
tadpole.sh in=merged.fq,unmerged.fq out=contigs.fa k=93 bm1=8 bm2=2
#65
Member
Location: East Coast
Join Date: Jul 2016
Posts: 37
Hi Brian,
I'm running into a problem where Tadpole is creating two contigs, yet the generated contigs have identical overlapping ends. I've shared the FASTA files and reads via Google Drive. The shared sequence is "AAACTAAAGCTGAGGG"; it is present at the 3' end of the first contig and the 5' end of the second contig.

I ran Tadpole with parameters:
K = 31
Minimum contig length = 200 bp
Minimum coverage = 1
Minimum extension length = 1

My input was 2 x 150 bp reads that had been trimmed and merged (mix=t). Any idea why Tadpole won't combine these two sequences with these parameters? Even if I run just the contigs back through Tadpole, it will not combine them.

Thanks,
Jake
#66
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,706
Tadpole may generate contigs that overlap by up to K-1 at the ends when there is a branch in the de Bruijn graph. You can trim these with the flag "trimends=15", where the number should be set to K/2; that should prevent overlapping ends.
It won't combine them initially because there was a branch in the graph, and it won't combine them after building because the overlap is shorter than the kmer length. You might be able to get Minimus2 to combine them if you set the parameters correctly.
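With K=31, for example, the trim would look like this (a minimal sketch; file names are placeholders):

Code:
tadpole.sh in=merged.fq out=contigs.fa k=31 trimends=15

Since adjacent contigs can share up to K-1 bases at a branch point, trimming K/2 bases from each contig end removes the shared sequence from both sides of the junction.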
#67
Junior Member
Location: Europe
Join Date: May 2015
Posts: 8
Hi Brian,
I'm trying to run Tadpole through PBS on a 200-core NUMA machine, and I have requested 24 threads for the task. Even though the log reports that Tadpole is using 24 threads, java seems to spawn as many threads as there are cores - all 200 of them. I have never seen this problem with BBMap in the same environment. Can you please help with this? Many thanks in advance!

Code:
java -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

Executing assemble.Tadpole2 [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t]

Tadpole version 36.92
Using 24 threads.
Executing ukmer.KmerTableSetU [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t]

Set threads to 24
K was changed from 250 to 248
Initial size set to 14420122
Initial: Ways=541, initialSize=14420122, prefilter=t, prealloc=1.0
Memory: max=926102m, free=892279m, used=33823m

Initialization Time: 0.133 seconds.
#68
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,706
Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).
1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stack Overflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does. With "ways=541", memory is split into 541 partitions, which are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.

2) Can you give the result when you type "java -Xmx1g -version"? Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add them to the java command near the bottom of the shellscript). So, for example, you could try:

Code:
java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian
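To make that thread-count check concrete (12345 is a placeholder PID for the java process):

Code:
ps huH p 12345 | wc -l

This prints one line per thread, so the count includes JVM housekeeping threads (garbage collection, JIT compilation) along with Tadpole's worker threads.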
#69
Junior Member
Location: Europe
Join Date: May 2015
Posts: 8
This is my field selection in top: there is also the 'P' field, which shows the physical execution core for each thread (when top is in threaded view). What I see are 200 threads (this time it was 198, and I've noticed the number fluctuates a bit between 160 and 200) being distributed onto the 24 cores allocated for this job, meaning that the per-thread CPU consumption is less than optimal (~10%) and most of the threads end up in futex_wait. This is how it looks in threaded view: [screenshot not shown]

Code:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
#70
Junior Member
Location: Europe
Join Date: May 2015
Posts: 8
Ho-hum, it seems my reply from yesterday never made it to the thread (might be something with the way I linked in the images). Anyway - now I can report more details.
Thanks, Brian, for the reply!

In the attached top5.png you can see that java used 198 threads on the 24 allocated processors (cores, actually). I can report that the thread count varies slightly between runs and drops from the initial ~200 to some 150 during the run. But see below for more diagnostics.

Also, by pressing H, I am able to switch to threaded view and display each thread on a separate line (top4.png): here you can see that most of the threads are in the futex_wait state, and I'm wondering if this is the most optimal way to overtask cores (but I guess this is a Java issue).
Code:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

The '-XX:-UseSerialGC' parameter had no effect whatsoever on the thread count (or memory use).

Thanks again!
K
#71
Junior Member
Location: Europe
Join Date: May 2015
Posts: 8
For some reason, two of my posts did not show up, so here is just a brief answer (to see if it has to do with the attached figures):
1) Try pressing f in the main top window, which should give you the list of additional fields to display. Among them should be nTh - the number of threads. Also, pressing H toggles top between threaded and normal views.

2) I have:

Code:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

I've done some testing, and it seems that reducing GC threads to 1 works in lowering the thread count per allocated core (to roughly 3x the value of the threads= parameter).

Let's see if this post comes through...
#75
Junior Member
Location: United States
Join Date: May 2011
Posts: 5
Hi Brian,
I have used Tadpole to assemble my viral species of interest using MiSeq 2x150 PE reads with very good results. I've gotten contigs >100 kb in length and then have been able to merge them with Minimus. However, with the current strain I'm analyzing, the longest contig I get is ~3 kb. My workflow is to trim adapter sequences, filter contaminating sequences, merge reads, and then assemble with Tadpole. I was able to merge ~80% of my reads, with an average insert size of ~189 bp:

Code:
Pairs:               3761362
Joined:              3121029   82.976%
Ambiguous:            625762   16.637%
No Solution:           14571    0.387%
Too Short:                 0    0.000%
Adapters Expected:    711413    9.457%
Adapters Found:       708766    9.422%
Errors Corrected:     209716
Avg Insert:            188.9
Standard Deviation:     44.6
Mode:                    126

I ran Tadpole with k=90 and got the following results. Many of the contigs are <500 bp, which doesn't really help me much. Are there any diagnostics I can run on my starting data to see what the issue is?

Code:
Minimum    Number     Number     Total      Total      Scaffold
Scaffold   of         of         Scaffold   Contig     Contig
Length     Scaffolds  Contigs    Length     Length     Coverage
--------   ---------  ---------  ---------  ---------  --------
All           20,729     20,729  2,809,492  2,809,492   100.00%
50            20,729     20,729  2,809,492  2,809,492   100.00%
100           20,729     20,729  2,809,492  2,809,492   100.00%
250              220        220    137,244    137,244   100.00%
500              104        104     95,971     95,971   100.00%
1 KB              28         28     43,011     43,011   100.00%
2.5 KB             1          1      2,998      2,998   100.00%
#76
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,706
You might start by looking at a kmer frequency histogram to see whether there is a single clear peak, multiple peaks, or something else. Additionally, BLASTing your assembly against nt could be useful, to see whether it is perhaps largely contamination from, say, some bacteria (like the host). The most likely issue is that you're assembling some low-depth contaminant; another possibility is that the virus just mutates super-fast. It's also worth trying another assembler, such as SPAdes.
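A minimal sketch of generating that histogram (the input name is a placeholder):

Code:
khist.sh in=reads.fq hist=khist.txt

A strong peak at depth 1 with no higher-depth peak points toward low coverage or high error rates, while a clean peak at higher depth suggests the data should assemble.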
#77
Junior Member
Location: United States
Join Date: May 2011
Posts: 5
Hi Brian,
Thanks for your reply. I tried using SPAdes and got similar results. After speaking with my experimental colleague, I think I may have a mixture of viruses rather than a single one. I ran khist.sh on the input fastq file I used for Tadpole:

Code:
khist.sh in=ecco.fq hist=histogram.txt

There is a single, strong peak of singleton kmers. Maybe I'm a little slow on the uptake, but if the majority of the kmers are seen only once, this would indicate it would be hard to produce longer contigs, correct?
#78
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,706
In your case, you could have a mix of viruses with sequencing depth sufficiently low that none of them will assemble, or a single rapidly-mutating virus, or very low-quality data due to a problem with the sequencing machine... though I'd still suggest that host-genome contamination is a possibility.
#79
Member
Location: Louisiana
Join Date: Nov 2013
Posts: 20
Hi Brian,
Does Tadpole have a checkpoint option, or is this a possible feature to add? I ask because I am de novo assembling the axolotl genome using the available raw paired-end shotgun sequencing libraries (the mate-pair libraries have not been released to the public yet, nor has the shotgun plus mate-pair assembly, pending publication). My Tadpole assembly does not finish within 72 hours, which is the maximum walltime for the bigmem queue I am using (48 cores and 1.5 TB RAM). The goal of the assembly is not to have the largest N50 possible, but rather to map candidate baits (for a sequence capture experiment) developed from transcriptome transcripts and eliminate candidate baits that are non-specific and hybridize to multiple targets.

I first performed adapter and quality trimming on each of the 15 sets of raw paired-end reads with BBDuk 37.56:

Code:
~/bin/bbmap-37.56/bbduk.sh in1=SRR2027504_1.fastq.gz in2=SRR2027504_2.fastq.gz out1=SRR2027504_1_clean.fastq.gz out2=SRR2027504_2_clean.fastq.gz ref=~/bin/bbmap-37.56/resources/truseq.fa.gz ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=15 threads=8

I then ran Tadpole on the cleaned reads:

Code:
~/bin/bbmap-37.56/tadpole.sh -Xmx1400g threads=48 prealloc=t \
in1=SRR2027504_1_clean.fastq.gz,SRR2027505_1_clean.fastq.gz,SRR2027506_1_clean.fastq.gz,SRR2027507_1_clean.fastq.gz,SRR2027508_1_clean.fastq.gz,SRR2027509_1_clean.fastq.gz,SRR2027510_1_clean.fastq.gz,SRR2027511_1_clean.fastq.gz,SRR2027512_1_clean.fastq.gz,SRR2027513_1_clean.fastq.gz,SRR2027514_1_clean.fastq.gz,SRR2027515_1_clean.fastq.gz,SRR2027516_1_clean.fastq.gz,SRR2027517_1_clean.fastq.gz,SRR2027518_1_clean.fastq.gz \
in2=SRR2027504_2_clean.fastq.gz,SRR2027505_2_clean.fastq.gz,SRR2027506_2_clean.fastq.gz,SRR2027507_2_clean.fastq.gz,SRR2027508_2_clean.fastq.gz,SRR2027509_2_clean.fastq.gz,SRR2027510_2_clean.fastq.gz,SRR2027511_2_clean.fastq.gz,SRR2027512_2_clean.fastq.gz,SRR2027513_2_clean.fastq.gz,SRR2027514_2_clean.fastq.gz,SRR2027515_2_clean.fastq.gz,SRR2027516_2_clean.fastq.gz,SRR2027517_2_clean.fastq.gz,SRR2027518_2_clean.fastq.gz \
out=axolotl-contigs-k63.fasta mode=contig k=63

Thank you,
Gopo
#80
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,699
Have you tried other large-genome assemblers? ALLPATHS-LG?
Tags
assembler, bbmap, bbmerge, bbnorm, bbtools, error correction, tadpole |