SEQanswers

Old 02-05-2017, 10:55 AM   #61
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

BBMerge can't force-trim just one read of the pair, sorry. You'd have to have the paired reads in two files, and run BBDuk on the r2 file only, with the flags "ftr2=50 ordered".

"ecct" and extension use kmer data from all reads to do error-correction, so it's best to do that initially, using all reads:

Code:
1) [adapter-trim as you have described]
2) bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged#.fq ecct extend2=50 k=62 rem qtrim2=r trimq=10,15,20,25
3) bbduk.sh in=unmerged2.fq out=ftrimmed2.fq ftr=50 minlen=0
(make sure the out count is the same as the in count)
4) bbmerge.sh in1=unmerged1.fq in2=ftrimmed2.fq out=mergedB.fq outu=unmergedB#.fq ecct extend2=50 k=62 rem qtrim2=r trimq=10,15,20,25 extra=merged.fq
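The parenthetical check in step 3 can be scripted. A minimal sketch for plain (uncompressed) 4-line-per-record FASTQ; the two tiny demo files here stand in for unmerged2.fq and ftrimmed2.fq, and gzipped files would need to be counted through zcat instead:

```shell
# Tiny demo files (2 reads each); substitute unmerged2.fq / ftrimmed2.fq in real use.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo_in.fq
printf '@r1\nACG\n+\nIII\n@r2\nTTT\n+\nIII\n'     > demo_out.fq

# Count records in an uncompressed FASTQ (4 lines per read).
count_reads() { awk 'END { print NR / 4 }' "$1"; }

[ "$(count_reads demo_in.fq)" = "$(count_reads demo_out.fq)" ] \
    && echo "read counts match" || echo "read count mismatch"
```

With "minlen=0" every input read should come out the other side, so the two counts should always match; a mismatch means something went wrong in the trimming step.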
When BBMerge fails to merge reads, it undoes all the operations (error-correction, extension, quality-trimming). Or at least, it's supposed to; I have never actually combined extension, error-correction, and iterative quality-trimming at the same time, so let me know if anything seems odd.

But! It's important to ask yourself whether you want to go to great effort to salvage the lowest-quality fraction of your data. I don't generally use "xloose" because it has a higher false-positive merge rate; in practice, I never go beyond "vloose". In this case, since sequencing of read 2 appears to have largely failed toward the end, it's worth making some effort to salvage those reads or else you'll lose the longest-insert pairs; but it's also worth considering having the data rerun, since this is clearly a sequencing failure. I'd be interested in seeing the error rates of the reads from mapping (after adapter-trimming):

Code:
bbmap.sh in1=r1.fq in2=r2.fq ref=ref.fa mhist=mhist.txt qhist=qhist.txt bhist=bhist.txt
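The histogram files that command writes are plain text and can be summarized without plotting. A hedged sketch: the exact column layout of mhist.txt varies by BBMap version (check the '#' header lines of your real file), so this demo uses a tiny made-up stand-in with a position column followed by rate columns:

```shell
# Toy stand-in for mhist.txt (fake numbers): position, then per-position rates.
printf '#Pos\tMatch\tSub\n1\t0.99\t0.01\n2\t0.97\t0.03\n' > mhist_demo.txt

# Print the mean of each numeric column, skipping '#' header lines.
awk -F'\t' '!/^#/ { for (i = 2; i <= NF; i++) sum[i] += $i; n++ }
            END   { for (i = 2; i <= NF; i++) printf "col%d mean: %g\n", i, sum[i] / n }' mhist_demo.txt
```

On a real mhist.txt the per-column means give the overall match/substitution/indel rates, which is exactly the error-rate summary being asked for here.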
Old 02-05-2017, 12:16 PM   #62
mcmc
Junior Member
 
Location: Midwest, USA

Join Date: Jan 2016
Posts: 7
Default

Thanks. The data are from a metagenome with no good reference, so mapping to get error rates is not an option. My impression was that most(?) 2x250 HiSeq runs suffered from problems on the reverse read (including the strange GC-divergence problem, which we also see, though that may be a separate issue from quality alone). Others have recommended hard-trimming the last 50 bp, which inspired my harsh trimming.

I'd be curious to know if JGI or others are having better success with 2x250.

We may end up just using the forward read for the poor-quality/unmerged pairs, but given the insert-size distribution, it seemed worth trying to push the merging a bit further. I find it interesting that extension and xloose did not help as much as simple trimming, which also seems safer and more conservative to me.

Thanks for your help, and for making these tools so modular and customizable.
MC
Old 02-08-2017, 10:57 PM   #63
indapa
Junior Member
 
Location: United States

Join Date: May 2011
Posts: 5
Default getting a single unitig from Tadpole

I'm new to de-novo assembly, so sorry for the naive questions. I used Tadpole for a de-novo assembly of a virus that had some novel genes inserted, and it worked very well. I was able to get ~13 contigs that represented my viral, viral/insert, and insert sequences. In order to get a single final unitig, I ended up using Minimus.

Should I experiment with k-mer size in Tadpole to see if I can get a single contig? Does a de-novo assembly ever return a single unitig?
Old 02-09-2017, 10:12 AM   #64
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

De-novo assembly can return single-contig assemblies, but it's data-dependent. Normally, if I pull the PhiX reads from a dataset, Tadpole will assemble them into a single, perfect contig. But it seems that a lot of viruses mutate rapidly, and when different strains are sequenced together, Tadpole will break up the assembly at points where they differ. Other assemblers with more sophisticated bubble-collapsing may be able to create a more contiguous assembly in those cases; it's worth trying.

For Tadpole, it's beneficial to error-correct the reads (with Tadpole) prior to assembly. For maximizing contiguity of viral assemblies, more aggressive branch settings such as "bm1=8 bm2=2" tend to be useful. Increasing the kmer length will also increase contiguity; I recommend around half the read length, but with high coverage (>200x) you can go up to 2/3 or even 3/4 of the read length. Also, if you have paired reads, you can merge them with BBMerge first and then use really long kmers. For example:

Code:
bbduk.sh in1=r1.fq in2=r2.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe
bbmerge.sh in1=trimmed.fq out=ecco.fq ecco vstrict
clumpify.sh in=ecco.fq out=clumped.fq unpair repair ecc passes=4
tadpole.sh in=clumped.fq out=ecct.fq ecc
bbmerge-auto.sh in=ecct.fq out=merged.fq outu=unmerged.fq rem extend2=100 k=62
tadpole.sh in=merged.fq,unmerged.fq out=contigs.fa k=93 bm1=8 bm2=2
Depending on your read lengths, the exact values for k can be increased. For example, if 80% of your reads merged, and the merged reads end up being 300bp long on average, you could try k=200 for the final assembly.
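The rule of thumb above can be captured in a tiny helper; the function name and thresholds here are just an illustration of the guidance in this post, not a BBTools feature:

```shell
# Suggest a Tadpole kmer length from average (possibly merged) read length and
# coverage: about 1/2 of read length normally, up to ~2/3 above 200x coverage.
suggest_k() {
    local len=$1 cov=$2
    if [ "$cov" -gt 200 ]; then
        echo $(( len * 2 / 3 ))
    else
        echo $(( len / 2 ))
    fi
}

suggest_k 300 250   # 300bp merged reads at >200x coverage -> prints 200
```

This reproduces the example in the paragraph above: 300bp merged reads with high coverage suggest trying around k=200 for the final assembly.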
Old 02-28-2017, 06:15 AM   #65
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Hi Brian,

I'm running into a problem where Tadpole is creating two contigs, yet the generated contigs have identical overlapping ends. I've shared the FASTA files and reads via Google Drive. The shared sequence is "AAACTAAAGCTGAGGG" and it is present at the 3' end of the first contig, and the 5' end of the second contig.

I ran Tadpole with parameters:
K=31
Minimum contig length = 200bp
Minimum Coverage = 1
Minimum extension length = 1

My input was 2 x 150 bp reads that had been trimmed and merged (mix=t). Any idea why Tadpole won't combine these two sequences with these parameters? Even if I run just the contigs back into Tadpole, it will not combine them.

Thanks
Jake
Old 02-28-2017, 10:31 AM   #66
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Tadpole may generate contigs that overlap by up to K-1 bases at their ends when there is a branch in the De Bruijn graph. You can trim these with the flag "trimends=15", where the number should be set to about K/2; that should prevent overlapping ends.

It won't combine them during assembly because there was a branch in the graph, and it won't combine them when the contigs are fed back in because the overlap is shorter than the kmer length. You might be able to get Minimus2 to combine them if you set its parameters correctly.
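The end-overlap is easy to confirm by hand. A sketch in bash (a hypothetical helper, not part of BBTools) that reports the longest suffix of one sequence matching a prefix of the other; the bases flanking the shared 16-mer from the post above are made up for the demo:

```shell
# Longest overlap between the end of sequence a and the start of sequence b.
overlap_len() {
    local a=$1 b=$2 n max=${#a}
    [ "${#b}" -lt "$max" ] && max=${#b}
    for (( n = max; n > 0; n-- )); do
        # Compare the last n bases of a with the first n bases of b.
        if [ "${a: -n}" = "${b:0:n}" ]; then echo "$n"; return; fi
    done
    echo 0
}

overlap_len "GGTTAAACTAAAGCTGAGGG" "AAACTAAAGCTGAGGGCCTT"   # prints 16
```

An overlap just under K between two contig ends is consistent with the branch behavior described above, and trimming K/2 from each end removes it.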
Old 03-23-2017, 12:40 PM   #67
Kristian
Junior Member
 
Location: Europe

Join Date: May 2015
Posts: 8
Default Tadpole not honoring threads parameter?

Hi Brian,

I'm trying to run Tadpole through PBS on a 200-core NUMA machine and have requested 24 threads for the task. Even though the log file reports that Tadpole is using 24 threads, java seems to spawn all 200 of them. I have never seen this problem with BBMap in the same environment. Can you please help with this? Many thanks in advance!

Code:
java -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t
Executing assemble.Tadpole2 [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t]

Tadpole version 36.92
Using 24 threads.
Executing ukmer.KmerTableSetU [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t]

Set threads to 24
K was changed from 250 to 248
Initial size set to 14420122
Initial:
Ways=541, initialSize=14420122, prefilter=t, prealloc=1.0
Memory: max=926102m, free=892279m, used=33823m

Initialization Time:            0.133 seconds.
however, top says 200 for the number of threads to the java process...
Old 03-23-2017, 01:46 PM   #68
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).

1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.
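For convenience, that ps check can be wrapped in a tiny function (procps-style BSD options: h suppresses the header, u selects the user-oriented format, H prints one line per thread):

```shell
# Count the threads of a running process by PID, one ps line per thread.
nthreads() { ps huH p "$1" | wc -l; }

nthreads $$   # thread count of the current shell itself
```

For a Tadpole run, the PID could come from something like "pgrep -f assemble.Tadpole".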

With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.

2) Can you give the result when you type "java -Xmx1g -version"?

Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:

java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian
Old 03-23-2017, 03:44 PM   #69
Kristian
Junior Member
 
Location: Europe

Join Date: May 2015
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).

1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.
On my system (SUSE-based) I can add additional fields to top by pressing f and then selecting from a list; one of them is nThr, which displays the number of threads. Also, by pressing 'H' in the main top view, I can toggle between the 'threaded' and 'process' views, where the threaded view shows all threads of a process.

this is my field selection in top:


There is also the 'P' field, which shows the physical execution core for each thread (when top is in threaded view). What I see are 200 threads (this time it was 198; I've noticed the number fluctuates between 160 and 200) distributed onto the 24 cores allocated for this job, meaning that per-thread CPU consumption is far from optimal (~10%) and most of the threads end up in futex_wait.

This is how it looks in threaded view:


Quote:
Originally Posted by Brian Bushnell View Post
With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory).
Another curiosity - I'm unable to run a k=248 assembly with 900 gigs of memory, would this be related to the 'ways' anomaly?

Quote:
While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.
The load does add up to ~2500%, which is what's expected, but I'm a bit concerned about the efficiency of java spawning multiple threads per core, each with a 10% load rather than 100%. Plus, I'm not sure I've noticed the same behavior with BBMap, so I'm reporting it out of curiosity.

Quote:
2) Can you give the result when you type "java -Xmx1g -version"?
Code:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
Quote:
Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:

java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian
Thanks - I'll try and play with the GC parameters to see how they behave!
Old 03-24-2017, 11:22 AM   #70
Kristian
Junior Member
 
Location: Europe

Join Date: May 2015
Posts: 8
Default

Ho-hum, it seems my reply from yesterday never made it to the thread (it might have something to do with the way I linked in the images). Anyway - now I can report more details.

Thanks Brian for the reply!

Quote:
Originally Posted by Brian Bushnell View Post
Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).
I forced ways to 25 manually, as you suggested. It did not have any effect on the memory footprint that I noticed. On the memory note - can you clarify how Tadpole uses the memory made available through the JVM? All of my jobs with k=248 and 1200g allocated through PBS and in the JVM exhausted memory and were killed by PBS. But when I reduced k to 196, gave the JVM 700g, and allocated 800g in PBS, the job finished. Or is the memory requirement quadratic with k?

Quote:
1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.
My top (SUSE-flavoured Linux, so nothing exotic, I guess) can display additional columns: press f, then select the extra columns; one of them is nTh - the number of threads. This is what my top looks like:

top5.png

You can see that java used 198 threads on the 24 allocated processors (cores, actually). I can report that the thread count varies slightly between runs and drops from the initial ~200 to some 150 during the run. But see below for more diagnostics.

Also, by pressing H, I am able to switch to threaded view and display each thread in a separate line:

top4.png

Here you can see that most of the threads are in the futex_wait state, and I'm wondering whether this is the optimal way to overtask cores (but I guess that is a Java issue).

Quote:
With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.
Yes, it seems the GC is the cause of the thread zoo, but one thing I have noticed is that the more threads there are, the faster memory gets allocated (and even over-allocated, causing PBS to kill the job). I would not expect this if memory were allocated only per proper worker thread.

Quote:
2) Can you give the result when you type "java -Xmx1g -version"?
Code:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
Quote:
Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:

java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian
I have tried it, and it seems that forcing the parallel and concurrent GC threads to 1 reduces the thread count to approximately 3x the number set with the threads= parameter (in this case ~60). This also seems to conserve some memory, or at least slow the growth of the process's resident memory footprint (but I have not really tested this; it was just my impression).

the '-XX:-UseSerialGC' parameter had no effect whatsoever on the thread count (and memory use).

Thanks again!
K
Old 03-25-2017, 08:49 AM   #71
Kristian
Junior Member
 
Location: Europe

Join Date: May 2015
Posts: 8
Default

For some reason, two of my posts did not show up, so here's just a brief answer (to see whether it has to do with the attached figures):

1) try pressing f in the main top window, which should give you the list of additional fields to display. Among them should be the nTh - number of threads. Also, pressing H toggles top between thread and normal views.

2) I have:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

I've done some testing and it seems that reducing GC threads to 1 works in lowering the thread count per allocated core (to roughly 3x the value of threads= parameter).

Let's see if this post comes through...
Old 03-26-2017, 04:29 AM   #72
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,547
Default

Quote:
Originally Posted by Kristian View Post

Let's see if this post comes through...
Your last posts were caught by the moderation filter. I have approved them.
Old 03-27-2017, 10:18 AM   #73
Kristian
Junior Member
 
Location: Europe

Join Date: May 2015
Posts: 8
Default

Quote:
Originally Posted by GenoMax View Post
Your last posts were caught by the moderation filter. I have approved them.
Thanks! What was the issue (to avoid them in the future)?
Old 03-27-2017, 10:54 AM   #74
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,547
Default

Quote:
Originally Posted by Kristian View Post
Thanks! What was the issue (to avoid them in the future)?
Probably the attachments.
Old 04-04-2017, 10:23 AM   #75
indapa
Junior Member
 
Location: United States

Join Date: May 2011
Posts: 5
Default

Hi Brian,

I have used Tadpole to assemble my viral species of interest using MiSeq 2x150 PE reads, with very good results. I've gotten contigs >100kb in length and have then been able to merge them with Minimus. However, with the current strain I'm analyzing, the longest contig I get is ~3kb. My workflow is: trim adapter sequences, filter contaminating sequences, merge reads, and then assemble with Tadpole.

I was able to merge ~80% of my reads, with an average insert size of ~189:

Pairs: 3761362
Joined: 3121029 82.976%
Ambiguous: 625762 16.637%
No Solution: 14571 0.387%
Too Short: 0 0.000%
Adapters Expected: 711413 9.457%
Adapters Found: 708766 9.422%
Errors Corrected: 209716

Avg Insert: 188.9
Standard Deviation: 44.6
Mode: 126

I ran Tadpole with k=90 and got the following results. Many of the contigs are <500bp, which doesn't really help me much. Are there any diagnostics I can run on my starting data to see what the issue is?

Min Scaffold  Number of   Number of  Total Scaffold  Total Contig  Scaffold Contig
Length        Scaffolds   Contigs    Length          Length        Coverage
------------  ----------  ---------  --------------  ------------  ---------------
All               20,729     20,729       2,809,492     2,809,492          100.00%
50                20,729     20,729       2,809,492     2,809,492          100.00%
100               20,729     20,729       2,809,492     2,809,492          100.00%
250                  220        220         137,244       137,244          100.00%
500                  104        104          95,971        95,971          100.00%
1 KB                  28         28          43,011        43,011          100.00%
2.5 KB                 1          1           2,998         2,998          100.00%
Old 04-04-2017, 05:01 PM   #76
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

You might start by looking at a kmer-frequency histogram to see whether there is a single clear peak, multiple peaks, or something else. Additionally, BLASTing your assembly against nt could be useful, to see whether it is perhaps largely contamination from, say, some bacterium (or the host). The most likely issue is that you're assembling some low-depth contaminant; another possibility is that the virus just mutates super-fast. It's also worth trying another assembler, such as SPAdes.
Old 04-06-2017, 09:14 AM   #77
indapa
Junior Member
 
Location: United States

Join Date: May 2011
Posts: 5
Default

Hi Brian,

Thanks for your reply. I tried SPAdes and got similar results. After speaking with my experimental colleague, I think I may have a mixture of viruses rather than a single one. I ran khist.sh on the input fastq file I used for Tadpole:

khist.sh in=ecco.fq hist=histogram.txt

There is a single, strong peak of singleton kmers. Maybe I'm a little slow on the uptake, but if the majority of the kmers are seen only once, this would indicate it would be hard to produce longer contigs, correct?
Old 04-06-2017, 10:12 AM   #78
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by indapa View Post
There is a single, strong peak of singleton kmers. Maybe I'm a little slow on the uptake, but if the majority of the kmers are seen only once, this would indicate it would be hard to produce longer contigs, correct?
That's correct. However, most datasets have a strong peak for singleton kmers, due to sequencing error. It's just that typically, for isolates, there is also an obvious higher peak (at, say, 40x when you have 40-fold genomic coverage).

In your case, you could have a mix of viruses with sequencing depth so low that none of them will assemble, or a single rapidly-mutating virus, or very low-quality data due to a problem with the sequencing machine... though I'd still suggest that host-genome contamination is a possibility.
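A quick way to check for a second (genomic) peak besides the singleton peak is a one-line awk pass. A hedged sketch, assuming the histogram has whitespace-separated depth and count columns with '#' header lines (verify the column order against your actual khist output); a toy histogram stands in for real data here:

```shell
# Toy kmer-frequency histogram: large error peak at depth 1, genomic peak near 40.
printf '#Depth\tCount\n1\t90000\n2\t800\n39\t1200\n40\t1500\n41\t1100\n' > khist_demo.txt

# Report the highest peak while ignoring depth 1 (the error-kmer peak); real error
# tails extend past depth 1, so a stricter cutoff such as $1 > 3 may be appropriate.
awk '!/^#/ && $1 > 1 && $2 > max { max = $2; depth = $1 }
     END { print "secondary peak at depth " depth " (count " max ")" }' khist_demo.txt
```

Here it reports the depth-40 peak; on an isolate with ~40x genomic coverage, that is roughly the peak you would hope to see alongside the singleton peak.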
Old 10-13-2017, 01:37 AM   #79
Gopo
Member
 
Location: Louisiana

Join Date: Nov 2013
Posts: 15
Default

Hi Brian,

Does Tadpole have a checkpoint option, or is this a possible feature to add? I ask because I am de novo assembling the axolotl genome using the available raw paired-end shotgun sequencing libraries (the mate-pair libraries, and the shotgun plus mate-pair assembly, have not been released to the public yet, pending publication). My Tadpole assembly does not finish within 72 hours, which is the maximum walltime for the bigmem queue I am using (48 cores and 1.5TB RAM).

The goal of the assembly is not to have the largest N50 possible, rather to map candidate baits (for a sequence capture experiment) developed from transcriptome transcripts and eliminate candidate baits that are non-specific and hybridize to multiple targets.

I first performed adapter and quality trimming on each of the 15 sets of raw paired-end reads with BBDuk 37.56:
Code:
~/bin/bbmap-37.56/bbduk.sh in1=SRR2027504_1.fastq.gz in2=SRR2027504_2.fastq.gz out1=SRR2027504_1_clean.fastq.gz out2=SRR2027504_2_clean.fastq.gz ref=~/bin/bbmap-37.56/resources/truseq.fa.gz ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=15 threads=8
Then I used Tadpole 37.56 to de novo assemble them:
Code:
~/bin/bbmap-37.56/tadpole.sh -Xmx1400g threads=48 prealloc=t \
in1=SRR2027504_1_clean.fastq.gz,SRR2027505_1_clean.fastq.gz,SRR2027506_1_clean.fastq.gz,SRR2027507_1_clean.fastq.gz,SRR2027508_1_clean.fastq.gz,SRR2027509_1_clean.fastq.gz,SRR2027510_1_clean.fastq.gz,SRR2027511_1_clean.fastq.gz,SRR2027512_1_clean.fastq.gz,SRR2027513_1_clean.fastq.gz,SRR2027514_1_clean.fastq.gz,SRR2027515_1_clean.fastq.gz,SRR2027516_1_clean.fastq.gz,SRR2027517_1_clean.fastq.gz,SRR2027518_1_clean.fastq.gz \
in2=SRR2027504_2_clean.fastq.gz,SRR2027505_2_clean.fastq.gz,SRR2027506_2_clean.fastq.gz,SRR2027507_2_clean.fastq.gz,SRR2027508_2_clean.fastq.gz,SRR2027509_2_clean.fastq.gz,SRR2027510_2_clean.fastq.gz,SRR2027511_2_clean.fastq.gz,SRR2027512_2_clean.fastq.gz,SRR2027513_2_clean.fastq.gz,SRR2027514_2_clean.fastq.gz,SRR2027515_2_clean.fastq.gz,SRR2027516_2_clean.fastq.gz,SRR2027517_2_clean.fastq.gz,SRR2027518_2_clean.fastq.gz \
out=axolotl-contigs-k63.fasta mode=contig k=63
If a checkpoint option is not possible with Tadpole, can you recommend a de novo assembler that supports checkpoint?

Thank you,
Gopo
Old 10-13-2017, 10:34 AM   #80
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,547
Default

Quote:
The axolotl genome consists of 14 chromosome pairs (2N = 28) and estimates of its physical size range from 21–48 gigabases
I am not sure whether Tadpole is designed to assemble a genome of this size. Curious to see what @Brian has to say.

Have you tried other large genome assemblers? ALLPATHS-LG?
Tags
assembler, bbmap, bbmerge, bbnorm, bbtools, error correction, tadpole
