SEQanswers

02-21-2014, 01:36 PM   #1
Genomics101 (Member, Maryland, USA)

De novo assembly using Velvet: any idea why such small Kmers with long reads?

Greetings!

I am doing de novo genome assembly of a 23Mb genome using Velvet 1.2.10 and quality-trimmed MiSeq (Illumina) reads that average about 180bp in length. I have assembled different individuals of this species before with 100bp reads, and the kmer size for the best N50s always came in around 61, with very good max contig sizes. With this assembly, however, the optimal kmer sizes (by N50) are coming out really low, in the low/mid 30s. Velvet estimated the average kmer coverage at 23X.

Results here (kmer size is on the x-axis, the left y-axis is the max contig size ("max," red line), and the right y-axis is the N50 (blue line)):

I am concerned about how good an assembly would be for such a large genome with such a small kmer, and I also wonder why, with longer reads, I need smaller kmers.

Thanks!

Last edited by Genomics101; 02-21-2014 at 03:51 PM. Reason: spelling, added kmer coverage detail
02-21-2014, 02:45 PM   #2
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Your coverage is too low. At 23x coverage with 100bp reads and k=60, you'll only get a kmer depth of about 40% of that, or roughly 10, which will give a fragmented assembly that misses low-depth areas and makes it hard to distinguish valid kmers from error kmers.

Edit:

Oops, I see now that the old assemblies were 100bp and the new ones are 180bp. Still, the point remains that as you increase K you decrease kmer depth, and you appear to have too little data for that to help. What are the insert size and quality distributions? I've seen a lot of MiSeq libraries made with insert sizes shorter than the read length, so you might also consider kmer-based adapter trimming before quality trimming.
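
For reference, the Velvet manual's relation between base coverage C, read length L, and kmer coverage Ck is:

Ck = C * (L - k + 1) / L

With the numbers above, 23 * (100 - 60 + 1) / 100 = ~9.4, which is the "about 40%, or roughly 10" figure; with 180bp reads the same k costs proportionally less depth.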

Last edited by Brian Bushnell; 02-21-2014 at 02:57 PM.
02-21-2014, 03:40 PM   #3
AdrianP (Senior Member, Ottawa)

Quote:
Originally Posted by Genomics101
Velvet estimated the coverage averaging 23X.
Are you talking about kmer coverage or nucleotide coverage?

In other words, what is the cov value of your biggest contigs? (Those values show kmer coverage.)
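
(For illustration, Velvet writes contig headers like >NODE_42_length_5830_cov_23.1; the numbers here are made up, but the cov_ field is the kmer coverage of that node.)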
02-21-2014, 03:41 PM   #4
Genomics101 (Member, Maryland, USA)

Kmer coverage
02-21-2014, 03:45 PM   #5
AdrianP (Senior Member, Ottawa)

Quote:
Originally Posted by Genomics101
Kmer coverage
Okay, then what the person in post #2 said doesn't apply, because they assumed nucleotide coverage. You need to go to higher kmers: a kmer coverage higher than 20 is a waste, so aim for between 10 and 20.

Use VelvetOptimiser and try kmers up to 160-180; you might see a second peak in N50, which is common.
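
A minimal VelvetOptimiser invocation along these lines might work (file names are placeholders, and the flags are from the tool's usage at the time, so check VelvetOptimiser.pl --help):

CMD:
VelvetOptimiser.pl -s 121 -e 179 -x 10 -f '-shortPaired -fastq -separate reads_R1.fq reads_R2.fq'

-s and -e set the kmer search range, -x the step, and -f passes the velveth input specification. Note that Velvet must be compiled with MAXKMERLENGTH at least as large as the top of the range.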
02-21-2014, 03:46 PM   #6
Genomics101 (Member, Maryland, USA)

@Brian Bushnell The mean insert size is 407 with an SD of 130, and the 23X is kmer coverage; the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all reads. There are no adapters, as they are removed by the Illumina 1.9 pipeline.
02-21-2014, 03:49 PM   #7
Genomics101 (Member, Maryland, USA)

@AdrianP Thanks for your reply, but I actually did kmers (with a larger gap between them) all the way up to 191 in the initial analysis:
02-21-2014, 03:50 PM   #8
AdrianP (Senior Member, Ottawa)

Quote:
Originally Posted by Genomics101
@Brian Bushnell The mean insert size is 407 with an SD of 130, and the 23X is kmer coverage; the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all reads. There are no adapters, as they are removed by the Illumina 1.9 pipeline.
Okay, with a base coverage of 43X you do not want Velvet; de Bruijn graph assemblers need high coverage for repeat resolution.

My advice is to use SeqPrep to merge your reads. After running it you should have three files: forward, reverse, and merged. Feed those to the MIRA assembler, which is an OLC (overlap-layout-consensus) assembler, and I expect you will get better results.
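
A sketch of that SeqPrep step (file names are placeholders):

CMD:
SeqPrep -f reads_R1.fastq.gz -r reads_R2.fastq.gz -1 unmerged_R1.fastq.gz -2 unmerged_R2.fastq.gz -s merged.fastq.gz

-f/-r are the input pair, -1/-2 receive the reads that could not be merged, and -s receives the merged reads.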
02-21-2014, 03:52 PM   #9
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Quote:
Originally Posted by Genomics101
@Brian Bushnell The mean insert size is 407 with an SD of 130, and the 23X is kmer coverage; the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all reads. There are no adapters, as they are removed by the Illumina 1.9 pipeline.
Hmm, well in that case I am surprised that the N50 is best at such a low K, unless the coverage is highly non-uniform, as can happen when data is over-amplified. Do you have a kmer-depth histogram?
02-21-2014, 03:57 PM   #10
AdrianP (Senior Member, Ottawa)

Quote:
Originally Posted by Brian Bushnell
Hmm, well in that case I am surprised that the N50 is best at such a low K, unless the coverage is highly non-uniform, as can happen when data is over-amplified. Do you have a kmer-depth histogram?
To obtain that, I can recommend:
http://kmergenie.bx.psu.edu/

CMD:
./kmergenie --diploid -k <higher_kmer> -e 1 -l <lower_kmer> -t <cpu_threads> -o <output_name> <read_location>

Start with a higher kmer of 101 and a lower kmer of 41, and see what the graphs show.
02-21-2014, 04:16 PM   #11
Genomics101 (Member, Maryland, USA)

Quote:
Originally Posted by Brian Bushnell
Hmm, well in that case I am surprised that the N50 is best at such a low K, unless the coverage is highly non-uniform, as can happen when data is over-amplified. Do you have a kmer-depth histogram?
02-21-2014, 04:31 PM   #12
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Ah, what I actually mean is a graph for a fixed kmer length (say, 31) where the x-axis is depth and the y-axis is the number of kmers found at that depth, both log-scaled. Ideally, you should have a sharp peak at some depth (maybe 40) that drops off dramatically to the left and right.

My attachment shows the 31-mer frequency histogram for synthetic E. coli reads. You can see a main peak at about 200 and a few repeat peaks after that. If the data were real and had uneven coverage, there would be a broad peak rather than a sharp one.

FYI, I generated this with the 'khist.sh' script in the BBMap package and plotted it in Excel.
[Attached image: ecoli.jpg, 31-mer frequency histogram]
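
Something like the following should reproduce such a histogram; the flag names are from memory of the BBMap documentation of the time, so check khist.sh --help:

CMD:
khist.sh in=reads.fq khist=khist.txt

The output is a two-column table of depth versus the number of kmers at that depth (k defaults to 31), suitable for plotting on log-log axes.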

Last edited by Brian Bushnell; 02-21-2014 at 04:33 PM.
02-21-2014, 04:39 PM   #13
Genomics101 (Member, Maryland, USA)

I also have the option of using the longer, more uniform untrimmed reads, but their quality is pretty questionable:



I tried doing an assembly with these and got better N50s at very high kmers (~99-135), but the kmer depth has a weird bell-curve relationship with kmer size rather than the expected monotonic one. The lowest kmer I tried (45) had a coverage of only 1.2, and the kmer coverage peaked at 41.5X at kmer = 93 (also a relatively good N50 at ~20kb). Also, I am very wary of using data with so many errors.
02-21-2014, 04:48 PM   #14
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Read length uniformity shouldn't matter to Velvet. Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10. Excessive trimming can also cause biases.
02-21-2014, 04:54 PM   #15
Genomics101 (Member, Maryland, USA)

Quote:
Originally Posted by Brian Bushnell
Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10.
Thanks. I didn't trim by length but by quality (Q30 cutoff), and most of the reads just came out at around that length. But less strict trimming may be the answer.

Since I have your very helpful attention here: as I am doing the assembly with the untrimmed reads, do you have a suggestion for a good way to assess how the sequencing errors are affecting the accuracy of the contigs? Should I just BLAST a few regions I have done with Sanger sequencing?
02-21-2014, 05:15 PM   #16
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Well... if there is a reference, I would compare them to the reference. With or without a reference, I think Quast is fairly useful: it tells you the number of predicted genes and so forth in your assembly, and generally, the more long genes and the fewer short genes you have, the higher the assembly quality. It is most useful if you have more than one assembly to compare. If you have a reference, it will also tell you the number of misassemblies.

Similarly, if you map all of the source reads to the assembly, the number of non-match symbols (insertions, deletions, substitutions) correlates inversely with assembly quality, while % of reads mapped, % paired, and % unambiguous mappings correlate directly. We use BBMap for that purpose here, though again, it's better for determining the relative quality of two assemblies than the absolute quality of one. You can also try ALE, which attempts to judge the probability of an assembly being correct given the reads, based on mismatches, deviation of paired insert sizes from the expected distribution, and coverage depth.
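
As a sketch, that mapping step with BBMap could look like this (file names are placeholders):

CMD:
bbmap.sh ref=assembly.fa in=reads_R1.fq in2=reads_R2.fq out=mapped.sam

BBMap prints the mapping rate, pairing rate, and substitution/indel counts when it finishes, which covers most of the statistics mentioned above.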

If you have EST data, then calculating the EST capture rate is also very useful in determining genome completeness and misassemblies.

Incidentally, I think Q30 is much too high for trimming. Also, what trimming tool are you using? In my tests, the top-performing ones use the phred algorithm (this includes seqtk and my own); all the other algorithms were consistently inferior. The attached graph shows the error rate of reads versus the bases remaining (from an initial 150Mbp dataset) after quality-trimming; each point represents a different quality cutoff, so the leftmost point is trimming to Q40, then Q39, etc. The black line is the best, but it uses error-correction AND quality-trimming; all the others use quality-trimming only. seqtk and bbtrim appear identical. The error rates were calculated by mapping back to the Pedobacter reference.
[Attached image: Clipboard01.jpg, read error rate vs. bases remaining after quality-trimming]
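
For instance, phred-style trimming to roughly Q10 with seqtk, whose -q option takes an error-rate threshold (Q10 corresponds to 0.1):

CMD:
seqtk trimfq -q 0.1 reads.fq > trimmed.fq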

Last edited by Brian Bushnell; 02-21-2014 at 05:28 PM.
02-21-2014, 06:07 PM   #17
Genomics101 (Member, Maryland, USA)

Yes, there is a reference that is pretty close, <2% difference at the base level, with a few rearrangements likely. I use Btrim for the trimming.
02-21-2014, 06:12 PM   #18
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Quote:
Originally Posted by Genomics101
Yes, there is a reference that is pretty close, <2% difference at the base level, with a few rearrangements likely. I use Btrim for the trimming.
That sounds close enough to use Quast to get a very good idea of assembly quality. Run it with the assembly and the reference as inputs, and it will report the quality of the assembly with respect to the reference. Even with 2% divergence and some rearrangements, the assembly closest to the reference should be the highest-quality one.
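
A minimal Quast run along these lines (paths are placeholders):

CMD:
quast.py assembly.fasta -R reference.fasta -o quast_results

The report includes N50, the number of misassemblies, and alignment statistics relative to the reference.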