
**Brian Bushnell** · 02-21-2014, 02:45 PM

Your coverage is too low. At 23x coverage with 100bp reads and k=60, you'll only get a kmer depth of about 40% of that, or around 10x. That will give a fragmented assembly that misses low-depth areas, and it makes it hard to distinguish valid kmers from error kmers.

Edit:

Oops, I see now that the old assemblies were 100bp and the new ones are 180bp. Still, the point remains that as you increase K you decrease kmer depth, and you appear to have too little data for that to help. What's the insert size distribution and quality distribution? I've seen a lot of MiSeq libraries get made with insert sizes shorter than read length, so you might also consider adapter-trimming based on kmers before quality trimming.
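For reference, the arithmetic behind that estimate can be sketched in a few lines of Python. This uses the relationship given in the Velvet manual, Ck = C * (L - k + 1) / L, and assumes all reads have the same length L; the function name is just for illustration.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Expected k-mer coverage Ck = C * (L - k + 1) / L for uniform-length reads."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Figures from the post above: 23x base coverage, 100 bp reads, k=60.
print(round(kmer_coverage(23, 100, 60), 2))  # 9.43
```

Each 100bp read contributes only 41 kmers of length 60, which is where the "about 40%" comes from.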

**AdrianP** · 02-21-2014, 03:40 PM

Originally posted by Genomics101

Velvet estimated the coverage averaging 23X.

Are you talking about kmer coverage or nucleotide coverage?

In other words, what is the cov value of your biggest contigs? (Those values are kmer coverage.)

**Genomics101** · 02-21-2014, 03:41 PM

Kmer coverage

**AdrianP** · 02-21-2014, 03:45 PM

Originally posted by Genomics101

Kmer coverage

Okay, then what the person in post #2 said doesn't apply, because they assumed nucleotide coverage. You need to go to higher kmers. Kmer coverage above 20 is a waste; you need to aim for between 10 and 20.

Use VelvetOptimiser and try kmers up to 160-180. You might see a second peak in N50; this is common.
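As a rough way to see which k values land in that 10-20 window, here is a small Python sketch. It inverts the Velvet manual's formula Ck = C * (L - k + 1) / L and only considers odd k, since Velvet uses odd hash lengths. The 43x base coverage and 180bp read length in the example are assumed inputs, and the function name is hypothetical; quality-trimmed reads vary in length, so treat the result as a ballpark only.

```python
def k_range_for_target(base_coverage, read_length, low=10.0, high=20.0):
    """Return (smallest, largest) odd k whose expected k-mer coverage is in [low, high]."""

    def cov(k):
        # Expected k-mer coverage, assuming uniform read length.
        return base_coverage * (read_length - k + 1) / read_length

    ks = [k for k in range(1, read_length + 1, 2) if low <= cov(k) <= high]
    return (ks[0], ks[-1]) if ks else None


# Assumed inputs: 43x base coverage, 180 bp reads.
print(k_range_for_target(43, 180))  # (99, 139)
```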

**Genomics101** · 02-21-2014, 03:46 PM

@Brian Bushnell The insert size mean value is 407 with an SD of 130. The 23X is kmer coverage; the base coverage is about 43X. The per sequence quality is between 36 and 38 for almost all of them. There are no adaptors as they are removed by the Illumina 1.9 pipeline.

**Genomics101** · 02-21-2014, 03:49 PM

@AdrianP Thanks for your reply, but I actually did kmers (with a larger gap between them) all the way up to 191 in the initial analysis:

**AdrianP** · 02-21-2014, 03:50 PM

Originally posted by Genomics101

@Brian Bushnell The insert size mean value is 407 with an SD of 130. The 23X is kmer coverage; the base coverage is about 43X. The per sequence quality is between 36 and 38 for almost all of them. There are no adaptors as they are removed by the Illumina 1.9 pipeline.

Okay, with a base coverage of 43X you do not want Velvet. DBG needs high coverage for repeat resolution.

My advice is to use SeqPrep to merge your reads. You should have 3 files (forward, reverse, and merged) after using it. Feed those to the MIRA assembler, which is an OLC assembler, and I expect you will get better results.

**Brian Bushnell** · 02-21-2014, 03:52 PM

Originally posted by Genomics101

@Brian Bushnell The insert size mean value is 407 with an SD of 130. The 23X is kmer coverage; the base coverage is about 43X. The per sequence quality is between 36 and 38 for almost all of them. There are no adaptors as they are removed by the Illumina 1.9 pipeline.

Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?

**AdrianP** · 02-21-2014, 03:57 PM

Originally posted by Brian Bushnell

Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?

To obtain that, I can recommend:

KmerGenie

http://kmergenie.bx.psu.edu/

CMD:
./kmergenie --diploid -k <higher_kmer> -e 1 -l <lower_kmer> -t <cpu_threads> -o <output_name> <read_location>

Start with a higher kmer of 101 and a lower kmer of 41, and see what the graphs show.

**Genomics101** · 02-21-2014, 04:16 PM

Originally posted by Brian Bushnell

Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?

**Brian Bushnell** · 02-21-2014, 04:31 PM

Ah - what I actually mean is, a graph for a fixed kmer length (of, say, 31) where the X axis is depth and Y axis is number of kmers found at that depth, both log-scale. Ideally, you should have a sharp peak at some depth (maybe 40) and it should drop dramatically on the left and right.

My attachment shows the 31-mer frequency histogram for E. coli synthetic reads. You can see a main peak at about 200 and a few repeat peaks after that. If the data were real and had uneven coverage, there would be a broad peak rather than a sharp one.

FYI, I generated this with the 'khist.sh' script in the BBMap package and plotted it in Excel.
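To make the idea of that histogram concrete, here is a minimal in-memory Python sketch. It is not how khist.sh works internally; it simply counts canonical kmers (a kmer and its reverse complement are treated as the same) and then tallies how many distinct kmers occur at each depth. The function names are illustrative, and this approach is only practical for small read sets.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_depth_histogram(reads, k=31):
    """Map depth -> number of distinct canonical kmers seen at that depth."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip kmers containing ambiguous bases
                counts[canonical(kmer)] += 1
    return dict(sorted(Counter(counts.values()).items()))


# Toy input with k=3 so the result is easy to check by hand.
print(kmer_depth_histogram(["ACGTA", "ACGTA"], k=3))  # {2: 1, 4: 1}
```

Plotting the depths on the X axis against the counts on the Y axis, both log-scaled, gives the kind of graph described above.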

Attached Files

ecoli.jpg (76.3 KB, 189 views)

**Genomics101** · 02-21-2014, 04:39 PM

I also have the option of using the longer and more uniform untrimmed reads, but the quality is pretty questionable:

I tried doing an assembly with these and got better N50s at very high kmers (~99-135), but the kmer depth has a weird bell-curve relationship with kmer size rather than a direct one. The lowest kmer I tried (45) had a coverage of only 1.2X, and the kmer coverage peaked at 41.5X at kmer=93 (which also gave a relatively good N50 of ~20kb). Also, I am very wary of using data with so many errors.

**Brian Bushnell** · 02-21-2014, 04:48 PM

Read length uniformity shouldn't matter to Velvet. Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10. Excessive trimming can also cause biases.
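To illustrate how much the threshold matters, here is a simplified Python sketch of 3'-end quality trimming. It assumes Phred+33 quality encoding and just removes trailing bases below the cutoff; real trimmers use more elaborate rules, and the function name and toy read are made up for the example.

```python
def trim_right(seq, qual, min_q=10, offset=33):
    """Drop trailing bases whose Phred quality is below min_q (Phred+33 assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 'I' = Q40, '5' = Q20, '#' = Q2.
seq, qual = "ACGTAC", "III5##"
print(trim_right(seq, qual, min_q=10))  # ('ACGT', 'III5')
print(trim_right(seq, qual, min_q=30))  # ('ACG', 'III')
```

A Q30 cutoff discards the Q20 base that a Q10 cutoff keeps, so stricter thresholds shorten reads faster.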

**Genomics101** · 02-21-2014, 04:54 PM

Originally posted by Brian Bushnell

Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10.

Thanks. I didn't trim by length, but by quality (Q30 cut off) and the reads just came out with most of them at around. But doing a less strict trimming may be the answer.

While I have your very helpful attention here: since I am doing the assembly with the untrimmed reads, do you have a suggestion for a good way to assess how the sequencing errors are affecting the accuracy of the contigs? Should I just BLAST a few regions that I have already covered with Sanger sequencing?


De novo assembly using Velvet: any idea why such small Kmers with long reads?
