  • Velvet and long Illumina reads

    I have previously used Velvet for paired-end de novo assembly of bacterial genomes using 36 bp reads. We have now received some reads that are 72 bp long. Using Velvet on these data sets results in a very large number of contigs (several thousand). When the reads are cut down to 36 bp, the number of contigs becomes what I would expect (a few hundred, with the correct genome size). I have also tried splitting each 72 bp read into two 36 bp reads with good results (i.e. it is not due to low quality at the ends; a sketch of this splitting is shown below). Has anyone else had these problems with "long" Illumina reads, and how did you deal with it?
    I have used both Velvet 0.7.20 and 0.7.31.

    Best regards,

    Peter
    Last edited by Peter Bjarke Olsen; 04-15-2009, 03:48 AM.
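    A minimal sketch of the read-splitting workaround described above (not from the original posters), assuming plain single-end FASTQ input; the file names and the 36 bp half-length are illustrative assumptions.

    Code:
    HALF = 36

    def split_fastq(in_path, out_path, half=HALF):
        """Split each read in a FASTQ file into two half-length reads."""
        with open(in_path) as fin, open(out_path, "w") as fout:
            while True:
                header = fin.readline().rstrip()
                if not header:
                    break                     # end of file
                seq = fin.readline().rstrip()
                fin.readline()                # '+' separator line
                qual = fin.readline().rstrip()
                # write the 5' half and the 3' half as two independent reads
                halves = ((seq[:half], qual[:half]),
                          (seq[half:2 * half], qual[half:2 * half]))
                for i, (s, q) in enumerate(halves, start=1):
                    fout.write(f"{header}_part{i}\n{s}\n+\n{q}\n")

    if __name__ == "__main__":
        # hypothetical file names
        split_fastq("reads_72bp.fastq", "reads_36bp_split.fastq")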

  • #2
    I think that Velvet works best within a certain range of depth of coverage. Did you try using fewer of the 72 bp reads to assemble contigs? If using half of the 72 bp reads gives you the expected number of contigs, then the issue is not long reads but too much sequence causing very high depth of coverage (a subsampling sketch follows below).
    --
    bioinfosm
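    One way to try the "fewer reads" experiment suggested above is random subsampling of the FASTQ file. A rough sketch (not from the original posters), assuming single-end FASTQ and an arbitrary keep fraction of 0.5; file names are illustrative.

    Code:
    import random

    FRACTION = 0.5   # illustrative: keep roughly half of the reads

    def subsample_fastq(in_path, out_path, fraction=FRACTION, seed=42):
        """Randomly keep a fraction of the records in a FASTQ file."""
        rng = random.Random(seed)
        with open(in_path) as fin, open(out_path, "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]   # one FASTQ record
                if not record[0]:
                    break                                     # end of file
                if rng.random() < fraction:
                    fout.writelines(record)

    if __name__ == "__main__":
        # hypothetical file names
        subsample_fastq("reads_72bp.fastq", "reads_72bp_subsampled.fastq")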



    • #3
      In mchaisson's paper, mate pairs significantly improve assembly, but read length doesn't help much until it passes a certain threshold (35 nt for E. coli and 60 nt for S. cerevisiae).

      Low coverage might be a problem. Longer reads can result in lower coverage in certain regions. My guess is that Velvet tries to maximize contig N50 and will not use reads in those poorly covered regions, so you get a fragmented assembly.

      Correct me if I'm wrong
      Melissa



      • #4
        Thanks for the answers. I think you are on to something, bioinfosm. I have reduced the raw data by 36% with the fastx-toolkit and get a more reasonable number of contigs with the expected genome size. I still find it strange that I can get the same (good) result by splitting the reads into 36 bp fragments.



        • #5
          I find that pretty strange too, Peter. Forgive me, but are you absolutely sure that what you have aren't paired 36bp reads? That would explain why leaving them together mucks up your assembly. Is it possible that your source did paired end reads and didn't pass that on to you?

          Another way to look at that would be to plot the average base quality versus cycle (a small plotting sketch follows below). If they're actually paired, then the most likely way you've received them would be that each "read" is actually cycles 1-72, but 1-36 are the forward read (5' to 3') and 37-72 are the reverse read (5' to 3'). So if you plot the qualities, or the base content (fraction A, T, C, or G), you should see a pattern going from 1-36 and then repeated from 37-72, e.g. mean quality decreasing until cycle 36, then jumping back up before decreasing along the same curve from 37 to 72.

          Sorry if I've run with an impossible theory here, but if I didn't know any better, it's something I'd suspect. (Except for the fact that ~1/3 of your "paired" reads actually result in a decent assembly ... that would seem to shoot my theory down) ...
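          A rough sketch of the per-cycle quality check suggested above (not from the original posters). It assumes Phred+33 (Sanger) quality encoding, which may need adjusting for older Illumina pipelines, and an illustrative input file name; concatenated pairs should show the quality curve dropping towards cycle 36 and then jumping back up at cycle 37.

          Code:
          import matplotlib.pyplot as plt

          def mean_quality_per_cycle(fastq_path, read_len=72, offset=33):
              """Return the mean Phred quality for each cycle (assumes Phred+33)."""
              sums = [0] * read_len
              counts = [0] * read_len
              with open(fastq_path) as fh:
                  for line_no, line in enumerate(fh):
                      if line_no % 4 == 3:                          # quality line
                          for cycle, ch in enumerate(line.rstrip()[:read_len]):
                              sums[cycle] += ord(ch) - offset
                              counts[cycle] += 1
              return [s / c if c else 0.0 for s, c in zip(sums, counts)]

          if __name__ == "__main__":
              means = mean_quality_per_cycle("reads_72bp.fastq")    # hypothetical file
              plt.plot(range(1, len(means) + 1), means)
              plt.xlabel("cycle")
              plt.ylabel("mean Phred quality")
              plt.savefig("quality_per_cycle.png")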



          • #6
            I understand your point. I also thought that it could be because the reads were two 36 bp reads instead of 72 bp reads, so it was one of the first things I checked. The quality drops gradually over the 72 bases, but nothing drastic.



            • #7
              I've noticed that there are enough extra errors at the ends of some of the 'longer' reads to muck up assemblies. Some of the coverage statistics used to remove erroneous edges that worked for shorter reads do not work as well with the longer ones. This is the case with Euler, and likely with Velvet as well.



              • #8
                That coverage thing is an aspect of Velvet that baffles me quite a bit:
                Test run for PhiX, 76 bp PE, 3.9 million read pairs: 1000+ contigs.
                20,000 reads out of the 3.9 million: ~40 contigs; with aggressive parameter tuning: 30, with N50 > 300.
                10,000 reads out of the 3.9 million: 1 contig of the perfect size, with 3 SNPs as BLAST discovers (and without any parameter tuning).

                I'm still wondering why Velvet is so coverage-sensitive.
                -Jonathan



                • #9
                  I have 75 bp paired-end read data for pollen beetle from an Illumina machine. I want to know whether Velvet will perform well for de novo assembly of these reads, and how much memory will be required: I have 8 GB of RAM and my two read files are 1.92 GB each, so Velvet gives a malloc/segmentation error.
                  Secondly, I want to know how we can convert the Velvet output .afg file to an .ace file.
                  Last edited by shahid.manzoor; 06-26-2009, 05:38 AM.



                  • #10
                    With 8 GB of RAM, you might be able to run the assembly for ~2 million reads single-end, or ~1 million read pairs, all using the highest k/hash value of 31 for the initial hashing step.

                    You might want to try ramping it up if that actually works, but I can tell you that 48 GB of RAM is not enough for 9.8 million reads (k = 31), i.e. 4.9 million read pairs (my guess is 55-70 GB for that amount of data).

                    BUT: this is just empirical, and only for my dataset; the internal graph structure of your assembly might be far better structured (by chance, mind you!) and consume less space, or more.

                    Edit:
                    Additionally, depending on the size of your organism, you might actually REALLY WANT to split the data, as Velvet tends to be finicky when it comes to 'deeper' coverage (I have not yet pinned down this soft border; a guess would be ~50x to 100x or more; see the back-of-the-envelope sketch below).

                    Best
                    -Jonathan
                    Last edited by Jonathan; 06-26-2009, 05:52 AM.
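                    As a back-of-the-envelope companion to the "split the data" advice (not from the original posters): the number of reads needed for a given target coverage is roughly target_coverage * genome_size / read_length. A small sketch with purely illustrative example numbers.

                    Code:
                    def reads_for_coverage(target_coverage, genome_size_bp, read_length_bp):
                        """Number of reads needed to reach a target depth of coverage."""
                        return int(target_coverage * genome_size_bp / read_length_bp)

                    if __name__ == "__main__":
                        # illustrative: a 5 Mb bacterial genome, 72 bp reads, aiming for ~50x
                        print(reads_for_coverage(50, 5_000_000, 72))   # ~3.5 million reads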



                    • #11
                      I'm working with shahid.manzoor on the same set of ~11 million 76 bp read pairs. We recently got Velvet working (we just needed to recompile for the correct 64-bit environment ... *facepalm*), and I have completed a few runs with k-mer sizes 55-63. I'm testing with large k-mer sizes mainly so that I can see results within 24 hours, until I get an idea of how to get useful data out of Velvet.
                      Our major problem right now is contig size and coverage. So far the largest contig I've obtained is 204 bp, and most contigs consist of only two overlapping reads.
                      So far I have avoided setting any parameters other than -ins_length 187, which was provided with the Illumina output. Which parameters would you recommend changing in order to get longer contigs?

                      Greetings,
                      Ingemar



                      • #12
                        cov_cutoff is one parameter that turns out to be important.
                        What kind of coverage depth do you expect with all these reads? Having it close to 40x helps with Velvet, in my experience (see the k-mer coverage sketch below).
                        --
                        bioinfosm
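                        Note that Velvet's cov_cutoff is expressed in k-mer coverage rather than nucleotide coverage; the usual conversion is Ck = C * (L - k + 1) / L for read length L and hash length k. A small sketch of that arithmetic (not from the original posters); the factor of 0.5 used to derive a starting cutoff and the example numbers are assumptions, not a recommendation from this thread.

                        Code:
                        def kmer_coverage(nt_coverage, read_length, k):
                            """Convert nucleotide coverage C to k-mer coverage: C * (L - k + 1) / L."""
                            return nt_coverage * (read_length - k + 1) / read_length

                        def suggested_cov_cutoff(nt_coverage, read_length, k, factor=0.5):
                            """A starting point for cov_cutoff: a fraction of the expected k-mer coverage."""
                            return factor * kmer_coverage(nt_coverage, read_length, k)

                        if __name__ == "__main__":
                            # illustrative: ~40x nucleotide coverage, 76 bp reads, hash length 31
                            print(kmer_coverage(40, 76, 31))          # ~24.2x expected k-mer coverage
                            print(suggested_cov_cutoff(40, 76, 31))   # ~12.1 as a first cov_cutoff guess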



                        • #13
                          Thanks for the tip, I'll try changing the coverage cutoff today.
                          Our coverage depth may turn out to be a problem. Assuming a genome size of about 200 Mb (like the model Tribolium castaneum), we have only about 5x coverage. =/ For some reason, the lab that did the sequencing did not provide any estimate of the genome size or coverage, so my assumptions are all I have to go on (a rough coverage calculation is sketched below).

                          EDIT:
                          Sorry everyone, false alarm. It turns out that the sequencing lab bungled the run by gathering data from the wrong lanes ... We have been trying to build a 200 Mb genome using reads from the PhiX control kit.
                          Last edited by ohlsson; 07-17-2009, 12:37 AM.
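                          The coverage estimate is just total sequenced bases divided by genome size. A quick sketch (not from the original posters); the example numbers are generic and illustrative, not the exact figures from this dataset.

                          Code:
                          def estimated_coverage(num_reads, read_length_bp, genome_size_bp):
                              """Depth of coverage = total sequenced bases / genome size."""
                              return num_reads * read_length_bp / genome_size_bp

                          if __name__ == "__main__":
                              # illustrative: 10 million 76 bp reads against a 150 Mb genome
                              print(estimated_coverage(10_000_000, 76, 150_000_000))   # ~5.1x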



                          • #14
                            Originally posted by ohlsson View Post
                            Assuming a genome size of about 200 Mb (like the model Tribolium castaneum), we have only about 5x coverage.
                            5x coverage?

                            Velvet is not going to like that. Velvet wants more like 50x to get nice big contigs.

                            11 million reads on a 5 Mb genome would probably velvet nicely. I think 20 Mb would be pushing it. And if you aren't getting any Velvet contigs ... that's probably why.



                            • #15
                              This thread is making me wonder about my data. I have 9x10^6 reads of 75 bp, and I want the plastid genome. These genomes are quite small (only ~135 kb), and DNA extractions usually contain LOTS of plastid DNA. I would expect to recover the plastid genome in relatively large pieces, but I'm not! The largest contigs that I am recovering are only around 3 kb.

                              Could it be that the coverage is too high and Velvet is having problems with this?
                              What do you all suggest?
                              Should I use the fastx toolkit to make shorter reads?
                              Should I use fewer reads and then combine contigs post-Velvet?
                              Other suggestions?

