04-16-2013, 07:43 AM   #1
reubennowell (Edinburgh)
How to set -max_coverage in velvetg

Hello,

I am assembling bacterial genomes (~6 Mb) using 250 bp paired-end MiSeq data. I have tried a bunch of assemblers (idba_ud, mira, ray, SOAPdenovo, ABySS, to name a few...), but am getting reasonably good results using good old velvet (~360 contigs, N50 = 40 kb). But I have a question about how to set the velvetg parameter -max_coverage. Its value has a large effect on the resulting number of contigs and the total number of bases in the assembly (i.e. the assembled genome size). Am I correct in thinking that many of these high-coverage nodes are errors (or at least error-prone, like repeat elements etc.) and should be excluded for a better assembly?

I estimate the coverage distribution (in R, using plotrix) from the stats.txt file after running a preliminary:
Code:
velvetg velvet_big_127 -cov_cutoff auto -exp_cov auto
It is then fairly easy to calculate the length-weighted mean coverage for -exp_cov and to set a reasonable value for -cov_cutoff, but there is often a long tail in the distribution, meaning that a small number of nodes have very high coverage.
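For reference, a minimal command-line sketch of that calculation (it assumes the usual stats.txt column layout, with column 2 = node length and column 6 = short1_cov, and reuses the directory name from the command above; adjust both to your own run):
Code:
# Length-weighted mean node coverage from velvet's stats.txt
# (assumed tab-separated columns: ID, lgth, out, in, long_cov, short1_cov, ...)
awk 'NR > 1 { w += $2; s += $2 * $6 } END { if (w > 0) print s / w }' velvet_big_127/stats.txt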

Generally, what is a good way to determine a sensible value for -max_coverage?

Many thanks! Reuben
04-17-2013, 12:48 AM   #2
Torst (The University of Melbourne, AUSTRALIA)

You should run velvetg WITHOUT the two auto options first if you want to examine the coverage plot in R.
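A minimal sketch of that two-pass approach (directory name taken from the first post; the cutoff values in the second command are placeholders to be read off your own coverage plot):
Code:
# Pass 1: no cutoffs, so stats.txt reflects the raw node coverage distribution
velvetg velvet_big_127
# ...inspect the length-weighted coverage histogram from stats.txt, then...
# Pass 2: rerun with cutoffs chosen from the plot (example values only)
velvetg velvet_big_127 -cov_cutoff 10 -exp_cov 35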

There is no need to set -max_coverage unless you know you have a HIGH level of undesirable contamination. Those high-coverage contigs are important - they will likely be your ribosomal RNA loci and insertion sequences.

If you are using 250 bp PE MiSeq data I would first run FLASH to pre-combine any overlapping pairs. Also, if Nextera was used to prep your samples, and your reads are all exactly 250 bp long, then your provider probably did NOT remove all the Nextera adaptors, so you will have to do it yourself.

For the record, 360 contigs for a 6 Mbp bacterium is pretty good, considering it will probably have a bunch of insertion sequences. Pseudomonas?
04-17-2013, 08:43 AM   #3
reubennowell (Edinburgh)

Hello and thanks for the advice Torst,

Firstly, yes, these are Pseudomonas genomes - well deduced! How did you guess??

So, just to clarify, the reads have been adapter-trimmed using CutAdapt and quality-trimmed using Condetri, so there is considerable variation in read length in the input FastQ files.

Doing as you suggest without setting any -max_coverage gives 487 contigs (after scaffolding and gap-filling with SSPACE and GapFiller)... I'm just a bit surprised it's not lower than that, considering the length of the reads.

I did in fact try FLASH, and it combined a whopping ~80% of my reads. But running velvet on these data did not improve the contig numbers/N50 etc. It also led to really strange coverage distributions, where a huge proportion of the nodes had a coverage of only 1 and were thus discarded when -cov_cutoff was set. I suspect this may be due to FLASH not working optimally - perhaps combining reads that should not be combined, or something? Have you used FLASH on MiSeq reads with success before? I can't help feeling that something is not quite right with it yet!

Thanks again, and any feedback or tips are most welcome
Reuben
04-17-2013, 03:23 PM   #4
Torst (The University of Melbourne, AUSTRALIA)

The guess of Pseudomonas was based on genome size, number of contigs, and googling your name :-) I had similar results for the Pa AES strains.

Nextera seems to produce a lot of very short reads, many of which completely overlap each other. The default parameters of FLASH are not suitable for 250 bp MiSeq reads. You need to change -M, -r, -f, and -s to bigger values, as the defaults are tuned for 2x100 bp.
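As a rough illustration only (the numbers below are placeholders, not recommendations - they should reflect your actual read length, fragment length and fragment standard deviation, and defaults can differ between FLASH versions):
Code:
flash reads_R1.fastq reads_R2.fastq -M 250 -r 250 -f 450 -s 100 -o flashed
# outputs: flashed.extendedFrags.fastq, flashed.notCombined_1.fastq, flashed.notCombined_2.fastq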

Ultimately we are finding that many Nextera 2x250 runs aren't that much better than TruSeq 2x150 or 2x100, due to the smaller fragment sizes. It's going to be a while before people fine-tune it. If you can afford to do TruSeq size selection and then sequence 2x250, you might get better results.
04-23-2013, 09:56 AM   #5
reubennowell (Edinburgh)

Thanks again Torst. Unfortunately the sequencing stage of our project is completed - we'll have to work with what we've got!

I have now tried using FLASH with more realistic input parameters (-m 100 -r 207 -f 302 -s 132), but again there is no 'improvement' (as far as N50, number of contigs etc. go) in the final assembly... I find this a bit puzzling, since there is much talk in the literature about how much improvement can be gained by combining reads like this.

I may open this up as a new thread in fact...

Cheers!
04-24-2013, 07:26 AM   #6
boetsie (NL, Leiden)

Quote: Originally Posted by reubennowell (post #5 above)
If you have generated larger reads by combining the pairs, you could/should also set the k-mer size higher. Did you do that?
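For anyone following along, a sketch of what that involves with velvet; it assumes velvet is recompiled with a MAXKMERLENGTH at least as large as the chosen k, and the file names and k value are placeholders:
Code:
# Rebuild velvet to allow longer k-mers than the current limit
make clean && make 'MAXKMERLENGTH=191'
# Re-hash the merged reads (as single-end) plus the shuffled unmerged pairs at a larger odd k
velveth assembly_k191 191 -fastq -short flashed.extendedFrags.fastq \
        -fastq -shortPaired notCombined_shuffled.fastq
velvetg assembly_k191 -cov_cutoff auto -exp_cov auto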

Regards,
Boetsie
04-26-2013, 08:11 AM   #7
reubennowell (Edinburgh)

Hi Boetsie. I'm running velvet with a k-mer of around 127 currently, so quite large already. I tried upping it a bit, with no improvement in the assembly statistics. But there is considerable variation in read length in both my unmerged paired-end reads and my merged reads (post-FLASH), like this:

notCombined:
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 177.0 222.0 202.4 249.0 251.0

extendedFrags (ie. 'FLASHed' reads):
Min. 1st Qu. Median Mean 3rd Qu. Max.
100.0 195.0 241.0 243.9 294.0 402.0

Please correct me if I am wrong, but can Velvet use reads that are shorter than the k-mer length? There is clearly a significant fraction of reads in both datasets that are relatively short. For example, if I up the k-mer value to something around the 200 mark, does this render ~one quarter of the reads unusable? The k-mer coverage plots (from the stats.txt Velvet output) for assemblies using these FLASHed datasets look very strange too, with a very large number of k-mers having a coverage of only 1...

I'm still trying to get my head around what FLASH is actually doing to the reads (I mean specifically to my reads) - perhaps it's stitching together reads very badly, resulting in a bunch of reads that do not assemble?

I just find it weird that FLASH certainly seems to merge a lot of my reads (i.e. see the stats above), but that this has no effect (or a negative effect) on the resulting assembly...

If you have any ideas, please do let me know!

Thanks,
Reuben
04-29-2013, 12:49 PM   #8
ewilbanks (Davis, CA)

My experience with assembling 2x250 bp metagenomic data with idba_ud is that I saw little to no improvement using the longer reads unless I increased the maxK to ~230-250. This will hog memory, but the assemblies (N50, N75) get significantly bigger.
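To make that concrete, a hedged sketch of that kind of run (file names and the k-mer range are placeholders; --maxk above 124 needs the rebuilt idba_ud discussed in the next few posts):
Code:
# idba_ud takes interleaved FASTA, so convert the paired FASTQ files first
fq2fa --merge reads_1.fastq reads_2.fastq reads.fa
# Iterative assembly up to a large k-mer (memory-hungry at these settings)
idba_ud -r reads.fa --mink 40 --maxk 240 --step 20 --num_threads 8 -o idba_out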
05-02-2013, 08:48 AM   #9
reubennowell (Edinburgh)

That's interesting - so had you run FLASH or something similar also? Was the distribution of read lengths similar to what I have?

Also, idba_ud has a maximum k-mer size of 124. How did you recompile it with a larger max k-mer? Can you remember which file you had to modify? The README says to increase kMaxShortSequence in src/sequence/short_sequence.h, but not how to raise the maximum k-mer value.

Thanks!
Reuben
05-02-2013, 09:01 AM   #10
ewilbanks (Davis, CA)

Nope - I just threw the two reads in directly without end-assembling. Haven't tried assembling them together or anything yet.

To increase maxK, I modified src/basic/kmer.h and increased kNumUint64 from 4 to 8. Be careful doing this: it really increases the required memory! With my dataset, I don't run this on machines with <32 GB of RAM. That said, the effect on N50 is pretty dramatic with larger maxK. Let me know if you need help getting that to work and I'll see about digging up my actual modified file.
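For anyone repeating this, a rough sketch of the process (the exact declaration syntax and build steps vary between idba_ud versions, so treat it as a pointer rather than a recipe):
Code:
# Locate the constant mentioned above, then change its value from 4 to 8 by hand
grep -n "kNumUint64" src/basic/kmer.h
# edit src/basic/kmer.h: kNumUint64 = 4  ->  kNumUint64 = 8
# then rebuild idba_ud (build steps depend on your checkout; see its README)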

05-03-2013, 02:50 AM   #11
reubennowell (Edinburgh)

Thanks! That worked a treat. And running it up to k=250 produced an assembly with the highest N50 I've managed to produce thus far (>60kb for a 6Mb bacterial genome), across a range of assemblers (and I've tried loads).

There are still a lot of short contigs though:
Number of contigs 847
Number of contigs > 500 nt 382

I'm going to try a read-correction procedure (Reptile probably) to see if I can bring this number down.

Thanks for your input, and if you have any other tips or tricks please do let me know; it's a great way of learning!

Reuben
05-03-2013, 09:24 AM   #12
ewilbanks (Davis, CA)

Great! I have noticed that too - for some reason, with these changes the minimum contig length cutoff doesn't seem to work properly. I've just been post-filtering to exclude the little stuff under 500bp or 1kb.
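As an illustration of that post-filtering step, one possible one-liner (assuming seqtk is installed; file names are placeholders and 500 bp is just the example threshold from above):
Code:
# Drop contigs shorter than 500 bp from the assembly FASTA
seqtk seq -L 500 contigs.fa > contigs.min500.fa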

I'm curious - how did your scaffolds.fa look? One thing I've seen is that there are no Ns, suggesting really good overlap and coverage. Did you see something similar?
05-06-2013, 02:55 AM   #13
reubennowell (Edinburgh)

Yeah, there are zero N's in the scaffolds.fa file in my assemblies also. I'm going to run SSPACE and GapFiller anyway to see if there is further improvement.

Re the small contigs (<500 bp): generally speaking, what are these? Could they be single reads that don't map anywhere, possibly due to a large number of errors in that read? And is it generally acceptable to exclude these from final reports of assembly metrics?

I had a look at what effect both read correction and read merging had on the assembly. I tried both Reptile and Coral read correction, with both merged and unmerged datasets where possible (using FLASH). Generally, these pre-processes had little effect on the overall assembly metrics:

Pre-processing                 # scaffolds   # scaffolds >500 bp     N50   genome size (Mb)
No read correction, no FLASH           847                   382   61299              6.341
Reptile, no FLASH                      845                   387   62820              6.343
Coral, no FLASH                        963                   452   51815              6.335
No read correction, FLASH             1060                   397   60802              6.401

I'm still a little puzzled as to why these pre-processes fail to improve the assembly greatly (Reptile read correction does produce a slightly better N50, but Reptile is not very easy to use...)

Using FLASH on Reptile-corrected reads would be interesting, but is super-faffy because of file output issues.

Cheers!
05-09-2013, 08:10 AM   #14
hakattack (IA, USA)

Hi,

I have two short paired-end MiSeq FASTQ files (2 x 2,200,250 reads) where the read length is 26 bases, and I would like to perform de novo assembly on these reads. I just installed Velvet (velveth, velvetg) from the Galaxy Tool Shed. I used the tools on Galaxy but I keep getting no result. I figured this could be due to the parameters I am using. I used the default parameters, since I am not sure which ones to change to suit my short read length.

I decided to do this from the command line.
1. I shuffled both FASTQ files into one.
2. Then ran: velveth auto 31,42,2 -fastq -shortPaired1 shuffled_sequences.fq
I got one auto_31 folder, but now I am not sure what to use for VAL1 and VAL2 in the command:
velvetg auto_31 -exp_cov VAL1 -ins_length1 VAL2

I would appreciate any kind of help, either with how to work in Galaxy or with the command-line options.

Regards,
05-09-2013, 08:43 AM   #15
mastal (UK)

Hi,

I'm not very familiar with Galaxy, but I am familiar with velvet.

The kmer size has to be an odd number, and you can't have a kmer size longer than your reads.

So if your reads are 26 bp, the longest k-mer you can try is 25, so 31,42,2 will not work.

When you first run velvetg you can set -exp_cov to auto, and the program will calculate a value for exp_cov.

ins_length for velvet is the size of the two reads in the pair plus the distance between them. You should have some estimate of the insert size for the library from your sequence provider; otherwise velvet has a script that can calculate an insert length from the assembled read pairs.
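Putting that together, a hedged sketch for 26 bp reads (the file name and insert length are placeholders - use your library's actual insert size):
Code:
# Hash over odd k-mers from 15 up to 25 (velveth ranges are start,end,step)
velveth assembly 15,27,2 -fastq -shortPaired shuffled_sequences.fq
# Then assemble one of the resulting directories, e.g. k=25
velvetg assembly_25 -cov_cutoff auto -exp_cov auto -ins_length 150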

Hope this helps,
maria
05-09-2013, 11:18 AM   #16
hakattack (IA, USA)

Thank you Maria. This is a big help for me. Taking your advice, here is what I did:

velveth auto 25 -fastq -shortPaired1 myRef_shuffled.fq
velvetg auto_25 -exp_cov auto -ins_length1 150
I get
Final graph has 2 nodes and n50 of 1, max 1, total 2, using 0/10148414 reads

However, I get an empty contigs.fa.

Now when I try k-mer 21,
velveth auto 21,25,3 -fastq -shortPaired1 myRef_shuffled.fq
velvetg auto_21 -exp_cov auto -ins_length1 150

I get some contigs; however, they do not align with my reference sequence (500 bases).

Do you see anything wrong with what I am doing? I should get some contigs that align with the reference 100%.
05-09-2013, 12:22 PM   #17
mastal (UK)

The assembly with k=25 won't give you any contigs because velvet has a default minimum contig size, and contigs smaller than that won't be reported in the contigs.fa file.

You should probably try a wider range of k-mer sizes initially: just run velvetg without any parameters other than the directory with the velveth results, then do more runs of velvetg using whichever k-mer length or lengths look more promising. Hopefully, if you manage to get a better assembly, you will get something that aligns to your reference.
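A small sketch of that first-pass sweep (directory names assume velveth's usual <prefix>_<k> naming; adjust to whatever your velveth run produced):
Code:
# Quick first pass: default velvetg on every velveth output directory,
# capturing the summary line so the different k values can be compared
for d in auto_*; do
    velvetg "$d" > "$d/velvetg_firstpass.log"
done
grep -H "Final graph" auto_*/velvetg_firstpass.log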
05-09-2013, 01:55 PM   #18
hakattack (IA, USA)

Thank you Mastal,

I have done what you suggested and I had better luck than before. In one case I got 100% when I tried k=19, but in another case I tried all of k = 15, 17, 19, 21, 23 and the best contig I got was 75%, but not more than that, so I gave up. I am assuming I should not expect better outcomes with such short reads - would you agree?

Thanks
Tags: coverage, genome assembly, velvet