04-16-2013, 07:43 AM   #1
reubennowell (Edinburgh)
How to set -max_coverage in velvetg

Hello,

I am assembling bacterial genomes (~6 Mb) using 250 bp paired-end MiSeq data. I have tried a bunch of assemblers (idba_ud, mira, ray, SOAPdenovo, ABySS, to name a few...), but am getting reasonably good results using good old velvet (~360 contigs, N50 = 40 kb). But I have a question about how to set the velvetg parameter -max_coverage. Its value has a large effect on the resulting number of contigs and the total number of bases in the assembly (i.e. the assembled genome size). Am I correct in thinking that many of these high-coverage nodes are errors (or at least error-prone, like repeat elements etc.) and should be excluded for a better assembly?

I estimate the coverage distribution (in R, using plotrix) from the stats.txt file after running a preliminary:
Code:
velvetg velvet_big_127 -cov_cutoff auto -exp_cov auto
It is then fairly easy to calculate the length-weighted mean coverage for -exp_cov and to set a reasonable value for -cov_cutoff, but there is often a long tail in the distribution, meaning that a small number of nodes have very high coverage.
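For reference, a minimal command-line sketch of that calculation (it assumes the usual stats.txt column layout, with column 2 = node length and column 6 = short1_cov, and reuses the directory name from the command above; adjust both to your own run):
Code:
# Length-weighted mean node coverage from velvet's stats.txt
# (assumed tab-separated columns: ID, lgth, out, in, long_cov, short1_cov, ...)
awk 'NR > 1 { w += $2; s += $2 * $6 } END { if (w > 0) print s / w }' velvet_big_127/stats.txt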

Generally, what is a good way to determine a sensible value for -max_coverage?

Many thanks! Reuben
04-17-2013, 12:48 AM   #2
Torst (The University of Melbourne, AUSTRALIA)

You should run velvetg WITHOUT the two auto options first if you want to examine the coverage plot in R.
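A minimal sketch of that two-pass approach (directory name taken from the first post; the cutoff values in the second command are placeholders to be read off your own coverage plot):
Code:
# Pass 1: no cutoffs, so stats.txt reflects the raw node coverage distribution
velvetg velvet_big_127
# ...inspect the length-weighted coverage histogram from stats.txt, then...
# Pass 2: rerun with cutoffs chosen from the plot (example values only)
velvetg velvet_big_127 -cov_cutoff 10 -exp_cov 35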

There is no need to set -max_coverage unless you know you have a HIGH level of undesirable contamination. Those high-coverage contigs are important - they will likely be your ribosomal RNA loci and insertion sequences.

If you are using 250 bp PE MiSeq data I would first run FLASH to pre-combine any overlapping pairs. Also, if Nextera was used to prep your samples, and your reads are all exactly 250 bp long, then your provider probably did NOT remove all the Nextera adaptors, so you will have to do it yourself.

For the record, 360 contigs for a 6 Mbp bacterium is pretty good, considering it will probably have a bunch of insertion sequences. Pseudomonas?
04-17-2013, 08:43 AM   #3
reubennowell (Edinburgh)

Hello and thanks for the advice Torst,

Firstly, yes, these are Pseudomonas genomes - well deduced! How did you guess??

So, just to clarify, the reads have been adapter-trimmed using CutAdapt and quality-trimmed using Condetri, so there is considerable variation in read length in the input FastQ files.

Doing as you suggest without setting any -max_coverage gives 487 contigs (after scaffolding and gap-filling with SSPACE and GapFiller)... I'm just a bit surprised it's not lower than that, considering the length of the reads.

I did in fact try FLASH, and it combined a whopping ~80% of my reads. But running velvet on these data did not improve the contig numbers/N50 etc. It also led to really strange coverage distributions, where a huge proportion of the nodes had a coverage of only 1 and were thus discarded when -cov_cutoff was set. I suspect this may be due to FLASH not working optimally - perhaps combining reads that should not be combined, or something? Have you used FLASH on MiSeq reads with success before? I can't help feeling that something is not quite right with it yet!

Thanks again, and any feedback or tips are most welcome
Reuben
04-17-2013, 03:23 PM   #4
Torst (The University of Melbourne, AUSTRALIA)

The guess of Pseudomonas was based on genome size, number of contigs, and googling your name :-) I had similar results for the Pa AES strains.

Nextera seems to produce a lot of very short reads, many of which completely overlap each other. The default parameters of FLASH are not suitable for 250 bp MiSeq reads. You need to change -M, -r, -f, and -s to bigger values, as the defaults are tuned for 2x100 bp.
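As a rough illustration only (the numbers below are placeholders, not recommendations - they should reflect your actual read length, fragment length and fragment standard deviation, and defaults can differ between FLASH versions):
Code:
flash reads_R1.fastq reads_R2.fastq -M 250 -r 250 -f 450 -s 100 -o flashed
# outputs: flashed.extendedFrags.fastq, flashed.notCombined_1.fastq, flashed.notCombined_2.fastq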

Ultimately we are finding that many Nextera 2x250 runs aren't that much better than TruSeq 2x150 or 2x100, due to the smaller fragment sizes. It's going to be a while before people fine-tune it. If you can afford to do TruSeq size selection and then sequence 2x250, you might get better results.
04-23-2013, 09:56 AM   #5
reubennowell (Edinburgh)

Thanks again Torst. Unfortunately the sequencing stage of our project is completed - we'll have to work with what we've got!

I have now tried using FLASH with more realistic input parameters (-m 100 -r 207 -f 302 -s 132), but again there is no 'improvement' (as far as N50, number of contigs etc. go) in the final assembly... I find this a bit puzzling, since there is much talk in the literature about how much improvement can be gained by combining reads like this.

I may open this up as a new thread in fact...

Cheers!
04-24-2013, 07:26 AM   #6
boetsie (NL, Leiden)

Quote: Originally Posted by reubennowell (post #5 above)
If you have generated larger reads by combining the pairs, you could/should also set the k-mer size higher. Did you do that?
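For anyone following along, a sketch of what that involves with velvet; it assumes velvet is recompiled with a MAXKMERLENGTH at least as large as the chosen k, and the file names and k value are placeholders:
Code:
# Rebuild velvet to allow longer k-mers than the current limit
make clean && make 'MAXKMERLENGTH=191'
# Re-hash the merged reads (as single-end) plus the shuffled unmerged pairs at a larger odd k
velveth assembly_k191 191 -fastq -short flashed.extendedFrags.fastq \
        -fastq -shortPaired notCombined_shuffled.fastq
velvetg assembly_k191 -cov_cutoff auto -exp_cov auto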

Regards,
Boetsie
04-26-2013, 08:11 AM   #7
reubennowell (Edinburgh)

Hi Boetsie. I'm running velvet with a k-mer of around 127 currently, so quite large already. I tried upping it a bit, with no improvement in the assembly statistics. But there is considerable variation in read length in both my unmerged paired-end reads and my merged reads (post-FLASH), like this:

notCombined:
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 177.0 222.0 202.4 249.0 251.0

extendedFrags (ie. 'FLASHed' reads):
Min. 1st Qu. Median Mean 3rd Qu. Max.
100.0 195.0 241.0 243.9 294.0 402.0

Please correct me if I am wrong, but can Velvet use reads that are shorter than the k-mer length? There is clearly a significant fraction of reads in both datasets that are relatively short. For example, if I up the k-mer value to something around the 200 mark, does this render ~one quarter of the reads unusable? The k-mer coverage plots (from the stats.txt Velvet output) for assemblies using these FLASHed datasets look very strange too, with a very large number of k-mers having a coverage of only 1...

I'm still trying to get my head around what FLASH is actually doing to the reads (I mean specifically to my reads) - perhaps it's stitching together reads very badly, resulting in a bunch of reads that do not assemble?

I just find it weird that FLASH certainly seems to merge a lot of my reads (i.e. see the stats above), but that this has no effect (or a negative effect) on the resulting assembly...

If you have any ideas, please do let me know!

Thanks,
Reuben
04-29-2013, 12:49 PM   #8
ewilbanks (Davis, CA)

My experience with assembling 2x250 bp metagenomic data with idba_ud is that I saw little to no improvement using the longer reads unless I increased the maxK to ~230-250. This will hog memory, but the assemblies (N50, N75) get significantly bigger.
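To make that concrete, a hedged sketch of that kind of run (file names and the k-mer range are placeholders; --maxk above 124 needs the rebuilt idba_ud discussed in the next few posts):
Code:
# idba_ud takes interleaved FASTA, so convert the paired FASTQ files first
fq2fa --merge reads_1.fastq reads_2.fastq reads.fa
# Iterative assembly up to a large k-mer (memory-hungry at these settings)
idba_ud -r reads.fa --mink 40 --maxk 240 --step 20 --num_threads 8 -o idba_out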
05-02-2013, 08:48 AM   #9
reubennowell (Edinburgh)

That's interesting - so had you run FLASH or something similar also? Was the distribution of read lengths similar to what I have?

Also, idba_ud has a maximum k-mer size of 124. How did you recompile it with a larger max k-mer? Can you remember which file you had to modify? The README says to increase kMaxShortSequence in src/sequence/short_sequence.h, but not how to raise the maximum k-mer value.

Thanks!
Reuben
05-02-2013, 09:01 AM   #10
ewilbanks (Davis, CA)

Nope - I just threw the two reads in directly without end-assembling. Haven't tried assembling them together or anything yet.

To increase maxK, I modified src/basic/kmer.h and increased kNumUint64 from 4 to 8. Be careful doing this: it really increases the required memory! With my dataset, I don't run this on machines with <32 GB of RAM. That said, the effect on N50 is pretty dramatic with larger maxK. Let me know if you need help getting that to work and I'll see about digging up my actual modified file.
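For anyone repeating this, a rough sketch of the process (the exact declaration syntax and build steps vary between idba_ud versions, so treat it as a pointer rather than a recipe):
Code:
# Locate the constant mentioned above, then change its value from 4 to 8 by hand
grep -n "kNumUint64" src/basic/kmer.h
# edit src/basic/kmer.h: kNumUint64 = 4  ->  kNumUint64 = 8
# then rebuild idba_ud (build steps depend on your checkout; see its README)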

05-03-2013, 02:50 AM   #11
reubennowell (Edinburgh)

Thanks! That worked a treat. And running it up to k=250 produced an assembly with the highest N50 I've managed to produce thus far (>60kb for a 6Mb bacterial genome), across a range of assemblers (and I've tried loads).

There are still a lot of short contigs though:
Number of contigs 847
Number of contigs > 500 nt 382

I'm going to try a read-correction procedure (Reptile probably) to see if I can bring this number down.

Thanks for your input, and if you have any other tips or tricks please do let me know; it's a great way of learning!

Reuben
05-03-2013, 09:24 AM   #12
ewilbanks (Davis, CA)

Great! I have noticed that too - for some reason, with these changes the minimum contig length cutoff doesn't seem to work properly. I've just been post-filtering to exclude the little stuff under 500bp or 1kb.
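As an illustration of that post-filtering step, one possible one-liner (assuming seqtk is installed; file names are placeholders and 500 bp is just the example threshold from above):
Code:
# Drop contigs shorter than 500 bp from the assembly FASTA
seqtk seq -L 500 contigs.fa > contigs.min500.fa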

I'm curious - how did your scaffolds.fa look? One thing I've seen is that there are no Ns, suggesting really good overlap and coverage. Did you see something similar?
05-06-2013, 02:55 AM   #13
reubennowell (Edinburgh)

Yeah, there are zero N's in the scaffolds.fa file in my assemblies also. I'm going to run SSPACE and GapFiller anyway to see if there is further improvement.

Re the small contigs (<500 bp): generally speaking, what are these? Could they be single reads that don't map anywhere, possibly due to a large number of errors in that read? And is it generally acceptable to exclude these from final reports of assembly metrics?

I had a look at what effect both read correction and read merging had on the assembly. I tried both Reptile and Coral read correction, with both merged and unmerged datasets where possible (using FLASH). Generally, these pre-processes had little effect on the overall assembly metrics:

Pre-processing                 # scaffolds   # scaffolds >500 bp     N50   genome size (Mb)
No read correction, no FLASH           847                   382   61299              6.341
Reptile, no FLASH                      845                   387   62820              6.343
Coral, no FLASH                        963                   452   51815              6.335
No read correction, FLASH             1060                   397   60802              6.401

I'm still a little puzzled as to why these pre-processes fail to improve the assembly greatly (Reptile read correction does produce a slightly better N50, but Reptile is not very easy to use...)

Using FLASH on Reptile-corrected reads would be interesting, but is super-faffy because of file output issues.

Cheers!
05-09-2013, 08:10 AM   #14
hakattack (IA, USA)

Hi,

I have two short paired-end MiSeq FASTQ files (2 x 2,200,250 reads) where the read length is 26 bases, and I would like to perform de novo assembly on these reads. I just installed Velvet (velveth, velvetg) from the Galaxy Tool Shed. I used the tools on Galaxy but I keep getting no result. I figured this could be due to the parameters I am using. I used the default parameters, since I am not sure which ones to change to suit my short read length.

I decided to do this from the command line.
1. I shuffled both FASTQ files into one.
2. Then ran: velveth auto 31,42,2 -fastq -shortPaired1 shuffled_sequences.fq
I got one auto_31 folder, but now I am not sure what to use for VAL1 and VAL2 in the command:
velvetg auto_31 -exp_cov VAL1 -ins_length1 VAL2

I would appreciate any kind of help, either with how to work in Galaxy or with the command-line options.

Regards,
05-09-2013, 08:43 AM   #15
mastal (UK)

Hi,

I'm not very familiar with Galaxy, but I am familiar with velvet.

The kmer size has to be an odd number, and you can't have a kmer size longer than your reads.

So if your reads are 26 bp, the longest k-mer you can try is 25, so 31,42,2 will not work.

When you first run velvetg you can set -exp_cov to auto, and the program will calculate a value for exp_cov.

ins_length for velvet is the size of the two reads in the pair plus the distance between them. You should have some estimate of the insert size for the library from your sequence provider; otherwise velvet has a script that can calculate an insert length from the assembled read pairs.
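Putting that together, a hedged sketch for 26 bp reads (the file name and insert length are placeholders - use your library's actual insert size):
Code:
# Hash over odd k-mers from 15 up to 25 (velveth ranges are start,end,step)
velveth assembly 15,27,2 -fastq -shortPaired shuffled_sequences.fq
# Then assemble one of the resulting directories, e.g. k=25
velvetg assembly_25 -cov_cutoff auto -exp_cov auto -ins_length 150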

Hope this helps,
maria
05-09-2013, 11:18 AM   #16
hakattack (IA, USA)

Thank you Maria. This is a big help for me. Taking your advice, here is what I did:

velveth auto 25 -fastq -shortPaired1 myRef_shuffled.fq
velvetg auto_25 -exp_cov auto -ins_length1 150
I get
Final graph has 2 nodes and n50 of 1, max 1, total 2, using 0/10148414 reads

However, I get an empty contigs.fa.

Now when I try k-mer 21,
velveth auto 21,25,3 -fastq -shortPaired1 myRef_shuffled.fq
velvetg auto_21 -exp_cov auto -ins_length1 150

I get some contigs; however, they do not align with my reference sequence (500 bases).

Do you see anything wrong with what I am doing? I should get some contigs that align with the reference 100%.
05-09-2013, 12:22 PM   #17
mastal (UK)

The assembly with k=25 won't give you any contigs because velvet has a default minimum contig size, and contigs smaller than that won't be reported in the contigs.fa file.

You should probably try a wider range of k-mer sizes initially: just run velvetg without any parameters other than the directory with the velveth results, then do more runs of velvetg using whichever k-mer length or lengths look more promising. Hopefully, if you manage to get a better assembly, you will get something that aligns to your reference.
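A small sketch of that first-pass sweep (directory names assume velveth's usual <prefix>_<k> naming; adjust to whatever your velveth run produced):
Code:
# Quick first pass: default velvetg on every velveth output directory,
# capturing the summary line so the different k values can be compared
for d in auto_*; do
    velvetg "$d" > "$d/velvetg_firstpass.log"
done
grep -H "Final graph" auto_*/velvetg_firstpass.log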
05-09-2013, 01:55 PM   #18
hakattack (IA, USA)

Thank you Mastal,

I have done what you suggested and I had better luck than before. In one case I got 100% when I tried k=19, but in another case I tried all of k = 15, 17, 19, 21, 23 and the best contig I got was 75%, but not more than that, so I gave up. I am assuming I should not expect better outcomes with such short reads - would you agree?

Thanks
Tags: coverage, genome assembly, velvet