  • Assembling bacterial genome de novo

    Hi
    I am assembling a bacterial genome de novo. I have about 40 million Illumina reads, 75bp. These were paired-end, but for reasons I won't go into, the insert sizes were ~150bp, so the pairs overlap.
    I have tried Velvet and MIRA to some extent. My best Velvet run (I used VelvetOptimiser) gave me 10,500 contigs with an N50 of 1,455, longest contig 14,500, total bases in contigs 4.7M, and 1,500 contigs greater than 1kb, which span ~3M of the genome. The expected genome size is ~4.7M.
    Any recommendations on the next step to get to a build with fewer contigs of larger sizes?

    Thanks
    Noa

  • #2
    Those statistics are pretty poor for a typical bacterial genome, so you should be looking to improve on that N50 dramatically.

    Use only a subset of those reads and aim for 30x-100x coverage (you could filter for the highest-quality reads).
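
    For example, with seqtk you could subsample both ends with the same random seed so the pairing stays intact (a minimal sketch; the file names and the 1-million-pair count are placeholders, and 1 million pairs of 75bp reads works out to about 2 x 75 x 1,000,000 / 4,700,000 = 32x coverage):

      seqtk sample -s100 reads_1.fastq 1000000 > sub_1.fastq  # -s sets the random seed
      seqtk sample -s100 reads_2.fastq 1000000 > sub_2.fastq  # the same seed keeps the pairs in sync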

    Too many reads will slow down your experiments and won't improve the assembly. In fact, too many reads can cause the assembly to fragment due to a build-up of erroneous k-mers.

    As your pairs are overlapping, try joining them together with a tool like FLASH
    (http://bioinformatics.oxfordjournals...tr507.abstract) and then assemble these longer merged reads. Try both an OLC assembler like MIRA and a de Bruijn assembler like Velvet; with the longer reads you can increase the k-mer size.
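
    A minimal FLASH run might look like this (a sketch; the file names are placeholders, and -M is raised to 75 because 75bp reads from ~150bp inserts can overlap almost end to end):

      flash reads_1.fastq reads_2.fastq -M 75 -o merged
      # the joined reads end up in merged.extendedFrags.fastq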



    • #3
      Thanks for your help. I know the statistics are poor; happy for any advice, as I am new to all this!

      I have been working on a subset of the reads. I am using VelvetOptimiser, so I scanned all possible k-mers from 19 to 31 for each of 250,000, 1 million, or 2 million reads, and I tried both single-end and paired-end options. The best so far was 2 million paired-end (those are the stats I quoted). Should I use even more reads? I have been trimming the reads from base #5 to 60, which gives good-looking boxplots with everything at a quality score of at least 20. Happy for suggestions on other ways to filter.
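
      For reference, my VelvetOptimiser runs look roughly like this (a sketch; the file name is a placeholder):

        VelvetOptimiser.pl -s 19 -e 31 -f '-shortPaired -fastq shuffled_2m.fastq'  # -s/-e set the k-mer range to scan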

      I had also originally tried to work with reads that had already been joined using SHERA, but I didn't seem to get significantly improved results. I may readdress that now that I am slightly more proficient with Velvet.

      I also tried MIRA yesterday for the first time. With 2 million trimmed reads (55bp) I got 22,000 contigs covering a consensus of 4.8M; the largest contig is 3,200bp and the N50 is 287, which is also not great for a bacterial genome. Happy for advice there too!



      • #4
        I have to say your results are curiously poor. 2 million reads should give you about 30x coverage (2,000,000 reads x 75bp = 150Mbp, and 150Mbp / 4.7Mbp is roughly 32x), which I would expect to give a much better result than you are getting.

        What happens if you just take the first 2 million reads from the (hopefully correctly shuffled: pair /1 followed by /2) file, don't filter at all, and pass them to velveth with -shortPaired and a k-mer of 31, then to velvetg with -exp_cov auto -cov_cutoff auto -ins_length 150?
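
        That is, something along these lines (a sketch; the output directory and file name are placeholders):

          velveth asm_k31 31 -shortPaired -fastq shuffled_2m.fastq
          velvetg asm_k31 -exp_cov auto -cov_cutoff auto -ins_length 150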

        Also are you absolutely sure you have got the correct insert length?



        • #5
          I agree they are much worse results than expected.
          I just rechecked the shuffling and it is /1 and then /2 as expected (I used the shuffleSequences Perl script that comes with Velvet for that).
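
          For the record, the command was along these lines (the file names here are placeholders):

            shuffleSequences_fastq.pl reads_1.fastq reads_2.fastq shuffled.fastq  # interleaves /1 and /2 into one file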

          There are 4 million reads in total (2 million from each end), just to make sure that is clear.

          I just tried with the unfiltered sequences (which are longer, but the quality gets really bad after about 90bp), running as you suggested, but the process got killed. I will try to find a computer with more memory... Is it worth trying with half the sequences?

          Insert size is definitely ~150; I just double-checked with the guy who did the actual experiment, and that is also what VelvetOptimiser reports at the end of the run when I don't give it that info.



          • #6
            I thought your reads were 75bp long? Not sure what you mean by quality dropping off after 90bp.

            Quality may well be the problem. Can you post a FastQC chart?
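
            If you haven't run it yet, something like this will produce the per-base quality chart (a sketch; the output directory is a placeholder and must already exist):

              fastqc reads_1.fastq reads_2.fastq -o qc/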



            • #7
              And yes, try with 1/2 the sequences.



              • #8
                Try k-mers above 31: we got good results from 37 to 47, with -shortPaired and -exp_cov auto. Also try the maximum number of reads your computer can handle; we got the best results with 15+15 million and 20+20 million reads, and fewer was worse.
                Last edited by vtosha; 02-29-2012, 03:56 AM.



                • #9
                  Sorry, here is a boxplot. The original reads were 150bp (144bp after barcode removal); I trimmed them down to 56bp (starting at bp #5 and ending at bp #60).
                  Attached Files
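
                  For anyone curious, that trim window corresponds to something like this with seqtk (a sketch; the file names are placeholders, and the offsets assume the 144bp post-barcode reads):

                    seqtk trimfq -b 4 -e 84 reads.fastq > trimmed.fastq  # drop 4bp from the 5' end and 84bp from the 3' end: 144 - 4 - 84 = 56bp, i.e. bases 5-60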



                  • #10
                    I've tried larger k-mers and I wasn't getting much better results. VelvetOptimiser is such a useful tool, and right now it only works up to 31-mers. (Not sure if I can reinstall it to do higher k-mers? I have to ask the guy who installed it...)
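
                    From what I have read, the 31-mer ceiling comes from how the Velvet binaries themselves were compiled, so lifting it would mean rebuilding Velvet with a larger MAXKMERLENGTH (a sketch, assuming a source install of Velvet):

                      cd velvet/
                      make 'MAXKMERLENGTH=63'  # rebuilds velveth/velvetg to accept k-mers up to 63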



                    • #11
                      We scan them with a Perl script written by our bioinformatician. As I understand it, it is not too difficult for an IT person to write such a script to auto-screen Velvet's parameters.



                      • #12
                        Thanks; we don't have IT here to help, unfortunately. I am a wet-lab biologist learning the computational stuff on my own! (Actually enjoying it, but it is slow going!)
                        Just to make sure: you were doing 20 million x2 reads on what size genome?



                        • #13
                          Nick,
                          I still can't run with half the sequences. I will try to get hold of a stronger computer, but that has to wait until our sysadmin is back tomorrow. Any insights or things I can try in the meanwhile? Thanks so much for your help!



                          • #14
                            I just tried running as a 41-mer with the PE 2 million x2 reads, and I am getting an N50 of 243, max contig 5,556, and a total of 3.6M of genome, so that doesn't look too promising. (Maybe not surprising, since the entire read I am using is 56bp after trimming, so 41 may be too large a k-mer: each 56bp read yields only 56 - 41 + 1 = 16 k-mers, so the k-mer coverage drops sharply.)



                            • #15
                              Some bacterial genomes, near 5.5M, with 72nt paired-end reads; N50 from 100,000 to 170,000. Not the best possible, but our best. For one chloroplast genome, though, we couldn't get any long contigs despite excellent sequencing results, so the problem may lie in the library preparation or the DNA itself.
                              Yes, there may be an optimal k-mer for your own data, and longer k-mers are not necessarily better.
                              Last edited by vtosha; 02-29-2012, 04:19 AM.

