SEQanswers

Old 03-24-2009, 05:08 PM   #1
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Percentage of usable data per lane

Hi all,
I'm planning a project to sequence multiple bacterial strains and am trying to calculate how many samples I can multiplex per lane while still getting enough coverage to call polymorphisms accurately. According to the VAAL paper from the Broad (http://www.nature.com/nmeth/journal/...meth.1286.html), 1 lane of 36 bp reads for Staphylococcus aureus gave ~53x Q20 coverage (Table S1). I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?
Old 03-25-2009, 03:31 AM   #2
BaCh
Member
 
Location: Germany

Join Date: May 2008
Posts: 79

Quote:
Originally Posted by greigite View Post
Hi all
[...] I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any other similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?

A sequence loss of 25%-30% is the "absolute worst case" for me; it now happens only for runs that were "bad" and only if I don't massage the data prior to mapping. My rule of thumb at the moment is 15%-20% loss (including clipping) for bad cases.

As an example of a good case, here are some numbers from a project that went well in terms of quality (5.8m reads, 40-mers). Pre-assembly QC gave these numbers:
  • Num reads clipped left: 129k
  • Num reads clipped right: 385k
  • Reads completely clipped: 73k

In assembly:
  • 5.6m reads mapped 100% to the reference
  • 134k mapped "with errors" (including SNPs)

The theoretical maximum average coverage would have been 55.2x and the achieved coverage was 53.3x.

That is a loss of ~3.5%. Not too shabby.
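The loss figure follows directly from the two coverage values. A minimal sketch of that arithmetic, assuming loss is defined as the missing fraction of the theoretical maximum coverage (the function name `coverage_loss` is made up for illustration):

```python
def coverage_loss(theoretical_cov: float, achieved_cov: float) -> float:
    """Return the fraction of theoretical coverage that was lost.

    Assumption: loss = (theoretical - achieved) / theoretical.
    """
    if theoretical_cov <= 0:
        raise ValueError("theoretical coverage must be positive")
    return (theoretical_cov - achieved_cov) / theoretical_cov


# Numbers quoted in the post: 55.2x theoretical, 53.3x achieved.
loss = coverage_loss(55.2, 53.3)
print(f"Coverage loss: {loss:.1%}")  # prints "Coverage loss: 3.4%"
```

With the rounded coverage values from the post this gives about 3.4%, in line with the ~3.5% quoted above.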

However, the bigger problem for you is how consistent the raw yield from sequencing is: even good labs that normally deliver 7 to 9m 40-mer reads per lane will every once in a while have a run that yields only 3 to 3.5m reads. Normally they'll redo it for you free of charge, but you have to account for that.
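To see how much that yield variation matters for multiplexing, here is a rough planning sketch. The read counts, the 40 bp read length and the 20% loss come from the numbers above; the 3 Mb genome size and the 30x target are placeholder assumptions for illustration, and `samples_per_lane` is a hypothetical helper, not part of any tool.

```python
def samples_per_lane(reads_per_lane: int, read_length: int,
                     loss_fraction: float, genome_size_bp: int,
                     target_coverage: float) -> int:
    """Estimate how many equally pooled samples one lane can support.

    Assumes usable bases are split evenly across samples, which real
    multiplexed runs only approximate.
    """
    if not 0 <= loss_fraction < 1:
        raise ValueError("loss_fraction must be in [0, 1)")
    if genome_size_bp <= 0 or target_coverage <= 0:
        raise ValueError("genome size and target coverage must be positive")
    usable_bases = reads_per_lane * read_length * (1.0 - loss_fraction)
    bases_needed_per_sample = genome_size_bp * target_coverage
    return int(usable_bases // bases_needed_per_sample)


GENOME_SIZE_BP = 3_000_000  # placeholder assumption, not from the thread
TARGET_COVERAGE = 30        # placeholder assumption: upper end of 20-30x

# Poor, typical and good lane yields mentioned in the post (40 bp reads),
# using the 20% "bad case" loss as a conservative estimate.
for reads in (3_000_000, 7_000_000, 9_000_000):
    n = samples_per_lane(reads, 40, 0.20, GENOME_SIZE_BP, TARGET_COVERAGE)
    print(f"{reads:>9,} reads -> {n} sample(s) per lane")
```

Under these assumptions a poor run supports one sample, a typical run two and a good run three, so planning around the typical yield leaves no margin if a run comes in low.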

B.

PS: 53x is way too much coverage.
Old 03-25-2009, 09:51 AM   #3
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141

Thanks, BaCh, that is very helpful. When you say 53x is way too much coverage, what would you consider sufficient but not excessive coverage for your purposes?
Old 03-25-2009, 12:38 PM   #4
BaCh
Member
 
Location: Germany

Join Date: May 2008
Posts: 79

Incidentally, the coverage needed also depends a bit on the length of the reads. For 36- and 40-mers I've run experiments starting at ~35x and reducing coverage down to 15x:
  • anything above 30x, plus minimal tidying in an editor, gets you everything
  • at 25x, I haven't found a case where I would've missed a SNP, but coverage sometimes started to get thin. With non-paired reads, insertions start to be hard to locate, as do the exact end points of genome duplications
  • at 20x there were some spots with extremely thin coverage and a couple of holes, and true SNPs were sometimes covered by only 3 or 4 sequences (granted: frameshifts in homopolymers, difficult to see anyway)
  • at 15x SNPs were definitely lost and multiple regions of the genome were not covered (sometimes only one base, sometimes a dozen or more)

Please note that I do some hand editing on the assemblies and check everything not only by statistics but also by visual inspection. Your mileage may vary. For 76-mers, first results suggest I'll get away with a bit less coverage, but I haven't checked thoroughly yet.

Please also have a look at this paper from the Sanger Centre in Nature from Nov. last year: http://www.ncbi.nlm.nih.gov/pubmed/18987734
It's a pretty good read and, if I remember correctly, the basic conclusions on coverage are comparable.