**BaCh** · 03-25-2009, 02:31 AM

Originally posted by greigite:

Hi all
[...] I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any other similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?

25%-30% sequence loss is the "absolute worst case" for me, and it now happens only for runs that were "bad" and only if I don't massage the data prior to mapping. My rule of thumb at the moment is 15%-20% loss (including clipping) for bad cases.

As an example of a good case, here are some numbers for a project that went well in terms of quality (5.8m reads, 40mers). Pre-assembly QC gave these numbers:

Num reads clipped left: 129k
Num reads clipped right: 385k
Reads completely clipped: 73k

In assembly:

5.6m reads mapped 100% to the reference
134k mapped "with errors" (including SNPs)

The theoretical maximum average coverage would have been 55.2x and the achieved coverage was 53.3x.

That amounts to a ~3.5% loss. Not too shabby.
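
The loss figure can be checked directly from the two coverage values quoted above. This is only a minimal sketch; the function name `coverage_loss` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def coverage_loss(theoretical_cov: float, achieved_cov: float) -> float:
    """Fraction of the theoretical average coverage that was lost."""
    if theoretical_cov <= 0:
        raise ValueError("theoretical coverage must be positive")
    return 1.0 - achieved_cov / theoretical_cov


# Values from the post: 55.2x theoretical maximum, 53.3x achieved.
loss = coverage_loss(55.2, 53.3)
print(f"{loss:.1%}")  # 3.4%
```

The result is in line with the rough ~3.5% quoted above.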

However, the bigger problem for you is the consistency of what you get from sequencing in terms of raw numbers: even good labs that normally deliver 7 to 9m 40mer reads per lane every once in a while have a run that yields only 3 to 3.5m reads. Normally they'll redo it for you free of charge, but you have to account for that.
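
To budget for both the loss and the variation in lane yield, one can work backwards from the target coverage. The sketch below is an assumption-laden illustration, not a recommendation: the function name `lanes_needed` and the 4.2 Mb genome size are hypothetical (no genome size is given in this thread), while the 20% loss and the 3m/7m reads per lane are the "bad case" and lane-yield figures mentioned above.

```python
import math


def lanes_needed(genome_size_bp: int, target_coverage: float, read_length: int,
                 reads_per_lane: int, loss_fraction: float) -> int:
    """Lanes required to reach target_coverage after discounting expected loss."""
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    # Bases per lane that are expected to survive clipping and mapping.
    usable_bases_per_lane = reads_per_lane * read_length * (1.0 - loss_fraction)
    required_bases = genome_size_bp * target_coverage
    return math.ceil(required_bases / usable_bases_per_lane)


GENOME_SIZE_BP = 4_200_000  # hypothetical genome size, for illustration only

# 30x target, 40mers, 20% loss: compare a poor lane (3m reads) with a normal one (7m reads).
for reads in (3_000_000, 7_000_000):
    print(reads, lanes_needed(GENOME_SIZE_BP, 30, 40, reads, 0.20))
```

Under these assumptions a normal lane is enough on its own, while a 3m-read lane would need a second one, which is why the occasional low-yield run matters more for planning than the percentage loss.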

B.

PS: 53x is way too much coverage

**greigite** · 03-25-2009, 08:51 AM

Thanks, BaCh, that is very helpful. When you say 53x is way too much coverage, what would you consider sufficient but not excessive coverage for your purposes?

**BaCh** · 03-25-2009, 11:38 AM

Incidentally, the coverage also depends a bit on the length of the reads. For 36 and 40mers I've run experiments by starting at ~35x and reducing down to 15x:

everything above 30x and minimal tidying in an editor gets you everything
at 25x, I haven't found a case where I would've missed a SNP, but sometimes coverage started to be thin. With non-paired reads, insertions start to be hard to locate, as do the exact end points of genome duplications
at 20x some spots with extremely thin coverage and a couple of holes, true SNPs sometimes covered only by 3 or 4 sequences (granted: frameshifts in homopolymers, difficult to see anyway)
at 15x there were definitely SNPs lost and multiple regions in the genome that were not covered (sometimes only one base, sometimes a dozen or more)
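
For a rough feel of why thin spots appear as average coverage drops, one can use the textbook idealization that per-base depth follows a Poisson distribution around the mean. This is a swapped-in approximation, not the method behind the observations above, which come from actual assemblies; the function name `fraction_below` and the depth cutoff of 5 are illustrative assumptions. Real runs have coverage bias, so they tend to look worse than this model.

```python
import math


def fraction_below(mean_coverage: float, min_depth: int) -> float:
    """P(depth < min_depth) if per-base depth were Poisson(mean_coverage)."""
    term = math.exp(-mean_coverage)  # P(depth == 0)
    total = 0.0
    for k in range(min_depth):
        total += term                      # add P(depth == k)
        term *= mean_coverage / (k + 1)    # advance to P(depth == k + 1)
    return total


# Expected fraction of positions covered by fewer than 5 reads (idealized model).
for cov in (15, 20, 25, 30):
    print(cov, fraction_below(cov, 5))
```

Multiplying the fraction by the genome size gives an expected number of thinly covered positions under the idealized model.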

Please note that I do some hand editing on the assemblies and check everything not only by statistics, but also by visual inspection. Your mileage may vary. For 76mers, first results suggest I'll get away with a bit less coverage, but I haven't checked thoroughly yet.

Please also have a look at this paper from the Sanger Centre in Nature from Nov. last year: http://www.ncbi.nlm.nih.gov/pubmed/18987734
It's a pretty good read and, if I remember correctly, the basic conclusions on coverage are comparable.


Percentage of usable data per lane
