  • Percentage of usable data per lane

    Hi all
    I'm planning a project to sequence multiple bacterial strains and am trying to calculate how many samples I can multiplex per lane while getting sufficient coverage to accurately determine polymorphisms. According to the VAAL paper from the Broad (http://www.nature.com/nmeth/journal/...meth.1286.html), 1 lane of 36 bp reads for Staph aureus gave ~53x Q20 coverage (table S1). I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any other similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?
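A rough way to turn the numbers above into a planning figure, as a minimal sketch. It assumes the ~64% usable fraction carries over to your own runs and that each strain has roughly the same genome size as the one behind the 53x figure; the function names are illustrative and not taken from any library.

```python
import math


def raw_coverage_needed(target_q20_coverage, usable_fraction):
    """Raw coverage required so that the usable (Q20) share reaches the target."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return target_q20_coverage / usable_fraction


def samples_per_lane(lane_raw_coverage, target_q20_coverage, usable_fraction):
    """Whole samples that fit in one lane at the requested Q20 coverage."""
    needed = raw_coverage_needed(target_q20_coverage, usable_fraction)
    return math.floor(lane_raw_coverage / needed)


USABLE_FRACTION = 0.64   # share of lane output that was Q20 coverage (from the post)
LANE_Q20_COVERAGE = 53.0  # Q20 coverage reported for one lane (from the post)

# Raw coverage of the whole lane implied by those two numbers.
lane_raw = LANE_Q20_COVERAGE / USABLE_FRACTION

for target in (20, 30):
    needed = raw_coverage_needed(target, USABLE_FRACTION)
    n = samples_per_lane(lane_raw, target, USABLE_FRACTION)
    print(f"{target}x Q20 needs ~{needed:.1f}x raw; {n} sample(s) per lane")
```

With these inputs the sketch suggests two samples per lane at 20x and one at 30x, before allowing for any run-to-run variation in yield.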

  • #2
    Originally posted by greigite
    Hi all
    [...] I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any other similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?
    A 25%-30% sequence loss is the "absolute worst case" for me, and these days it only happens with "bad" runs when I don't massage the data prior to mapping. My current rule of thumb is 15%-20% loss (including clipping) for bad cases.

    As an example of a good case, here are some numbers for a project that went well in terms of quality (5.8m reads, 40mers). Pre-assembly QC gave these numbers:
    • Num reads clipped left: 129k
    • Num reads clipped right: 385k
    • Reads completely clipped: 73k


    In assembly:
    • 5.6m reads mapped 100% to the reference
    • 134k mapped "with errors" (including SNPs)


    The theoretical maximum average coverage would have been 55.2x and the achieved coverage was 53.3x.

    That works out to a ~3.5% loss. Not too shabby.
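The loss figure can be reproduced from the two coverage values; with the rounded numbers given here it comes out at about 3.4%, in line with the ~3.5% quoted. A small sketch follows. The implied genome size assumes the theoretical maximum was computed as reads × read length / genome size, which the post does not state, and the read counts are the rounded ones from the post.

```python
def coverage_loss(theoretical, achieved):
    """Fraction of the theoretical average coverage that was not achieved."""
    return 1.0 - achieved / theoretical


READS = 5.8e6          # total reads (rounded, from the post)
READ_LENGTH = 40       # 40mers
THEORETICAL = 55.2     # theoretical maximum average coverage
ACHIEVED = 53.3        # achieved average coverage

loss = coverage_loss(THEORETICAL, ACHIEVED)
print(f"coverage loss: {loss:.1%}")  # coverage loss: 3.4%

# Assumption: theoretical coverage = total bases / genome size.
implied_genome = READS * READ_LENGTH / THEORETICAL
print(f"implied genome size: {implied_genome / 1e6:.1f} Mb")  # implied genome size: 4.2 Mb

# Reads placed on the reference: perfect matches plus those "with errors".
mapped = 5.6e6 + 134e3
print(f"reads placed: {mapped / READS:.1%}")  # reads placed: 98.9%
```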

    However, the bigger problem for you is how consistent the raw read numbers from sequencing are: even good labs that normally deliver 7 to 9m 40mer reads per lane will every once in a while have a run that yields only 3 to 3.5m reads. Normally they'll redo it for you free of charge, but you have to account for that.
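To see what that variability means for multiplexing, here is a hedged sketch that budgets for a typical and a poor run. The 4.2 Mb genome size and the 25x target are placeholder assumptions for illustration only (substitute your own strain and target); the read counts and the 20% loss come from the figures above.

```python
import math


def samples_per_lane(reads_per_lane, read_length, genome_size,
                     target_coverage, loss_fraction):
    """Whole samples one lane supports after discounting expected sequence loss."""
    usable_bases = reads_per_lane * read_length * (1.0 - loss_fraction)
    bases_per_sample = genome_size * target_coverage
    return math.floor(usable_bases / bases_per_sample)


GENOME_SIZE = 4_200_000   # assumption: placeholder genome size in bp
TARGET_COVERAGE = 25      # assumption: example target coverage
LOSS = 0.20               # upper end of the 15%-20% rule of thumb

scenarios = {
    "typical run (7m reads)": 7_000_000,
    "poor run (3m reads)": 3_000_000,
}

for label, reads in scenarios.items():
    n = samples_per_lane(reads, 40, GENOME_SIZE, TARGET_COVERAGE, LOSS)
    print(f"{label}: {n} sample(s) at {TARGET_COVERAGE}x")
```

Under these assumptions a typical run supports two samples, while a poor run does not reach the target for even one, which is why the occasional low-yield lane has to be planned for.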

    B.

    PS: 53x is way too much coverage



    • #3
      Thanks, BaCh, that is very helpful. When you say 53x is way too much coverage, what would you consider sufficient but not excessive coverage for your purposes?



      • #4
        Incidentally, the coverage you need also depends a bit on the length of the reads. For 36mers and 40mers I ran experiments starting at ~35x and reducing down to 15x:
        • with anything above 30x plus minimal tidying in an editor, you get everything
        • at 25x, I haven't found a case where I would have missed a SNP, but coverage sometimes started to get thin. With non-paired reads, insertions start to be hard to locate, as do the exact end points of genome duplications
        • at 20x, there were some spots with extremely thin coverage and a couple of holes; true SNPs were sometimes covered by only 3 or 4 sequences (granted: frameshifts in homopolymers, which are difficult to see anyway)
        • at 15x, SNPs were definitely lost and multiple regions of the genome were not covered (sometimes only one base, sometimes a dozen or more)
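For comparison, a back-of-the-envelope model of thin coverage, as a sketch only. It swaps in the textbook idealization that per-base depth follows a Poisson distribution with mean equal to the average coverage; this is not how the observations above were obtained. The 4.2 Mb genome size is a placeholder assumption.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        total += term
    return total


GENOME_SIZE = 4_200_000  # assumption: placeholder genome size in bp
THIN_DEPTH = 4           # "covered only by 3 or 4 sequences"

for coverage in (15, 20, 25, 30):
    p_thin = poisson_cdf(THIN_DEPTH, coverage)
    p_zero = poisson_cdf(0, coverage)
    print(f"{coverage}x: ~{p_thin * GENOME_SIZE:.1f} bases at depth <= {THIN_DEPTH}, "
          f"~{p_zero * GENOME_SIZE:.4f} uncovered bases expected")
```

The idealized model is considerably more optimistic than the results listed above, which is consistent with real coverage being less even than a Poisson model assumes. Treat it as a lower bound on trouble, not as a prediction.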


        Please note that I do some hand editing on the assemblies and check everything not only by statistics but also by visual inspection. Your mileage may vary. For 76mers, first results make me think I'll get away with a bit less coverage, but I haven't checked thoroughly yet.

        Please also have a look at this paper from the Sanger Centre, published in Nature in Nov. last year: http://www.ncbi.nlm.nih.gov/pubmed/18987734
        It's a pretty good read and, if I remember correctly, the basic conclusions on coverage are comparable.
