  • Percentage of usable data per lane

    Hi all
    I'm planning a project to sequence multiple bacterial strains and am trying to calculate how many samples I can multiplex per lane while getting sufficient coverage to accurately determine polymorphisms. According to the VAAL paper from the Broad (http://www.nature.com/nmeth/journal/...meth.1286.html), 1 lane of 36 bp reads for Staph aureus gave ~53x Q20 coverage (table S1). I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any other similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?

  • #2
    Originally posted by greigite
    Hi all
    [...] I calculate that this is about 64% of the total sequence produced for the lane. I haven't come across any other similar calculations in the literature. What is your experience with the total amount of sequencing needed to ensure 20-30x Q20 coverage?
    For me, 25%-30% sequence loss is the "absolute worst case"; it now happens only when a run was "bad" and I don't massage the data prior to mapping. My current rule of thumb is 15%-20% loss (including clipping) for bad cases.

    As an example of a good case, here are some numbers for a project that went well in terms of quality (5.8m reads, 40mers). Pre-assembly QC gave these numbers:
    • Num reads clipped left: 129k
    • Num reads clipped right: 385k
    • Reads completely clipped: 73k


    In assembly:
    • 5.6m reads mapped 100% to the reference
    • 134k mapped "with errors" (including SNPs)


    The theoretical maximum average coverage would have been 55.2x and the achieved coverage was 53.3x.

    That amounts to a loss of ~3.5%. Not too shabby.
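
    The loss figure above is simply the shortfall of achieved coverage relative to the theoretical maximum. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not part of any tool mentioned in this thread.

```python
def coverage_loss(theoretical_cov: float, achieved_cov: float) -> float:
    """Fraction of the theoretical coverage that was lost.

    Hypothetical helper for illustration only.
    """
    if theoretical_cov <= 0:
        raise ValueError("theoretical coverage must be positive")
    return 1.0 - achieved_cov / theoretical_cov


# Numbers from the post: 55.2x theoretical, 53.3x achieved.
loss = coverage_loss(55.2, 53.3)
print(f"loss: {loss:.1%}")
```

    With the numbers quoted above this comes out just under 3.5%.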

    However, the bigger problem for you is how consistent the raw yield from sequencing is: even good labs that normally deliver 7 to 9m 40mer reads per lane will every once in a while have a run that yields only 3 to 3.5m reads. Normally they'll redo it for you free of charge, but you have to account for that.
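
    To see how much the raw yield matters for multiplexing, here is a rough sketch of the capacity calculation. It assumes reads are split evenly across samples, and the genome size (2.8 Mb, roughly a Staph aureus genome), the 25x target and the 20% loss are illustrative assumptions rather than figures from a specific run; the function name is hypothetical.

```python
import math


def samples_per_lane(reads_per_lane: float, read_length: int,
                     genome_size: float, target_coverage: float,
                     loss_fraction: float) -> int:
    """Whole number of samples that fit in one lane at the target coverage.

    Assumes an even split of reads across samples (an idealization).
    """
    usable_bases = reads_per_lane * read_length * (1.0 - loss_fraction)
    bases_per_sample = target_coverage * genome_size
    return math.floor(usable_bases / bases_per_sample)


GENOME_SIZE = 2.8e6  # assumed genome size in bases, for illustration only

# A typical lane versus an occasional poor run (40mers, 20% loss, 25x target).
typical = samples_per_lane(7e6, 40, GENOME_SIZE, 25, 0.20)
poor = samples_per_lane(3e6, 40, GENOME_SIZE, 25, 0.20)
print(typical, poor)
```

    Under these assumptions a normal lane holds three samples while a poor run holds only one, which is why planning around the typical yield alone is risky.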

    B.

    PS: 53x is way too much coverage


    • #3
      Thanks, BaCh, that is very helpful. When you say 53x is way too much coverage, what would you consider sufficient but not excessive coverage for your purposes?


      • #4
        Incidentally, the coverage also depends a bit on the length of the reads. For 36 and 40mers I've run experiments by starting at ~35x and reducing down to 15x:
        • everything above 30x and minimal tidying in an editor gets you everything
        • at 25x, I haven't found a case where I would've missed a SNP, but sometimes coverage started to be thin. When using non-paired reads, insertions start to be hard to locate, as do the exact end points of genome duplications
        • at 20x, some spots with extremely thin coverage and a couple of holes; true SNPs sometimes covered by only 3 or 4 sequences (granted: frameshifts in homopolymers, difficult to see anyway)
        • at 15x, there were definitely SNPs lost and multiple regions in the genome that were not covered (sometimes only one base, sometimes a dozen or more)
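
        For a rough feel of why thin spots appear as average coverage drops, one can model per-base depth as Poisson-distributed around the mean coverage. This is an idealized assumption and not what was measured above: real libraries are usually less uniform, so the numbers below should be read as optimistic. The function names and the 4 Mb genome size are made up for illustration.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(depth <= k) when depth follows a Poisson distribution with this mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))


def expected_thin_positions(genome_size: float, mean_coverage: float,
                            max_depth: int) -> float:
    """Expected number of bases covered by at most max_depth reads."""
    return genome_size * poisson_cdf(max_depth, mean_coverage)


GENOME_SIZE = 4e6  # assumed genome size in bases, for illustration only

for cov in (30, 25, 20, 15):
    thin = expected_thin_positions(GENOME_SIZE, cov, 4)
    print(f"{cov}x: ~{thin:.1f} positions with depth <= 4")
```

        Even under this idealized model, the expected number of thinly covered positions grows quickly between 30x and 15x.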


        Please note that I do some hand editing on the assemblies and check everything not only by statistics, but also by visual inspection. Your mileage may vary. For 76mers, first results make me think I'll get away with a bit less coverage, but I haven't checked thoroughly yet.

        Please also have a look at this paper from the Sanger Centre in Nature from Nov. last year: http://www.ncbi.nlm.nih.gov/pubmed/18987734
        It's a pretty good read and, if I remember correctly, the basic conclusions on coverage are comparable.
