  • TomHarrop
    Member
    • Jul 2014
    • 20

    #16
    Hi Brian,

    Thanks for this tool.

    What is the perfect_prob column in the results? I can't see it in the docs. Is it the "probability of correctness" for k-mers (reads?) in that bin based on avg_quality?

    Also, if your percentage uniqueness for read 1 is only approaching 60% after ~250 M reads, would you keep sequencing?

    Cheers,

    Tom

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #17
      Hi Tom,

      perfect_prob is the average probability of a read being error-free within that interval. It's related to the avg_quality, but calculated independently. Possibly, it would make more sense for me to do this just for the kmer being used to track uniqueness rather than the whole read, but it's easiest this way. The reason I provide it is because low-quality regions in the fastq file will show inflated uniqueness, when uniqueness is tracked using this method.
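For illustration, here is a minimal Python sketch of how an "average probability of a read being error-free" could be derived from Phred scores. It is not the tool's actual code. The function names, the Phred+33 encoding, and the use of a plain arithmetic mean for the average quality are all assumptions made for this example.

```python
def phred33_to_quals(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def perfect_prob(quals):
    """Probability that every base in one read is correct.

    Each Phred score Q implies an error probability of 10^(-Q/10),
    so the read is error-free with probability prod(1 - 10^(-Q/10)).
    """
    p = 1.0
    for q in quals:
        p *= 1.0 - 10 ** (-q / 10.0)
    return p


def interval_summary(qual_strings):
    """Summarise one interval (bin) of reads.

    Returns (avg_quality, avg_perfect_prob). The arithmetic mean of the
    Phred scores is an assumption; the real tool may average differently.
    """
    reads = [phred33_to_quals(s) for s in qual_strings]
    n_bases = sum(len(r) for r in reads)
    avg_quality = sum(sum(r) for r in reads) / n_bases
    avg_perfect = sum(perfect_prob(r) for r in reads) / len(reads)
    return avg_quality, avg_perfect
```

Because the per-read probability is a product over all bases, a handful of low-quality bases can pull it down sharply even when the mean quality barely moves. That is one way the two columns can be related yet calculated independently.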

      It looks like you're down to about 70% uniqueness for each individual read, which would be at least ~100x coverage for 150-bp reads... that coverage estimate is weighted by the high-coverage genomes, though.

      It's hard to say whether or not to sequence more based on this plot alone. You're obviously still generating more unique reads, but they might simply be giving more coverage to areas you can already assemble well. I think the best course is to assemble and see if you end up with a lot of short, low-coverage contigs (in addition to the high-coverage contigs that you will clearly generate)... in which case you do need to sequence more.


      • TomHarrop
        Member
        • Jul 2014
        • 20

        #18
        Great, thanks. How did you estimate the average coverage from the % uniqueness? I know I can do it more accurately with BBNorm (which says read depth median at 55x) but I'm curious how you did it from that plot. These are 100 b reads, but it's about 50 Gb of sequencing from a genome we are expecting to be ~600 Mbp, so you are not far off if coverage were even.
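The "if coverage were even" figure in this post is just total sequenced bases divided by genome size. A minimal sketch of that arithmetic, assuming "50 Gb" means 50 gigabases and taking the ~600 Mbp genome size at face value (the function name is made up for this example):

```python
def expected_coverage(total_bases, genome_size):
    """Mean depth if reads were spread evenly over the genome."""
    return total_bases / genome_size


# 50 Gb of sequence over an expected ~600 Mbp genome
print(round(expected_coverage(50e9, 600e6), 1))
```

That comes to roughly 83x under perfectly even coverage, which sits between the BBNorm median of 55x and the ~100x guess from the plot.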


        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #19
          Well, it's a very rough estimate, but...

          If 70% of reads are unique, then assuming an even distribution, 30% of the start sites are taken. Meaning there is one read for every 1/0.3 = 3.33 bases. For 150bp reads, that would indicate coverage of 150bp/3.33 = 45x. But since read 1 and read 2 are tracked independently, I doubled it to 90x. Then, since errors artificially inflate uniqueness calculation using this method, and given the % perfect profile, I guessed that maybe I should increase it by ~10%, so I arrived at ~100x coverage, but possibly more if the reads were lower-quality than they seemed based on the mapq.
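The back-of-envelope reasoning above can be written out as a short Python sketch. The function and parameter names are hypothetical, and the 10% error adjustment is simply the guess described in the post, not a calibrated constant.

```python
def rough_coverage(unique_fraction, read_length, mates=2, error_inflation=1.10):
    """Very rough coverage estimate from per-read uniqueness.

    Assumes reads start at evenly distributed positions, so the
    non-unique fraction approximates the share of start sites in use.
    """
    occupied = 1.0 - unique_fraction             # e.g. 0.30 of start sites taken
    bases_per_read = 1.0 / occupied              # one read every ~3.33 bases
    single_mate = read_length / bases_per_read   # coverage from one mate
    # Read 1 and read 2 are tracked independently, so scale by the number
    # of mates, then bump the result for error-inflated uniqueness.
    return single_mate * mates * error_inflation


for length in (150, 100):
    print(length, round(rough_coverage(0.70, length)))
```

With 70% uniqueness this gives about 99x for 150 bp reads and about 66x for 100 bp reads. It is only as good as the even-distribution assumption behind it.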

          But, those estimates were based on 150bp reads... for 100bp reads the estimate would have been 66x+, which is not too far off from 55x. I initially thought this was a metagenome because of the sharp decrease in uniqueness at the very beginning of the file, but perhaps you just have a highly repetitive genome, or lots of duplicate reads. Was this library PCR-amplified? And did you trim adapters and remove phiX (if you spiked it in) prior to running the program? Also, is this a Nextera library; or, what method did you use for fragmentation? It's unusual for a PCR-free isolate to have such a sharp decrease in uniqueness at the beginning; that indicates there is some sequence that is extremely abundant in the library. Notably, the drop is not present in the paired uniqueness, which is completely linear. I'm not entirely sure what this means.

          At any rate, for an isolate, it looks like you've sequenced enough (for a diploid/haploid). Sometimes you can get a better assembly with more coverage, though, up to around 100x. And you certainly can't beat longer reads!
          Last edited by Brian Bushnell; 04-25-2017, 06:39 PM.


          • TomHarrop
            Member
            • Jul 2014
            • 20

            #20
            Thanks for the suggestions.

            It's a TruSeq PCR-free library with an insert size around 470 bp according to the BioAnalyser. I did remove adaptors and contaminants with BBDuk2 (adapters.fa and phix174_ill files that ship with bbtools) but that only removed 0.05% of bases, so maybe I should look again at the adaptor sequences.

            It's a diploid insect. I extracted the DNA from a single, whole individual so there may be some [gut] flora in there... or yes, a repetitive genome, but let's hope not.

            PacBio is too expensive for this project and we can't get enough DNA, but I'm looking into getting a MinION for gap closing.


            • callumjcparr
              Junior Member
              • Feb 2019
              • 1

              #21
              Would this tool be suitable for understanding saturation for long reads (i.e. PacBio, ONT)? Or is there some other tool?

              Right now I am subsampling various numbers of reads and mapping them to see the tail-off in the number of unique genes discovered. But of course this takes time, and your k-mer uniqueness approach looks quick.
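As a rough picture of why a k-mer uniqueness pass is quick, here is a minimal Python sketch of the general idea: take one k-mer per read, check whether it has been seen before, and report the percentage of new k-mers per interval of reads. It is not the tool's implementation. The function name, the choice of the leading k-mer, and the default values of `k` and `interval` are assumptions for illustration, and reverse complements are ignored.

```python
def uniqueness_curve(reads, k=25, interval=25000):
    """Percent of reads per interval whose leading k-mer is new.

    `reads` is any iterable of sequence strings. Returns a list of
    (reads_processed, percent_unique_in_interval) tuples.
    """
    seen = set()           # every leading k-mer observed so far
    rows = []
    unique_in_bin = 0
    counted_in_bin = 0
    for i, seq in enumerate(reads, start=1):
        if len(seq) >= k:  # skip reads too short to yield a k-mer
            kmer = seq[:k]
            counted_in_bin += 1
            if kmer not in seen:
                seen.add(kmer)
                unique_in_bin += 1
        if i % interval == 0:
            pct = 100.0 * unique_in_bin / counted_in_bin if counted_in_bin else 0.0
            rows.append((i, pct))
            unique_in_bin = counted_in_bin = 0
    return rows
```

This is a single pass with one set lookup per read, so no mapping step is needed. Note the caveat raised earlier in the thread: sequencing errors inflate uniqueness with this kind of method, so error-prone reads would probably look more unique than they really are.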
