  • swatie
    Junior Member
    • Jan 2018
    • 1

    How to evaluate the Polymerase read and subread statistics from the raw data?

    Hi

    How can I check, before assembly, that the PacBio data provided by my service provider is good? What I have received from the service provider is the raw data and two types of statistics:
    Polymerase read statistics
    Subread statistics

    Thanks
    Last edited by swatie; 01-10-2018, 06:36 PM. Reason: Want to add type of data
  • rhall
    Senior Member
    • Aug 2012
    • 324

    #2
    The subread statistics are the important numbers for assembly. You should have enough coverage in the subread bases, and a good subread mean / N50 read length.

    The subread stats will depend on the library quality (how much long DNA was in the sample) as well as on general sequencing variables.
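
As a minimal sketch of the two length statistics mentioned above, the following Python computes the mean and N50 from a list of subread lengths. The function names and the toy lengths are hypothetical and for illustration only; in practice the lengths would come from the subread files delivered with the run.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean read length."""
    return sum(lengths) / len(lengths)


# Hypothetical toy subread lengths, not real data.
toy_lengths = [2000, 3000, 4000, 5000, 6000]
print("N50:", n50(toy_lengths))
print("mean:", mean_length(toy_lengths))
```

Because N50 is weighted by bases rather than by read count, it is at least as large as the median read length and is less affected by a large number of very short subreads than the mean is.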


    • swatisinha
      Junior Member
      • Jan 2018
      • 1

      #3
      Hi, Thank you for the reply.
      So the subread statistics are:
      Cell           Subread bases  Subreads  Subread N50  Avg. subread length
      sample1_cell1  1,420,065,796  163,879   12,166       8,665
      sample1_cell2  1,314,563,766  150,564   12,399       8,730

      So the average subread length / subread N50 is 8,665/12,166 (0.71) for cell1
      and 8,730/12,399 (0.70) for cell2.
      Are these good? Is there a threshold for deciding whether these values are good or bad? (The average genome size of such strains is 36 Mb.)
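
A rough way to check the "enough coverage" criterion is to divide the subread bases by the expected genome size. The sketch below uses only the numbers quoted in this post (two cells, about 36 Mb genome); the variable and function names are hypothetical, and the result is a raw-coverage estimate, not a verdict on data quality.

```python
GENOME_SIZE_BP = 36_000_000  # ~36 Mb, the genome size quoted in the post

# Subread statistics as quoted in the post.
cells = {
    "sample1_cell1": {"bases": 1_420_065_796, "n50": 12_166, "mean": 8_665},
    "sample1_cell2": {"bases": 1_314_563_766, "n50": 12_399, "mean": 8_730},
}


def coverage(total_bases, genome_size):
    """Raw fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


for name, stats in cells.items():
    cov = coverage(stats["bases"], GENOME_SIZE_BP)
    ratio = stats["mean"] / stats["n50"]
    print(f"{name}: {cov:.1f}x coverage, mean/N50 = {ratio:.2f}")

total_bases = sum(stats["bases"] for stats in cells.values())
total_coverage = coverage(total_bases, GENOME_SIZE_BP)
print(f"combined: {total_coverage:.1f}x coverage")
```

With these inputs the two cells together come to roughly 76x raw subread coverage of a 36 Mb genome.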

      In addition, I ran the RS_Subreads protocol in SMRT Portal on the raw data. The loading P1 value for cell1 is 77.45% (P0 is 11.06% and P2 is 11.49%) and for cell2 it is 73.59% (P0 is 9.48% and P2 is 16.94%). My understanding is that a good P1 is between 30% and 40%, and that a P1 above 40% gives a large number of shorter contigs. Is that right?

      (The chemistry used was P6-C4.)


      • rhall
        Senior Member
        • Aug 2012
        • 324

        #4
        The P1 is high, probably a sign of overloading. This results in low-quality data and limited subread lengths.
        It's always worth assembling the data, even if the raw data isn't of the absolute highest quality. Simply run HGAP with 36Mb as the estimated genome size.
        With the preassembly stats and assembly results it should be possible to estimate how much the data quality has affected the results.


        • am33567
          Junior Member
          • Mar 2018
          • 1

          #5
          Dear Dr. Hall,
          We have a similar problem to swatisinha's. We got Sequel data for a de novo genome assembly; in total we have 30 Gb of subreads with a mean size of 8.7 kb. Our lab outsourced the Falcon assembly to a bioinformatics service company, but after trying all the parameters the assembly is still not good, with an N50 of around 70 kb. We have discarded heterozygosity and contamination. We believe our raw data is flawed and want to test this. Would you, by any chance, have any advice on how to measure the quality of our reads?

          Here are some of the stats from the sequencing facility. Note that well A01 was sequenced on a different run from the rest.
          Samples -- Yield (GB) -- Pol_RDL_mean -- sub-RDL mean -- ZMW_0 % -- ZMW_1 % -- ZMW_2 % -- Pippin cutoff
          Well A01 -- 2.55 -- 13,144 -- 10,763 -- 70.32 -- 19.44 -- 10.24 -- 15kb
          Well B01 -- 6.08 -- 8,585 -- 7,640 -- 13.10 -- 69.40 -- 17.50 -- 15kb
          Well C01 -- 7.13 -- 9,759 -- 8,372 -- 13.70 -- 71.70 -- 14.60 -- 15kb
          Well D01 -- 7.55 -- 10,546 -- 8,873 -- 17.40 -- 70.20 -- 12.40 -- 15kb
          Well E01 -- 6.87 -- 9,046 -- 7,908 -- 11.30 -- 74.50 -- 14.3 -- 15kb

          Thanks so much for any insight you may have.
          Last edited by am33567; 03-19-2018, 06:07 PM. Reason: Spaces


          • nucacidhunter
            Jafar Jabbari
            • Jan 2013
            • 1250

            #6
            Assembly stats are not an indicator of sequence quality, as they depend on input DNA heterozygosity, genome composition, assembly software and coverage, among other factors.

            Well A01 is underloaded and the others are overloaded, but your sequencing QC looks average for Sequel. In PacBio systems, read length and yield generally have an inverse correlation. If the wells had been loaded according to the PacBio recommendation of P1 ~35-45%, you would get around 1 kb longer subreads for wells B-E at the expense of yield, and around 1 kb shorter subreads for A01 with increased yield.
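
The loading assessment above can be sketched as a simple check of each well's P1 against the 35-45% range quoted in this reply. This is a minimal illustration, not a PacBio tool: `classify_loading` is a hypothetical helper, it assumes the ZMW_0 % and ZMW_1 % columns in the table correspond to P0 and P1, and treating "low P1 with P0 larger than P1" as underloading is a simplifying assumption.

```python
P1_LOW, P1_HIGH = 35.0, 45.0  # percent; recommended P1 range quoted in the reply


def classify_loading(p0, p1, low=P1_LOW, high=P1_HIGH):
    """Label a well by its P1 (singly loaded ZMW) percentage.

    Hypothetical helper; the under/overloading rule is a simplification.
    """
    if p1 > high:
        return "overloaded"
    if p1 < low:
        # Low P1 with mostly empty ZMWs (high P0) suggests underloading.
        return "underloaded" if p0 > p1 else "low P1, check P0/P2"
    return "within range"


# (P0 %, P1 %) per well, taken from the ZMW_0 % and ZMW_1 % columns above.
wells = {
    "A01": (70.32, 19.44),
    "B01": (13.10, 69.40),
    "C01": (13.70, 71.70),
    "D01": (17.40, 70.20),
    "E01": (11.30, 74.50),
}

labels = {well: classify_loading(p0, p1) for well, (p0, p1) in wells.items()}
for well, label in labels.items():
    print(well, label)
```

On the table's values this labels A01 as underloaded and B01 to E01 as overloaded, matching the reading given in the reply.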


            • rhall
              Senior Member
              • Aug 2012
              • 324

              #7
              The data looks a little overloaded, but for Sequel these numbers do not look unreasonable.
              The mean subread size does look low, but given a 15kb size selection, it's within expectations. Were there issues with the initial DNA quality? For de novo assembly a much higher size selection would be recommended.
              How did you "discarded heterozygosity and contamination"?
              The first place I would be looking in this case is the assembly, not the raw data. What do the assembly parameters and stats look like: preassembled seed cutoff, yield, N50, coverage, draft assembly size, repeat content, etc.?
