  • Is there any reason not to run RS_ReadsOfInsert?

    I've included some background info after the questions; the questions come first in case of TL;DR.

    Questions:
    1. Is it generally a best-practice to run CircularConsensus on SMRT cell DNAseq data before doing an analysis such as scaffolding or genome assembly?

    2. Under what circumstances would you not run CircularConsensus?


    Background:
    Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequencing from a PacBio RS II (P6-C4 chemistry, I think). We estimate this represents ~20x coverage of the plant's genome, implying a genome size of roughly 470 Mbp. All of the initial processing was done by the collaborators; their last step was filtering of the subreads, which have a post-filter N50 of around 8,000 bp.

    My Goal:
    I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.

    Results:
    I ran PBJelly on the uncorrected subreads, providing the Analysis_Results directories for each SMRT cell. The results seem very good: about half of the gaps were filled and the scaffold N50 increased by 20%.

    But I suspect that some of the filled gaps, especially those in repetitive regions, are not correct. When I looked at the subread placement over each gap (produced by PBJelly), I noticed that some gaps were filled, for example, by a minor proportion (N=2) of the total subreads (N=9) from a single ZMW. There were a few instances like this. It occurred to me that maybe it was a mistake to use the subreads rather than consensus sequences.

  • #2
    I would ALWAYS run consensus and use only Reads of Insert reads rather than filtered subreads, for any purpose.



    • #3
      There is no need to run circular consensus for de novo genome assembly or scaffolding: the power in the data is in the read length, not in the base accuracy, and calculating CCS drastically reduces the yield of long reads. For confident gap filling, the gap should be spanned by multiple subreads from different ZMWs; if all the subreads spanning a gap are from the same ZMW (molecule), then there is a chance it is a biological chimera and not a real closure.
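
      As a quick way to apply this check, here is a minimal Python sketch (not part of any PacBio or PBJelly tool; the read names are made up) that groups gap-supporting subreads by ZMW using the standard RS II naming convention {movie}/{holeNumber}/{qStart}_{qEnd}, so fills supported by only one distinct ZMW can be flagged:

      Code:
      # Group gap-supporting subread names by ZMW (movie + hole number).
      # RS II subreads are named {movie}/{holeNumber}/{qStart}_{qEnd}.
      from collections import defaultdict

      def zmws_supporting(read_names):
          by_zmw = defaultdict(list)
          for name in read_names:
              movie, hole, _coords = name.split("/")
              by_zmw[(movie, hole)].append(name)
          return by_zmw

      # Hypothetical supporting reads for one gap:
      support = zmws_supporting([
          "m000000_000000_00000_c000_s1_p0/54/0_7000",
          "m000000_000000_00000_c000_s1_p0/54/7050_14000",
          "m000000_000000_00000_c000_s1_p0/99/0_8000",
      ])
      print(len(support), "distinct ZMWs span this gap")
      if len(support) < 2:
          print("possible chimera: gap spanned by a single ZMW")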



      • #4
        Self-consensus will make the mapping faster and more accurate, and it simplifies the problem of detecting subreads that came from the same molecule (since they won't exist anymore). Long reads will not be lost if the Reads of Insert protocol is set to require no minimum number of passes.



        • #5
          Originally posted by Brian Bushnell View Post
          I would ALWAYS run consensus and use only Reads of Insert reads rather than filtered subreads, for any purpose.
          Thanks much for the reply. When you run CircularConsensus, what quality threshold (or parameters) do you typically use?



          • #6
            Originally posted by rhall View Post
            There is no need to run circular consensus for de novo genome assembly or scaffolding: the power in the data is in the read length, not in the base accuracy, and calculating CCS drastically reduces the yield of long reads.
            Thanks much for the reply. I tried running CircularConsensus on a single bax.h5 file, and I see what you mean about reduced read lengths.

            Originally posted by rhall View Post
            For confident gap filling, the gap should be spanned by multiple subreads from different ZMWs; if all the subreads spanning a gap are from the same ZMW (molecule), then there is a chance it is a biological chimera and not a real closure.
            I'll have a look through the PBjelly files and check how many ZMW/subreads support each gap-filling.

            As a follow-up, if a gap is filled in part by a particular subread from a particular ZMW, should I expect to see all other subreads from that ZMW also fill in the gap?

            It seems reasonable to expect that they would, at least in theory. But I was looking into a case where one of my gaps was filled by 2 of the 7 full-length subreads (9 subreads in total). For this particular ZMW, the subreads had relatively low quality (RQ=0.78), and some of the full-length subreads had substantially different lengths (median 7,000 bp, min 3,000 bp, max 9,000 bp). Are there other factors you'd recommend I check when evaluating the subreads that fill a gap?
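
            In case it is useful, here is a minimal sketch of this kind of per-ZMW length check, assuming a subreads FASTA with headers like ">movie/hole/start_end ..." (the filename and header layout are assumptions; adjust the parsing if your filtered_subreads.fasta differs):

            Code:
            # Per-ZMW subread length statistics from a subreads FASTA.
            import statistics
            from collections import defaultdict

            def per_zmw_lengths(fasta_path):
                lengths = defaultdict(list)   # (movie, hole) -> subread lengths
                zmw, n = None, 0
                with open(fasta_path) as fh:
                    for line in fh:
                        if line.startswith(">"):
                            if zmw is not None:
                                lengths[zmw].append(n)
                            movie, hole = line[1:].split()[0].split("/")[:2]
                            zmw, n = (movie, hole), 0
                        else:
                            n += len(line.strip())
                if zmw is not None:
                    lengths[zmw].append(n)
                return lengths

            # Print n / median / min / max subread length per ZMW.
            for zmw, ls in per_zmw_lengths("filtered_subreads.fasta").items():
                print(zmw, len(ls), statistics.median(ls), min(ls), max(ls))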



            • #7
              I don't personally run any PacBio software, I just deal with the files downstream, so I don't know the exact flags. But basically, I suggest you run with 0 required passes and a minimum quality of 0, so you don't discard anything. Do the longest reads still disappear in that scenario? According to what I have been told by PacBio, they shouldn't...



              • #8
                Originally posted by Brian Bushnell View Post
                I don't personally run any PacBio software, I just deal with the files downstream, so I don't know the exact flags. But basically, I suggest you run with 0 required passes and a minimum quality of 0, so you don't discard anything. Do the longest reads still disappear in that scenario? According to what I have been told by PacBio, they shouldn't...
                I ran CircularConsensus (ConsensusTools v2.3.0.149240) using these options:

                Code:
                ConsensusTools.sh CircularConsensus -n 16 \
                --logFile=test_ccs.log \
                --minFullPasses 0 --minPredictedAccuracy 0 \
                m160611_100724_42219_c101002732550000001823227509161692_s1_p0.1.bax.h5
                And after it finished, this is what was in the log file:

                Code:
                # 01:00:18 [CircularConsensus] Result Report for the 54494 Zmws processed
                # Zmw Result                                            #-Zmws     %-Zmws
                # Successful - Quiver consensus found                   8916       16.36 %
                # Successful - But only 1 region, no true consensus     16500      30.28 %
                # Failed - Exception thrown                             0          0.00 %
                # Failed - ZMW was not productive                       28289      51.91 %
                # Failed - Outside of SNR ranges                        753        1.38 %
                # Failed - No insert regions found                      0          0.00 %
                # Failed - Not enough full passes                       0          0.00 %
                # Failed - Insert length too small                      0          0.00 %
                # Failed - Post POA requirements not met                0          0.00 %
                # Failed - CCS Read below predicted accuracy            0          0.00 %
                # Failed - CCS Read was palindrome                      36         0.07 %
                # Failed - CCS Read below SNR threshold                 0          0.00 %
                # Failed - CCS Read too short or long                   0          0.00 %
                Looking at the distribution of sequence lengths of the subreads vs. the CCS reads, it seems like all of the largest sequences were preserved (see attached pics). So, thanks for the suggestion!

                The only odd thing from this small experiment was that the number of ZMWs represented in the CCS file (N=25,338) was smaller than what is indicated by the output log (N_successful = 8916 + 16500 = 25416). So there might be an additional filtering step (maybe based on length?) before the reads are actually written to the file.
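
                One way to check the discrepancy directly is to count the distinct ZMWs in the output FASTA and compare against the log; a minimal sketch, assuming CCS read names of the form movie/hole/ccs and a hypothetical output filename:

                Code:
                # Count distinct ZMWs in the CCS output FASTA; compare with
                # the log totals (8916 + 16500 = 25416).
                def count_ccs_zmws(fasta_path):
                    zmws = set()
                    with open(fasta_path) as fh:
                        for line in fh:
                            if line.startswith(">"):
                                movie, hole = line[1:].split()[0].split("/")[:2]
                                zmws.add((movie, hole))
                    return len(zmws)

                print(count_ccs_zmws("ccs_reads.fasta"))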
                Attached Files



                • #9
                  As a follow-up, if a gap is filled in part by a particular subread from a particular ZMW, should I expect to see all other subreads from that ZMW also fill in the gap?
                  There will be significant noise in this analysis, and I wouldn't expect all subreads to perform the same with regard to the alignment.

                  Running ROI with 0 passes and a minimum quality of 0 is not a recommended protocol, and I'm not aware of any use case for this kind of data. CCS data should only be used when high single-molecule accuracy is needed (minor variant detection, 16S, pseudo-gene differentiation). In these cases I also would not recommend using the old ROI pipeline; the new CCS2 algorithm will give much better results.



                  • #10
                    Originally posted by rhall View Post
                    Running ROI with 0 passes and a minimum quality of 0 is not a recommended protocol, and I'm not aware of any use case for this kind of data.
                    It reduces your volume of data, giving fewer, higher-quality short sequences while keeping the long sequences. This reduces computational requirements and increases alignment accuracy for any given read, while removing complications due to copies of chimeric molecules being presented as independent. What's not to like? I'd prefer that data for any use case.

                    In these cases I also would not recommend using the old ROI pipeline; the new CCS2 algorithm will give much better results.
                    I'm not aware of that; I'll have to look into it.



                    • #11
                      Originally posted by rhall View Post
                      There will be significant noise in this analysis, and I wouldn't expect all subreads to perform the same with regard to the alignment.

                      Running ROI with 0 passes and a minimum quality of 0 is not a recommended protocol, and I'm not aware of any use case for this kind of data. CCS data should only be used when high single-molecule accuracy is needed (minor variant detection, 16S, pseudo-gene differentiation). In these cases I also would not recommend using the old ROI pipeline; the new CCS2 algorithm will give much better results.
                      Thanks for the info. I'll check out the new consensus algorithm for future work (presumably this is the pbccs / unanimity module).

                      I do find it appealing to use CCS reads rather than the raw subreads, mostly because the HPC I have access to is relatively old and slow. Also, after scaffolding and gap-filling with the raw subreads, I realized I was left with a genome in which the spans filled with PacBio sequence have a higher error rate than the surrounding regions. Perhaps I should have pre-corrected the subreads beforehand, but I didn't, and rerunning the process with CCS reads was one option I'd been exploring.

                      Again, many thanks for your explanations.
