  • Identifying heteroduplexes in PacBio CCS reads

    Hello,

    I'm working with circular consensus (CCS) reads from a PCR product that may contain heteroduplexes. The PCR template was a mix of minor variants, and in the final annealing steps, strands amplified from different templates can anneal to each other if they are similar enough. I expect that in a small percentage of the CCS reads the two strands differ, while the PacBio 'Reads of Insert' method may assume they are identical.

    The simplest way I thought of to deal with this would be to look at the FASTQ CCS reads and filter based on the phred scores contained there. While most of the CCS reads are going to be really high accuracy, a heteroduplex would have an equal number of subreads for each strand, making the probability of error high for nucleotides affected by heteroduplexes.

    My question is how the PacBio CCS error model works, and whether my strategy would work. I ask because I know there are many components of the PacBio error model that are consolidated into the FASTQ phred score, so it may not be as simple as I think.

    Thanks!
    Last edited by verheytb; 04-21-2015, 12:45 PM.
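For reference, the filter described in this post could be sketched roughly as follows. This is only an illustration under stated assumptions: the FASTQ is taken to be 4 lines per record with Phred+33 encoding, and the function names and the threshold of 20 are hypothetical, not part of any PacBio tool. In an idealised model where the subreads split evenly between two bases, the error probability at that position would approach 0.5, which corresponds to a Phred value of about 3; whether real CCS quality values behave that way is the open question.

```python
import io


def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10.0)


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        # Assumption: Phred+33 (Sanger) encoding of the quality string
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def low_qv_positions(quals, threshold=20):
    """Return 0-based positions whose QV falls below the (hypothetical) threshold."""
    return [i for i, q in enumerate(quals) if q < threshold]


if __name__ == "__main__":
    # Two toy reads; the second has one low-quality base ('$' is QV 3)
    demo = io.StringIO(
        "@read1\nACGTACGT\n+\nIIIIIIII\n"
        "@read2\nACGTACGT\n+\nIII$IIII\n"
    )
    for name, seq, quals in read_fastq(demo):
        flagged = low_qv_positions(quals)
        if flagged:
            print(name, [(i, seq[i], quals[i]) for i in flagged])
```

Reads with flagged positions would be candidates for closer inspection rather than confirmed heteroduplexes, since low quality values can arise for other reasons.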

  • #2
    Substitutions are very rare in PacBio data, the predominant error being indels. So long as the minor variants are SNPs it should be straightforward to detect heteroduplexes from the CCS QV scores.
    I believe this has been tested before; I'll see if I can dig up any data on the magnitude of the effect on QV. It's not completely straightforward, given that the CCS QVs are not perfectly calibrated.

    • #3
      Having looked at some data that was generated from known heteroduplexes, the only reliable way to detect mismatches using the QV is by focusing specifically on the Substitution QV, not the base QV that is calculated for the fastq file. To get access to this you will have to work from the ccs.h5 files which include all the QV values for every base.

      • #4
        Originally posted by rhall
        Having looked at some data that was generated from known heteroduplexes, the only reliable way to detect mismatches using the QV is by focusing specifically on the Substitution QV, not the base QV that is calculated for the fastq file. To get access to this you will have to work from the ccs.h5 files which include all the QV values for every base.
        Thanks, that's really helpful! There will also be the possibility of indel heteroduplexes. Since I'm filtering my data for reads with 10 or more passes, can I rely on the indel QVs at all?

        • #5
          Unfortunately, the indel QV at 10 passes will likely generate a lot of false positives. In particular, you will see lots of low QVs for indels in homopolymer regions.
          One option for very high pass CCS reads would be to flag reads based on QV, then calculate the consensus for the forward and reverse strands independently and compare the results.
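The comparison step could be sketched as below, assuming the two strand-specific consensus sequences have already been produced by some other means. The function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for a proper pairwise aligner, so this is an illustration rather than a recommended method.

```python
from difflib import SequenceMatcher

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def consensus_differences(forward, reverse, reverse_is_revcomp=False):
    """List the places where two strand-specific consensus sequences disagree.

    Each item is (operation, start, end, forward_bases, reverse_bases), with
    start/end given in forward-consensus coordinates. Set reverse_is_revcomp
    if the second sequence is still in the opposite orientation.
    """
    if reverse_is_revcomp:
        reverse = reverse_complement(reverse)
    # autojunk=False keeps difflib from ignoring frequent characters,
    # which matters for a four-letter alphabet
    matcher = SequenceMatcher(None, forward, reverse, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            diffs.append((op, i1, i2, forward[i1:i2], reverse[j1:j2]))
    return diffs


if __name__ == "__main__":
    fwd = "ACGTTGCAAGCT"
    rev = "ACGTTACAAGCT"  # toy example: one substitution relative to fwd
    for diff in consensus_differences(fwd, rev):
        print(diff)
```

A 'replace' entry suggests a substitution difference between strands, while 'insert' and 'delete' entries suggest indels. The exact position reported for an indel inside a homopolymer run is arbitrary, so those would need the most scrutiny.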

          • #6
            I like the idea of generating strand-specific CCS reads.

            Is there an easy way to get the forward and the reverse subreads? I see in the bas.h5 reference guide that each subread has information for each pass, including direction, but pbcore.io doesn't seem to have any documented way to access it.

            Also, what is the best way to generate the circular consensus from the set of subreads from a particular strand? I can't find documentation on how P_ReadsOfInsert does it.

            • #7
              Unfortunately, it's easier said than done.
              I don't see any way to generate a quality-aware consensus for the forward and reverse strands using either the CCS consensus or Quiver code.
              One method would be to extract the forward and reverse sequences from a filtered_subreads.fasta file generated using standard filtering, align them against a common reference using blasr (the alignment does not have to be high quality; you could simply use the first subread as a reference), then call consensus using pbdagcon. The problem is that pbdagcon was developed for speed and does not use the rich quality values that CCS and Quiver use for consensus generation. I'm therefore not sure whether the differences will be detectable above the noise.
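One possible way to split the subreads of each ZMW into two strand groups is sketched below. It rests on two assumptions that are not stated in this thread: that subread names follow the `<movie>/<holeNumber>/<start>_<end>` pattern, and that consecutive passes around the SMRTbell alternate strands. If filtering has dropped a pass or an adapter was missed, the alternation breaks, so the grouping should be checked against the alignment strand. The function names are hypothetical.

```python
from collections import defaultdict


def parse_subread_name(name):
    """Split '<movie>/<hole>/<start>_<end>' into (movie, hole, start, end)."""
    movie, hole, span = name.split()[0].rsplit("/", 2)
    start, end = span.split("_")
    return movie, int(hole), int(start), int(end)


def split_by_pass_parity(names):
    """Group subread names per ZMW into two lists by alternating pass order.

    Returns {(movie, hole): (even_passes, odd_passes)}. Which list is
    'forward' is arbitrary; only the alternation is assumed.
    """
    per_zmw = defaultdict(list)
    for name in names:
        movie, hole, start, _ = parse_subread_name(name)
        per_zmw[(movie, hole)].append((start, name))
    groups = {}
    for zmw, subreads in per_zmw.items():
        # Order passes by their start coordinate within the polymerase read
        ordered = [n for _, n in sorted(subreads)]
        groups[zmw] = (ordered[0::2], ordered[1::2])
    return groups


if __name__ == "__main__":
    # Toy subread names, not real data
    names = ["m1/7/0_500", "m1/7/1100_1600", "m1/7/550_1050", "m1/8/0_400"]
    for zmw, (even, odd) in split_by_pass_parity(names).items():
        print(zmw, even, odd)
```

The two name lists per ZMW could then be used to pull the matching records out of the FASTA file before alignment and consensus calling.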

              • #8
                OK, so it is possible to generate a high-quality strand-specific consensus, but this behavior seems to be broken after SMRT Analysis 2.3.0 patch2. I'm not sure exactly when it broke or how this relates to the GitHub versions of the tools, but assuming SMRT Analysis 2.3.0, align all the subreads to a reference using a standard pipeline, then use the resulting cmp.h5:
                Code:
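                # Select the alignments from one ZMW on one strand (Strand==0 here; presumably Strand==1 selects the other strand)
                cmph5tools.py select --where "(Movie=='<movieName>') & (HoleNumber==<ZMW>) & (Strand==0)" --outFile <ZMW>_0.cmp.h5 aligned_reads.cmp.h5
                # Sort the selected alignments before consensus calling
                cmph5tools.py sort <ZMW>_0.cmp.h5
                # Call a quality-aware consensus from that strand's subreads only
                quiver --referenceFilename <reference fasta> -o <output gff and/or fasta> <ZMW>_0.cmp.h5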

                • #9
                  I will give that a try! Thanks very much.
