Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to screen sequences for a certain fold coverage?

    The default smrt pipeline only outputs CCS reads that are >=3x coverage. But I would like to require that CCS reads have >=5X coverage. I'm wondering if anyone know how to modify the SMRT portal to achieve that goal (or if there is other software/script for this purpose). Thanks.
    Last edited by metheuse; 06-04-2013, 07:26 AM.

  • #2
    pbh5tools has the functinality you need:
    GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

    GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.


    Within SMRT portal a way to achieve the same filter, if you have a fixed amplicon size, is to limit the read length to 5x the expected insert size.

    Comment


    • #3
      Originally posted by rhall View Post
      pbh5tools has the functinality you need:
      GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

      GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.


      Within SMRT portal a way to achieve the same filter, if you have a fixed amplicon size, is to limit the read length to 5x the expected insert size.
      Thanks a lot! This is what I was looking for!
      Just several more simple questions:

      For a CCS pacbio run, what does this file "*_s*_p0.bas.h5" contain? I know "*_s*_p0.fasta" has the long reads, and "*_s*_p0.ccs.fasta" has the subreads.

      If I use pbh5tools and I want to output NumberOfPasses, the input should be "*_s*_p9.bas.h5", so the read type of it should be "CCS", right? I'm just confused by how this "bas.h5" can tell how many subreads are from a long read? Does it have information from both long read and subread?

      Sorry I'm fresh new for pacbio, so please forgive my stupid questions.

      Comment


      • #4
        The "*_s*_p0.fasta" and "*_s*_p0.ccs.fasta" files are generated from the "*_s*_p0.bas.h5" file.

        The standard file format for PacBio data is the bas.h5 (bax.h5 for RSII), it includes a much richer set of data than both the fasta files. It contains all the sequencing information, the entire read from each ZMW, kinetic information for methylation analysis and the rich QVs needed for some PacBio specific pipelines (Quiver consensus calling, HGAP assembly).
        The pdf here:
        GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

        has the full spec of the bas.h5 file.

        More useful info on PacBio basics:
        GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.


        In particular the Primary analysis one goes into bas.h5 details:

        Comment


        • #5
          Originally posted by rhall View Post
          The "*_s*_p0.fasta" and "*_s*_p0.ccs.fasta" files are generated from the "*_s*_p0.bas.h5" file.

          The standard file format for PacBio data is the bas.h5 (bax.h5 for RSII), it includes a much richer set of data than both the fasta files. It contains all the sequencing information, the entire read from each ZMW, kinetic information for methylation analysis and the rich QVs needed for some PacBio specific pipelines (Quiver consensus calling, HGAP assembly).
          The pdf here:
          GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

          has the full spec of the bas.h5 file.

          More useful info on PacBio basics:
          GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.


          In particular the Primary analysis one goes into bas.h5 details:
          http://aa314.gondor.co/webinar/primary-analysis/

          Thanks a lot!
          I asked those questions because I was kind of confused when seeing this parameter in pbh5tools:
          Code:
            --readType READTYPE   read type (CCS or Raw) [Raw]
          My data is CCS, so this parameter should be "CCS". Does "Raw" actually mean other types of data?

          Comment


          • #6
            I see the confusion, there is no difference between a CCS run and a non CCS run. The only sequencing variables are movie time and chemistry. The bas.h5 includes the raw read sequence and the CCS sequence (if more then 3 passes have been made of the inserted DNA between the adapters).

            Comment


            • #7
              Originally posted by rhall View Post
              I see the confusion, there is no difference between a CCS run and a non CCS run. The only sequencing variables are movie time and chemistry. The bas.h5 includes the raw read sequence and the CCS sequence (if more then 3 passes have been made of the inserted DNA between the adapters).
              Thanks so much!

              Comment


              • #8
                Hi metheuse,

                By any chance, are you in Ann Arbor? I am also new to PacBio and am trying to do the exact same thing, funny this SeqAnswers thread exists just yesterday. I'm a post-doc at UM, so if you are local and want to work together at this shoot me an email at [email protected].

                Comment


                • #9
                  Hi, I have further questions on this thread.
                  The default smrt analysis pipeline keeps the long reads that have at least 3 passes of CCS. For my project, I want to get >=6 passes.
                  I know I can filter them by the polymerase read length (eg. >= 6xCCS + 7xadpater). But the length of each amplicon differs, so I want to filter each long read based on its own CCS length.
                  I know bash5tools.py can filter the reads by number of passes, but it only outputs fasta, fastq, or data format result. After filtering, I want to use blasr to map the reads. Blasr has the option to map CCS first, and then map the subreads to the window that CCS maps, so I think I'd better use bas.h5 instead of fastq as the input.
                  So the question is, how to filter CCS reads by minimum number of passes and output the filtered data as bas.h5?
                  Thanks.

                  Comment


                  • #10
                    Interesting use case, unfortunately not something that is straightforward. I can think of an around about way:
                    Generate a list of the read ids that have >=6 passes using bash5tools.py:
                    https://github.com/PacificBioscience.../doc/index.rst
                    Parse into whitelist format:
                    https://github.com/PacificBioscience...filtering-step
                    Then run filtering with the whitelist to produce a region file:
                    Code:
                    filterPlsH5 input.fofn –filter=’ReadWhitelist=<file containing list of included reads>’
                    This will produce a reads.rgn.h5 and filtered_regions.fofn file. You can then run blasr:
                    Code:
                    blasr input.fofn reference.fasta –regionTable filtered_regions.fofn ...
                    Hope that helps
                    Richard.

                    Comment


                    • #11
                      Thanks a lot, Richard! This is exactly what I need!
                      One question: should I use bas.h5 or pls.h5 for the input to filterPls.h5? My understanding is that alignment only involves bases, right?
                      I just tried using both pls.h5 and bas.h5 for filterpls.h5. They both gave me the rgn.h5 and .fofn outputs. But they both said this:
                      No handlers could be found for logger "pbpy.io.MovieHDF5IO"
                      Can this be a problem?
                      Thanks!

                      Comment


                      • #12
                        By the way, how do you know all of this? I searched online really hard but couldn't find much helpful information. The manuals of most smrtanalysis related programs don't have many details.

                        Comment


                        • #13
                          I work for PacBio. We don't always do a great job of documenting non standard workflows, we are attempting to address this with the devnet / github content.
                          filterplsH5.py is so named for historic reasons and works perfectly well with bas.h5 files. the error you are seeing should not be a problem, it is due to logging and not the actual filtering.

                          Comment


                          • #14
                            I see. Thanks a lot for your help, Richard.
                            For blasr, I should let it output sam format, then convert it to h5 for quiver, right?

                            My project is to find low freq variants by PacBio's CCS method. So the pipeline is:
                            filter certain numbers of passes by bash5tools -> mapping by blasr -> samtoh5 -> variant calling by quiver

                            Do you have any resource that I can read for parameter settings in each step? Especially, in blasr and quiver, there are a lot parameters to control the mapping quality and stringency, variant quality, etc.

                            For example,

                            1. the "-best" parameter that controls how many top alignments to report, the default is 10, but I'd better set it to 1, right?

                            2. For quiver, I'm reading the FAQ at: https://github.com/PacificBioscience.../QuiverFAQ.rst
                            This is a helpful page and it gives some recommendation for some parameter settings. But if you know more than it, please let me know, especially for the following parameters:

                            --coverage COVERAGE, -X COVERAGE
                            A designation of the maximum coverage level to be used
                            for analysis. Exact interpretation is algorithm-
                            specific.

                            --mapQvThreshold MAPQVTHRESHOLD, -m MAPQVTHRESHOLD

                            --variantConfidenceThreshold VARIANTCONFIDENCETHRESHOLD, -q VARIANTCONFIDENCETHRESHOLD

                            --variantCoverageThreshold VARIANTCOVERAGETHRESHOLD, -x VARIANTCOVERAGETHRESHOLD

                            --referenceChunkOverlap REFERENCECHUNKOVERLAP

                            --noEvidenceConsensusCall {nocall,reference,lowercasereference}
                            The consensus base that will be output for sites with
                            no effective coverage.
                            etc.

                            Sorry that I have too many to ask. This is my first time to use quiver.
                            Thanks so much.

                            Comment


                            • #15
                              blasr -> quiver is not as simple as converting to cmp.h5 from sam, the cmp.h5 for use in quiver has to have a lot of extra information. blasr has always been considered an underlying program, in standard PacBio pipelines compareSequences.py is used as a wrapper for blasr. A new program pbalign: https://github.com/PacificBioscience.../doc/howto.rst
                              Takes all the headaches out of aligning data for use with quiver. maxHits, (similar to bestn, but blasr has two parameters that interplay nCandidates & bestn) should be set to 1.
                              For quiver I would just stick to defaults as a first pass:
                              variantCaller.py --algorithm=quiver <cmp.h5> -r <reference> -o <gff> -o <consensus.fasta> -o <consensus.fastq>

                              One question, why this complex workflow, are you mapping to something that is highly repetitive? blasr and quiver are designed to work with normal read data. Also quiver will only call haploid variants.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X