Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    hmm, I just saw this too, quiver only calls haploid variants. My data is human. The reason to use pacbio ccs instead of illumina is that we want to find some really low freq small indels. We think CCS should be much more accurate.

    So can I use Plurality? Or do I have to go back to GATK? It seems that the gatk module in smrtanalysis pipeline cannot find indels.

    So you suggest using compareSequences.py instead of blasr?

    Once I get alignment from blasr as sam files, could I just leave smrtanalysis modules and use GATK or samtools to call variants? But still I don't know how different it will be compared to illumina reads.
    Last edited by metheuse; 08-30-2013, 06:03 AM.

    Comment


    • #17
      How low frequency do you expect the variant? lower than 1/100 reads will be very difficult.
      The last time I checked (~1 year ago) GATK will not find indels with PacBio data, the local realignment fails. Also I'm not sure GATK will find rare variants, I know it calls diploid, but I have never seen it used for rare variants, with PacBio data.

      How long is your amplicon, and how many indels do you expect?
      One option if the indel is more than a few bases, is to take high pass CCS data then look at the read length distribution.
      Or:
      If you know the expected variant, but want to calculate frequency, take high pass CCS data and map to multiple reference that account for the indel and measure coverage depth.
      Or:
      If the indel is novel and only a few bases, map high pass CCS reads to a reference, using pbalign (the replacement for compareSequences.py) mapping the CCS, not using the CCS to place the subread. Then simply count bases at each position in the alignment, applying a statistical test to the significance of variation. This last part may require writing some code, I'm not sure of any off the shelf solution, but there is possibly something that works with SAM files, maybe from the viral space.

      Comment


      • #18
        This is strange.

        I used bash5tools.py on bas.h5 file and extracted the CCS reads that have at least 3 passes. I compared the output with the p0.ccs.fasta file. The numbers are different. There are fewer reads in bash5tools.py output.

        And if I set --readType to "Raw" for bash5tools.py, the output has more reads than p0.fasta file. The p0.fasta file should contain all the raw reads, right?

        Comment


        • #19
          Originally posted by rhall View Post
          How low frequency do you expect the variant? lower than 1/100 reads will be very difficult.
          The last time I checked (~1 year ago) GATK will not find indels with PacBio data, the local realignment fails. Also I'm not sure GATK will find rare variants, I know it calls diploid, but I have never seen it used for rare variants, with PacBio data.

          How long is your amplicon, and how many indels do you expect?
          One option if the indel is more than a few bases, is to take high pass CCS data then look at the read length distribution.
          Or:
          If you know the expected variant, but want to calculate frequency, take high pass CCS data and map to multiple reference that account for the indel and measure coverage depth.
          Or:
          If the indel is novel and only a few bases, map high pass CCS reads to a reference, using pbalign (the replacement for compareSequences.py) mapping the CCS, not using the CCS to place the subread. Then simply count bases at each position in the alignment, applying a statistical test to the significance of variation. This last part may require writing some code, I'm not sure of any off the shelf solution, but there is possibly something that works with SAM files, maybe from the viral space.
          Thanks! I think I'll simply try GATK first. If it doesn't work, I'll try the solution you suggested.

          One question, for variant detection purpose, when I do alignment, should I align only ccs, or subreads? It seems that you suggested aligning only CCS.

          Comment


          • #20
            For rare variants it is easy to use CCS reads, the statistics are simpler, as you are essentially reducing the amount of noise in the alignment.

            For haploid variants I would use quiver and raw reads, the rich quality information in the raw reads together with the quiver algorithm produces great results, with the advantage that you can use all the data.

            The subreads.fasta file and the ccs.fasta files output by primary will have different numbers of reads than bash5tools output as they are subject to a read quality filter, which is also a parameter in bash5tools. I'm not sure off the top of my head what the filter is for the primary produced files.

            Comment


            • #21
              Originally posted by rhall View Post
              For rare variants it is easy to use CCS reads, the statistics are simpler, as you are essentially reducing the amount of noise in the alignment.

              For haploid variants I would use quiver and raw reads, the rich quality information in the raw reads together with the quiver algorithm produces great results, with the advantage that you can use all the data.

              The subreads.fasta file and the ccs.fasta files output by primary will have different numbers of reads than bash5tools output as they are subject to a read quality filter, which is also a parameter in bash5tools. I'm not sure off the top of my head what the filter is for the primary produced files.
              Thanks!

              Since my sample is human, I'll use ccs alignment only.

              If there is another filter for ccs.fasta, then the number of reads in it should be smaller than the output of bash5tools.py, right? But the fact is the opposite.

              Here are the parameters of bash5tools.py:

              Code:
              optional arguments:
                -h, --help            show this help message and exit
                -i, --info            turn on progress monitoring to stdout [False]
                -d, --debug           turn on progress monitoring to stdout and keep temp files [False]
                -v, --version         show program's version number and exit
                --outFilePref OUTFILEPREF
                                      output filename prefix []
                --readType READTYPE   read type (CCS or Raw) [Raw]
                --outType OUTTYPE     output file type (fasta, fastq, csv or detail) [fasta]
                --subReads            for fasta/fastq files, generate the subreads and not the raw reads [False]
              
              read filtering arguments:
                --minLength MINLEN    min read length [0]
                --minReadscore MINRS  min read score, valid only when used with --readType=Raw [0]
                --minMeanQV MINMQV    min per read mean Quality Value [0]
                --minNumberOfPasses MINNP
                                      min number of CCS passes, valid only when used with --readType=CCS [0]
              I want to filter based on number of passes, so I should use "CCS" as the readType, right? I tried "raw", but it outputs even more reads than the long read fasta file.

              My command is:
              Code:
              python bash5tools.py mydata.bas.h5 --readType CCS --outType fasta --minNumberOfPasses 6 --outFilePref filter6passesCCS_s1
              Is there any recommended setting for the other thresholds?

              Another question: how can I combine two bas.h5 files (s1 and s2 for each well)? I found how to merge cmp.h5, not bas.h5.
              Or could I filter them separately by bash5tools.py, and then put their names in an fofn file, and use this fofn file as input for alignment, and similarly, use an fofn rgn.h5 file for --regionTable?

              I hope I'm not bugging you too much, Richard.
              Last edited by metheuse; 08-30-2013, 10:13 AM.

              Comment


              • #22
                No problem,
                The subtleties of the filtering are complicated, it's something I'd have to look into, but the ccs.fasta file from primary is actually being removed in the next software release, I wouldn't use it.

                I would also set --minMeanQV 25 (but make sure you don't loose too many reads), as a high number of passes could in theory produce a read with low QVs in unusual situations.

                For multiple bas.h5 files, it should always be possible to give the program an input.fofn file for multiple bas.h5 files. It is not possible to merge bas.h5 files.

                Comment


                • #23
                  Originally posted by rhall View Post
                  No problem,
                  The subtleties of the filtering are complicated, it's something I'd have to look into, but the ccs.fasta file from primary is actually being removed in the next software release, I wouldn't use it.

                  I would also set --minMeanQV 25 (but make sure you don't loose too many reads), as a high number of passes could in theory produce a read with low QVs in unusual situations.

                  For multiple bas.h5 files, it should always be possible to give the program an input.fofn file for multiple bas.h5 files. It is not possible to merge bas.h5 files.
                  Thanks!

                  By the way, the small indels we are aiming to find can be at a frequency as low as 0.0001.

                  Comment


                  • #24
                    At a frequency of 0.0001, that would be ~4-5 single reads showing the mutation in a single RS II run, assuming all reads produce CCS data.
                    If you need any more assistance let me know, we can also take this conversation over to email if you like.
                    Good Luck.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    23 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    24 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    21 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X