  • krittika.sasmal
    Junior Member
    • Jan 2012
    • 5

    DeNovo assembly using pacBio data

    Hi,
    I am new to PacBio. I have visited the PacBio site, but I could not work out which files are generated by primary analysis (base calling). Are they the CLR or the CCS files? How do we actually go about generating the filtered read files? And if I want to use SMRT Pipe, what input does it take?

    Also, can anybody please suggest a de novo assembler that works well with PacBio data?
  • SillyPoint
    Member
    • May 2008
    • 39

    #2
    There's a high-level how-to on pacbiodevnet, describing how they used a combination of consensus and long reads to do de novo assembly of an E.coli strain.

    As I understand it, the consensus reads are generated during the primary analysis step, which occurs on the PacBio server. The main product of that step is the *.bas.h5 file, which includes basecalls, several quality scores, and limited kinetics info. At this point, adapters have been identified, and reads can be split into sub-reads.

    The secondary analysis, running on your cluster, filters subreads based on productivity, high-quality region length and score, to produce a filtered_subreads.fasta file. It also removes control reads (although I think those are still present in filtered_subreads.fasta), and then aligns to the reference provided in the protocol to produce a BAM file and various other outputs.

    That's my view-from-40000-feet, anyway. Perhaps a helpful PacBio person will wander by and provide a bit more detail.

    Comment

    • krobison
      Senior Member
      • Nov 2007
      • 734

      #3
      MIRA and Celera Assembler are two other assemblers which support de novo assembly using PacBio, and perhaps more importantly mixing PacBio with other technologies.

      Comment

      • jbingham
        Member
        • Jul 2011
        • 24

        #4
        PacBio's base caller outputs sequence data in HDF5 format, PacBio's native data format. The HDF files contain base calls for both long reads and circular consensus (if applicable, meaning the reads wrapped around the adapters), as well as quality scores and kinetic measurements. PacBio provides APIs in Python, R and Java for accessing the files. You can download them from www.pacbiodevnet.com.

        When using PacBio's secondary analysis pipeline, you'll get alignments in SAM/BAM+BAI, coverage in BED and GFF, variant calls in GFF and VCF, as well as the FASTA/FASTQ for filtered subreads.
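As a rough illustration of working with the filtered-subread FASTQ mentioned above, here is a minimal Python sketch of a reader. It assumes the common four-lines-per-record layout (no line-wrapped sequences); the function name `read_fastq` and the sample record are invented for the example and are not real PacBio output.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in " + header.strip())
        yield header[1:].strip(), seq, qual


# Illustrative record only; the read name is made up.
example = io.StringIO("@example_read/0_8\nACGTACGT\n+\n))**++,,\n")
records = list(read_fastq(example))
print(records)
```

For real files you would pass an open file handle instead of the `io.StringIO` object.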

        Comment

        • krittika.sasmal
          Junior Member
          • Jan 2012
          • 5

          #5
          Pipeline for Corrected Long read generation

          Thank you for all your answers. Can anybody tell me the pipeline to follow to generate the error-corrected CLR reads? I have downloaded SMRT Pipe.
          Also, how are the filtered reads generated? Please help me out with the SMRT analysis. I downloaded E. coli raw reads from DevNet, but there seem to be several bas.h5 files. Do I combine them and proceed?
          What parameters does BLASR take? Can anybody help?

          Comment

          • jbingham
            Member
            • Jul 2011
            • 24

            #6
            Your best bet will be to use the FASTQ files rather than the raw bas.h5 files. You can download them from the E coli page here:
            Open-source community-developed analysis tools for PacBio SMRT Sequencing data, example data sets, and resources


            For example, you could grab the two filtered subread files for C227-11: one is for CCS, the other for long reads.

            Note that you could download the error-corrected version of the reads as a FASTQ from the same page. To do the error correction yourself, your best bet is pacBioToCA and the Celera Assembler. There are links to them both here:
            Open-source community-developed analysis tools for PacBio SMRT Sequencing data, example data sets, and resources


            One reason is that PacBio's error correction pipeline will be incorporated in the next software release.

            Also, Mike Schatz's presentation is really useful:

            Comment

            • krittika.sasmal
              Junior Member
              • Jan 2012
              • 5

              #7
              @jbingham: Thanks loads. I found that a pacBio.spec file is required, but it is not included with any of the reads I downloaded from PacBio DevNet. Is it always supplied with the data, as the manual says? (In fact, I doubt it.)
              Can you shed some more light on it?

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #8
                Krittika,

                We discovered that the CLI for the PacBio SMRT analysis software is not fully supported by PacBio (at least that was our experience). We were trying to use the CLI and ran into problems that only the developers could answer. But we never received satisfactory answers. You also need to use some settings xml files that are difficult to reproduce by hand so I would advise staying away from the CLI for the current version of SMRT analysis.

                That said, SMRTanalysis software does work through the SMRTPortal web interface they provide (which has its own problems since there is no good security model but if you are the only user then it may not be an issue). So your best bet may be to install that and move forward.

                You can set up some of the hybrid assembly through the SMRTportal interface (we are in the process of trying it now). They do recommend having a cluster to run this on so I hope you have access to one and are planning to do this work there.



                Comment

                • jbingham
                  Member
                  • Jul 2011
                  • 24

                  #9
                  The pacbio.spec file is specific to Celera Assembler. PacBio's pipeline doesn't generate it. Examples are available for SGE and for high-memory instances.



                  Once you've got a working spec file, you should be able to use it for all analyses.

                  If you use the error corrected reads as your starting point, you can run the SMRT Portal GUI directly, as @GenoMax suggested. You cannot yet do error correction through the GUI. You'll have to do it from the command-line.

                  Comment

                  • jbingham
                    Member
                    • Jul 2011
                    • 24

                    #10
                    One more tip: there's also a C++ API to read PacBio HDF files. It's located in the SMRT Analysis source download in

                    cpp/common/data/hdf/HDFBasReader.h

                    Comment

                    • rghan
                      Junior Member
                      • Mar 2011
                      • 9

                      #11
                      question regarding quality scores

                      Apologies if this is a rather naive question, but http://oelemento.wordpress.com/2011/...uence-dataset/ mentioned that PacBio fastq files contain quality scores (c) for each nucleotide in each read. We are not seeing any quality scores in our initial analysis. Any help or suggestions would be greatly appreciated.

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        As you have noticed, the current default output of SMRT Analysis is FASTA. A future version of the SMRT Analysis package is expected to produce FASTQ files by default; in the meantime you can get quality values from the *.bas.h5 files using the script PacBio posted here: https://github.com/PacificBiosciences/pbh5tools/

                        Tom Skelly from Sanger recently posted a set of useful scripts for PacBio here: https://github.com/TomSkelly/PacBioEDA


                        Comment

                        • SillyPoint
                          Member
                          • May 2008
                          • 39

                          #13
                          Actually, the answer to rghan's "question regarding quality scores" is tougher than it looks. First off, it depends on what you want the fastq file to contain.

                          If you're after the circular consensus reads, those exist as Analysis_Results/<MovieName>.ccs.fastq.

                          But if it's the individual raw reads you're after, what do you want to see? All the bases from all the reads? Probably not: you can't feed that to an aligner, for example. You probably want the raw reads to be split up into subreads of contiguous sequence, with the adapters removed. And you probably want only productivity-1 reads. I.e., you want the fastq equivalent of the filtered_subreads.fasta file produced by secondary analysis.
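The splitting described above can be sketched roughly as follows. This assumes you already have the adapter intervals and the high-quality region as 0-based, half-open coordinates on the raw read; the function name, the toy read, the coordinates, and the `min_length` cutoff are all hypothetical, and this is not PacBio's implementation.

```python
def split_subreads(read, adapters, hq_region, min_length=1):
    """Return (start, end, sequence) for each stretch between adapters
    that lies inside the high-quality region."""
    hq_start, hq_end = hq_region
    subreads = []
    cursor = hq_start
    for a_start, a_end in sorted(adapters):
        if a_end <= hq_start or a_start >= hq_end:
            continue  # adapter lies entirely outside the high-quality region
        end = max(a_start, cursor)
        if end - cursor >= min_length:
            subreads.append((cursor, end, read[cursor:end]))
        # Resume just past the adapter, but never beyond the HQ region.
        cursor = max(cursor, min(a_end, hq_end))
    if hq_end - cursor >= min_length:
        subreads.append((cursor, hq_end, read[cursor:hq_end]))
    return subreads


# Toy read: 'x' marks adapter bases, 'N' marks low-quality flanks.
raw = "NNAAAAAxxxCCCCCCxxxGGGGNN"
pieces = split_subreads(raw, adapters=[(7, 10), (16, 19)], hq_region=(2, 23))
print(pieces)
```

Filtering on productivity or read score would be a separate step applied per read before or after this split.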

                          pbh5tools won't give you that, I'm afraid. (Nor will my package.) "bash5tools.py --outType fastq --readType Raw" produces a fastq file containing all the bases from all the reads, unfiltered and un-split.

                          You could extract a fastq file from aligned_reads.sam. But that gives you just what it says: only the sub-reads which secondary analysis managed to align.
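A minimal sketch of that extraction in Python, assuming standard SAM columns (QNAME, FLAG, SEQ in column 10, QUAL in column 11) and that the file actually stores qualities. The function name and the toy records are made up for illustration. Reverse-strand alignments are flipped back to the original read orientation, and unmapped, secondary, and supplementary records are skipped.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_lines):
    """Yield one FASTQ record (as text) per primary, mapped SAM alignment."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue  # unmapped, secondary, or supplementary
        if seq == "*" or qual == "*":
            continue  # sequence or qualities not stored
        if flag & 0x10:
            # SAM stores reverse-strand hits reverse-complemented; undo that.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield "@%s\n%s\n+\n%s\n" % (name, seq, qual)


# Toy SAM records, invented for the example.
sam = [
    "@HD\tVN:1.0\n",
    "read1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t16\tref\t5\t60\t4M\t*\t0\t0\tAACC\tABCD\n",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII\n",
]
fastq_records = list(sam_to_fastq(sam))
print("".join(fastq_records), end="")
```

Note that soft- or hard-clipped alignments may not carry the full subread, so the output is only as complete as what the aligner kept.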

                          The next question is: What do those Q scores mean, anyway?

                          The bas.h5 file includes 4 separate probability scores for each basecall: substitution, insertion, deletion Q-probabilities, and an overall "QualityValue". The first three are easy to understand, but I've never been clear on what the 4th one represents. That's the score you see in the SAM and pbh5tools files.

                          I've heard it said that QualityValue is the Q-encoded combination of the first three probabilities. But looking at data, that doesn't appear to be true. (Can't read the code: it's part of primary analysis, not released by PacBio.)
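One way to check that claim against your own data is to combine the three per-base Q values yourself and compare the result with the reported QualityValue. The sketch below assumes the usual Phred relation Q = -10 log10(p) and treats the three error types as independent. Both are assumptions for illustration; this is not PacBio's formula, which is not public.

```python
import math


def q_to_p(q):
    """Phred Q score to error probability."""
    return 10 ** (-q / 10.0)


def p_to_q(p):
    """Error probability to Phred Q score."""
    return -10.0 * math.log10(p)


def combined_q(sub_q, ins_q, del_q):
    """Q score for 'at least one of the three errors', assuming independence."""
    p_ok = (1 - q_to_p(sub_q)) * (1 - q_to_p(ins_q)) * (1 - q_to_p(del_q))
    return p_to_q(1 - p_ok)


# Three Q10 components (10% each) combine to a noticeably lower overall Q.
print(round(combined_q(10, 10, 10), 2))
```

If the QualityValue in your bas.h5 data differs systematically from a combination like this, that supports the observation that it is computed some other way.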

                          And in any case, what do you make of the deletion probability? That's the prob that this basecall may have been followed (preceded?) by a missed base. That doesn't tell you anything about the validity of the basecall itself.

                          Perhaps some helpful PacBio person can shed a bit more light on all this.

                          --TS

                          Comment

                          • krittika.sasmal
                            Junior Member
                            • Jan 2012
                            • 5

                            #14
                            Quality scores in the pacbio .fastq files

                            Hi, I wanted to know what kind of quality scores are there in a fastq file from pacbio? PHRED 32 /64? or is it Sanger type quality scores?

                            Comment

                            • SillyPoint
                              Member
                              • May 2008
                              • 39

                              #15
                              AFAIK, any ascii-encoded Q scores in fastq or SAM files will be encoded Q+33.

                              See last post for caveats about quality scores, however.
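For reference, decoding Q+33 in Python is a one-liner. The helper names below are illustrative, and the error-probability conversion assumes the standard Phred definition, which (given the caveats above) may not be exactly what PacBio's scores mean.

```python
def decode_phred33(quality_string):
    """Convert an ASCII quality string (offset 33) into integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Standard Phred interpretation: p = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


scores = decode_phred33("!+5I")
print(scores)  # [0, 10, 20, 40]
print([error_probability(q) for q in scores])
```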

                              --TS

                              Comment
