  • DeNovo assembly using pacBio data

    Hi,
    I am new to PacBio. I have visited the PacBio site, but I could not clearly understand which files are generated by primary analysis (base calling). Are they the CLR or the CCS files? How do we actually go about generating the filtered read files? If I want to use SMRT Pipe, what input does it take?

    Also, can anybody please suggest a de novo assembler that works well with PacBio data?

  • #2
    There's a high-level how-to on pacbiodevnet, describing how they used a combination of consensus and long reads to do de novo assembly of an E. coli strain.

    As I understand it, the consensus reads are generated during the primary analysis step, which occurs on the PacBio server. The main product of that step is the *.bas.h5 file, which includes basecalls, several quality scores, and limited kinetics info. At this point, adapters have been identified, and reads can be split into sub-reads.

    The secondary analysis, running on your cluster, filters subreads based on productivity, high-quality region length and score, to produce a filtered_subreads.fasta file. It also removes control reads (although I think those are still present in filtered_subreads.fasta), and then aligns the reads to the reference provided in the protocol to produce a BAM file and various other stuff.

    That's my view-from-40000-feet, anyway. Perhaps a helpful PacBio person will wander by and provide a bit more detail.

    • #3
      MIRA and Celera Assembler are two other assemblers which support de novo assembly using PacBio, and perhaps more importantly mixing PacBio with other technologies.

      • #4
        PacBio's base caller outputs sequence data in HDF5 format, PacBio's native data format. The HDF files contain base calls for both long reads and circular consensus (if applicable, meaning the reads wrapped around the adapters), as well as quality scores and kinetic measurements. PacBio provides APIs in Python, R and Java for accessing the files. You can download them from www.pacbiodevnet.com.

        When using PacBio's secondary analysis pipeline, you'll get alignments in SAM/BAM+BAI, coverage in BED and GFF, variant calls in GFF and VCF, as well as the FASTA/FASTQ for filtered subreads.

        • #5
          Pipeline for Corrected Long read generation

          Thank you for all your answers. Can anybody tell me which pipeline to follow to generate the error-corrected CLR reads? I have downloaded SMRT Pipe.
          Also, how are the filtered reads generated? Please help me out with the SMRT analysis. I downloaded E. coli raw reads from DevNet, but there seem to be several bas.h5 files. Do I combine them and proceed?
          What parameters does BLASR take? Can anybody help?

          • #6
            Your best bet will be to use the FASTQ files rather than the raw bas.h5 files. You can download them from the E. coli page here:
            Analysis workflows and tools for WGS, targeted, RNA, epigenetics and microbiome and metagenomic sequencing for advanced users.


            For example, you could grab the two filtered subread files for C227-11: one is for CCS, the other for long reads.

            Note that you could download the error-corrected version of the reads as a FASTQ from the same page. To do the error correction yourself, your best bet is pacBioToCA and the Celera Assembler. There are links to them both here:
            Analysis workflows and tools for WGS, targeted, RNA, epigenetics and microbiome and metagenomic sequencing for advanced users.


            One reason is that PacBio's error correction pipeline will be incorporated in the next software release.

            Also, Mike Schatz's presentation is really useful:

            • #7
              @jbingham - Thanks loads. I found that a pacBio.spec file is required. It is not there for any of the reads downloaded from PacBio DevNet. Is it always supplied with the data, as the manual says (in fact, I doubt it)?
              Can you shed some more light on it?

              • #8
                Krittika,

                We discovered that the CLI for the PacBio SMRT Analysis software is not fully supported by PacBio (at least that was our experience). We tried to use the CLI and ran into problems that only the developers could answer, but we never received satisfactory answers. You also need some settings XML files that are difficult to reproduce by hand, so I would advise staying away from the CLI for the current version of SMRT Analysis.

                That said, the SMRT Analysis software does work through the SMRT Portal web interface they provide (which has its own problems, since there is no good security model, but if you are the only user that may not be an issue). So your best bet may be to install that and move forward.

                You can set up some of the hybrid assembly through the SMRT Portal interface (we are in the process of trying it now). They do recommend having a cluster to run this on, so I hope you have access to one and are planning to do this work there.



                • #9
                  The pacbio.spec file is specific to Celera Assembler. PacBio's pipeline doesn't generate it. Examples are available for SGE and for high-memory instances.

                  Once you've got a working spec file, you should be able to use it for all analyses.

                  If you use the error-corrected reads as your starting point, you can run the SMRT Portal GUI directly, as @GenoMax suggested. You cannot yet do error correction through the GUI; you'll have to do it from the command line.

                  • #10
                    One more tip: there's also a C++ API to read PacBio HDF files. It's located in the SMRT Analysis source download in

                    cpp/common/data/hdf/HDFBasReader.h

                    • #11
                      question regarding quality scores

                      Apologies if this is a rather naive question, but http://oelemento.wordpress.com/2011/...uence-dataset/ mentioned that PacBio fastq files contain quality scores (c) for each nucleotide in each read. We are not seeing any quality scores in our initial analysis. Any help or suggestions would be greatly appreciated.

                      • #12
                        The current default output of SMRT Analysis is FASTA files, as you have noticed. FASTQ files will be produced by default in a future version of the SMRT Analysis package, but in the meantime you can get quality values from the *.bas.h5 files by using the script PacBio posted here: https://github.com/PacificBiosciences/pbh5tools/

                        Tom Skelly from Sanger recently posted a set of useful scripts for PacBio here: https://github.com/TomSkelly/PacBioEDA


                        • #13
                          Actually, the answer to rghan's "question regarding quality scores" is tougher than it looks. First off, it depends on what you want the fastq file to contain.

                          If you're after the circular consensus reads, those exist as Analysis_Results/<MovieName>.ccs.fastq.

                          But if it's the individual raw reads you're after, what do you want to see? All the bases from all the reads? Probably not: you can't feed that to an aligner, for example. You probably want the raw reads to be split up into subreads of contiguous sequence, with the adapters removed. And you probably want only productivity-1 reads. I.e., you want the fastq equivalent of the filtered_subreads.fasta file produced by secondary analysis.
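                          To make the splitting step concrete, here is a minimal sketch in plain Python. It assumes you already have the adapter hit intervals and the high-quality region as 0-based, half-open coordinates on the raw read; how you pull those out of the bas.h5 file is not shown. The function name, the argument names and the min_length cutoff are all hypothetical, and this is not PacBio's actual filtering code (there is no productivity or read-score filtering here).

```python
def split_subreads(read, adapters, hq_region, min_length=50):
    """Cut a raw read into subreads between adapter hits.

    read       -- raw read sequence (string)
    adapters   -- list of (start, end) adapter intervals, 0-based half-open
    hq_region  -- (start, end) of the high-quality region, same convention
    min_length -- hypothetical cutoff: shorter pieces are dropped

    Returns a list of (start, end, sequence) tuples.
    """
    hq_start, hq_end = hq_region
    pieces = []
    cursor = hq_start
    for a_start, a_end in sorted(adapters):
        # Ignore adapter hits that fall entirely outside what is left of the HQ region.
        if a_end <= cursor or a_start >= hq_end:
            continue
        # Everything between the cursor and this adapter is one subread.
        if a_start > cursor:
            pieces.append((cursor, a_start))
        cursor = max(cursor, a_end)
    # Trailing piece after the last adapter, up to the end of the HQ region.
    if cursor < hq_end:
        pieces.append((cursor, hq_end))
    return [(s, e, read[s:e]) for s, e in pieces if e - s >= min_length]


if __name__ == "__main__":
    # Toy read: insert, adapter, insert, adapter, short tail.
    toy = "A" * 100 + "N" * 10 + "C" * 80 + "N" * 10 + "G" * 30
    for start, end, seq in split_subreads(toy, [(100, 110), (190, 200)], (20, 225)):
        print(start, end, len(seq))
```

                          On the toy read, the short tail after the second adapter is dropped by the length cutoff, which is roughly the kind of thing the secondary-analysis filter is doing for you.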

                          pbh5tools won't give you that, I'm afraid. (Nor will my package.) "bash5tools.py --outType fastq --readType Raw" produces a fastq file containing all the bases from all the reads, unfiltered and un-split.

                          You could extract a fastq file from aligned_reads.sam. But that gives you just what it says: only the sub-reads which secondary analysis managed to align.
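                          A rough sketch of that SAM-to-FASTQ extraction, using only the Python standard library. It relies on the standard SAM column layout (QNAME, FLAG, SEQ and QUAL in columns 1, 2, 10 and 11) and on FLAG bit 0x10 marking reverse-strand alignments. The function name is made up, and skipping records without SEQ or QUAL is my own choice, not anything the pipeline does.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_lines):
    """Yield FASTQ records (as strings) for the alignments in a SAM file.

    sam_lines is any iterable of text lines, e.g. an open file handle.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if seq == "*" or qual == "*":
            continue  # sequence or qualities not stored in this record
        if flag & 0x10:
            # SAM stores reverse-strand alignments reverse-complemented;
            # flip them back to the original read orientation.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield "@%s\n%s\n+\n%s\n" % (name, seq, qual)


if __name__ == "__main__":
    import sys
    # Usage (file name is just an example): python sam_to_fastq.py aligned_reads.sam
    with open(sys.argv[1]) as handle:
        for record in sam_to_fastq(handle):
            sys.stdout.write(record)
```

                          Note that a subread with more than one alignment record will come out more than once; filter on the secondary/supplementary FLAG bits if that matters for your data.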

                          The next question is: What do those Q scores mean, anyway?

                          The bas.h5 file includes 4 separate probability scores for each basecall: substitution, insertion, deletion Q-probabilities, and an overall "QualityValue". The first three are easy to understand, but I've never been clear on what the 4th one represents. That's the score you see in the SAM and pbh5tools files.

                          I've heard it said that QualityValue is the Q-encoded combination of the first three probabilities. But looking at data, that doesn't appear to be true. (Can't read the code: it's part of primary analysis, not released by PacBio.)

                          And in any case, what do you make of the deletion probability? That's the prob that this basecall may have been followed (preceded?) by a missed base. That doesn't tell you anything about the validity of the basecall itself.
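                          For anyone who wants to check the "combination" claim against their own data, this is a sketch of what such a combination would look like under the usual Phred definition Q = -10 * log10(p). Summing the three error probabilities is an assumption on my part (it treats the error types as mutually exclusive); it is not PacBio's formula, and the function names are made up.

```python
import math


def qv_to_prob(q):
    """Phred-style Q value to error probability."""
    return 10 ** (-q / 10.0)


def prob_to_qv(p):
    """Error probability (0 < p <= 1) to Phred-style Q value."""
    return -10.0 * math.log10(p)


def combined_qv(*qvs):
    """Combine several per-base Q values into one.

    ASSUMPTION: the error types are mutually exclusive, so their
    probabilities simply add (capped at 1.0).
    """
    total = min(1.0, sum(qv_to_prob(q) for q in qvs))
    return prob_to_qv(total)


if __name__ == "__main__":
    # Hypothetical per-base values for substitution, insertion, deletion.
    sub_qv, ins_qv, del_qv = 20, 20, 20
    print(round(combined_qv(sub_qv, ins_qv, del_qv), 2))
    # Leaving the deletion score out, given the doubt about what it means.
    print(round(combined_qv(sub_qv, ins_qv), 2))
```

                          Because combined_qv takes any number of scores, you can compare the stored QualityValue against the combination with and without the deletion score.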

                          Perhaps some helpful PacBio person can shed a bit more light on all this.

                          --TS

                          • #14
                            Quality scores in the pacbio .fastq files

                            Hi, I wanted to know what kind of quality scores are in a fastq file from PacBio. Are they encoded as Phred+33 (Sanger-style) or Phred+64?

                            • #15
                              AFAIK, any ascii-encoded Q scores in fastq or SAM files will be encoded Q+33.

                              See last post for caveats about quality scores, however.
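                              In case it helps, a small sketch of what Q+33 means in practice: each quality character's ASCII code minus 33 is the Q value. The function names are just for illustration, and the offset is the Q+33 assumption stated above.

```python
PHRED_OFFSET = 33  # assumes Sanger-style Q+33 encoding


def decode_phred33(qual_string):
    """Turn a FASTQ/SAM quality string into a list of integer Q values."""
    return [ord(char) - PHRED_OFFSET for char in qual_string]


def encode_phred33(q_values):
    """Turn integer Q values back into a quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in q_values)


if __name__ == "__main__":
    print(decode_phred33("!+5I"))
```

                              If decoding gives you negative numbers or values well above what you expect, the file is probably using a different offset.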

                              --TS
