**SillyPoint** · 01-17-2012, 08:00 AM

There's a high-level how-to on pacbiodevnet describing how they combined consensus and long reads to do de novo assembly of an E. coli strain.

As I understand it, the consensus reads are generated during the primary analysis step, which occurs on the PacBio server. The main product of that step is the *.bas.h5 file, which includes basecalls, several quality scores, and limited kinetics info. At this point, adapters have been identified, and reads can be split into sub-reads.

The secondary analysis, running on your cluster, filters subreads based on productivity, high-quality region length and score, to produce a filtered_subreads.fasta file. It also removes control reads (although I think those are still present in filtered_subreads.fasta), and then aligns to the reference provided in the protocol to produce a BAM file and various other outputs.

That's my view-from-40000-feet, anyway. Perhaps a helpful PacBio person will wander by and provide a bit more detail.

**krobison** · 01-17-2012, 09:29 AM

MIRA and Celera Assembler are two other assemblers which support de novo assembly using PacBio, and perhaps more importantly mixing PacBio with other technologies.

**jbingham** · 01-18-2012, 05:42 PM

PacBio's base caller outputs sequence data in HDF5 format, PacBio's native data format. The HDF files contain base calls for both long reads and circular consensus (if applicable, meaning the reads wrapped around the adapters), as well as quality scores and kinetic measurements. PacBio provides APIs in Python, R and Java for accessing the files. You can download them from www.pacbiodevnet.com.

When using PacBio's secondary analysis pipeline, you'll get alignments in SAM/BAM+BAI, coverage in BED and GFF, variant calls in GFF and VCF, as well as the FASTA/FASTQ for filtered subreads.

**krittika.sasmal** · 01-18-2012, 09:21 PM

Pipeline for Corrected Long read generation

Thank you for all your answers. Can anybody tell me the pipeline to follow to generate the error-corrected CLR reads? I have downloaded SMRT Pipe.
Moreover, how are the filtered reads generated? Please help me out with the SMRT analysis. I downloaded E. coli raw reads from DevNet, but there seem to be several bas.h5 files. Do I combine them and proceed?
What parameters does BLASR take? Can anybody help?

**jbingham** · 01-18-2012, 09:36 PM

Your best bet will be to use the FASTQ files rather than the raw bas.h5 files. You can download them from the E. coli page here:

http://www.pacbiodevnet.com/Share/Datasets/E-coli-Outbreak

For example, you could grab the two filtered subread files for C227-11: one is for CCS, the other for long reads.

Note that you could download the error-corrected version of the reads as a FASTQ from the same page. To do the error correction yourself, your best bet is pacBioToCA and the Celera Assembler. There are links to them both here:

http://www.pacbiodevnet.com/CodeShare_Project?id=a1q70000000GrT6AAK

One reason is that PacBio's error correction pipeline will be incorporated in the next software release.

Also, Mike Schatz's presentation is really useful:

http://schatzlab.cshl.edu/presentations/2011-09-07.PacBio%20Users%20Meeting.pdf

**krittika.sasmal** · 01-19-2012, 04:51 AM

@jbingham - Thanks loads. I found that a pacBio.spec file is required, but it is not there for any of the reads downloaded from PacBio DevNet. Is it always supplied with the data, as the manual says? (In fact, I doubt it.)
Can you shed some more light on it?

**GenoMax** · 01-19-2012, 12:53 PM

Krittika,

We discovered that the CLI for the PacBio SMRT analysis software is not fully supported by PacBio (at least that was our experience). We were trying to use the CLI and ran into problems that only the developers could answer. But we never received satisfactory answers. You also need to use some settings xml files that are difficult to reproduce by hand so I would advise staying away from the CLI for the current version of SMRT analysis.

That said, SMRTanalysis software does work through the SMRTPortal web interface they provide (which has its own problems since there is no good security model but if you are the only user then it may not be an issue). So your best bet may be to install that and move forward.

You can set up some of the hybrid assembly through the SMRTportal interface (we are in the process of trying it now). They do recommend having a cluster to run this on so I hope you have access to one and are planning to do this work there.

**jbingham** · 01-19-2012, 01:35 PM

The pacbio.spec file is specific to Celera Assembler. PacBio's pipeline doesn't generate it. Examples are available for SGE

http://www.cbcb.umd.edu/~sergek/PacBio/data/sampleData/pacbio.SGE.spec

and for high memory instances

http://www.cbcb.umd.edu/~sergek/PacBio/data/sampleData/pacbio.spec

Once you've got a working spec file, you should be able to use it for all analyses.

If you use the error corrected reads as your starting point, you can run the SMRT Portal GUI directly, as @GenoMax suggested. You cannot yet do error correction through the GUI. You'll have to do it from the command-line.

**jbingham** · 01-20-2012, 03:28 PM

One more tip: there's also a C++ API to read PacBio HDF files. It's located in the SMRT Analysis source download in

cpp/common/data/hdf/HDFBasReader.h

**rghan** · 01-25-2012, 04:27 AM

question regarding quality scores

Apologies if this is a rather naive question, but http://oelemento.wordpress.com/2011/...uence-dataset/ mentioned that PacBio fastq files contain quality scores (c) for each nucleotide in each read. We are not seeing any quality scores in our initial analysis. Any help or suggestions would be greatly appreciated.

**GenoMax** · 01-25-2012, 04:41 AM

As you have noticed, the current default output of SMRT analysis is FASTA format files. FASTQ format sequence files will be produced by default in a future version of the SMRT analysis package, but in the meantime you can get quality values from the *.bas.h5 files by using the script PacBio posted here: https://github.com/PacificBiosciences/pbh5tools/

Tom Skelly from Sanger recently posted a set of useful scripts for PacBio here: https://github.com/TomSkelly/PacBioEDA

**SillyPoint** · 01-26-2012, 10:35 AM

Actually, the answer to rghan's "question regarding quality scores" is tougher than it looks. First off, it depends on what you want the fastq file to contain.

If you're after the circular consensus reads, those exist as Analysis_Results/<MovieName>.ccs.fastq.

But if it's the individual raw reads you're after, what do you want to see? All the bases from all the reads? Probably not: you can't feed that to an aligner, for example. You probably want the raw reads to be split up into subreads of contiguous sequence, with the adapters removed. And you probably want only productivity-1 reads. I.e., you want the fastq equivalent of the filtered_subreads.fasta file produced by secondary analysis.

pbh5tools won't give you that, I'm afraid. (Nor will my package.) "bash5tools.py --outType fastq --readType Raw" produces a fastq file containing all the bases from all the reads, unfiltered and un-split.

You could extract a fastq file from aligned_reads.sam. But that gives you just what it says: only the sub-reads which secondary analysis managed to align.
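
For anyone who wants to try the SAM route, here is a minimal sketch in Python. It relies only on the standard SAM column layout (name in column 1, flag in column 2, sequence in column 10, qualities in column 11); the function name `sam_to_fastq` is made up for illustration and is not part of any PacBio tool. It skips unmapped, secondary and supplementary records, and undoes the reverse-complementing that aligners apply to reverse-strand hits. Note the assumption that the aligner did not hard-clip the reads; if it did, the clipped bases are simply not in the SAM file.

```python
# Hypothetical helper: pull FASTQ records out of SAM text lines.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_to_fastq(sam_lines):
    """Yield one 4-line FASTQ record (as a string) per primary mapped alignment."""
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        if seq == "*" or qual == "*":     # sequence or qualities not stored
            continue
        if flag & 0x10:                   # reverse strand: restore read orientation
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

Typical use would be something like `out.writelines(sam_to_fastq(open("aligned_reads.sam")))`, with `out` an open output file.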

The next question is: What do those Q scores mean, anyway?

The bas.h5 file includes 4 separate probability scores for each basecall: substitution, insertion, deletion Q-probabilities, and an overall "QualityValue". The first three are easy to understand, but I've never been clear on what the 4th one represents. That's the score you see in the SAM and pbh5tools files.

I've heard it said that QualityValue is the Q-encoded combination of the first three probabilities. But looking at data, that doesn't appear to be true. (Can't read the code: it's part of primary analysis, not released by PacBio.)
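
To check that claim against your own data, you need some concrete guess at what "Q-encoded combination" means. The sketch below uses the simplest one: convert each Q value to a probability, add the three, and convert back. That summing rule is purely an assumption for the sake of the comparison, not PacBio's documented formula, and as noted above the real QualityValue does not seem to match it. The function names are illustrative.

```python
import math

def q_to_prob(q):
    """Phred-style Q value -> error probability."""
    return 10 ** (-q / 10.0)

def prob_to_q(p):
    """Error probability -> Phred-style Q value."""
    return -10.0 * math.log10(p)

def combined_q(sub_q, ins_q, del_q):
    """ASSUMED combination rule: sum the three error probabilities (capped at 1)."""
    p = min(1.0, q_to_prob(sub_q) + q_to_prob(ins_q) + q_to_prob(del_q))
    return prob_to_q(p)
```

With this in hand you can compute `combined_q` per base from the three per-base scores and compare it to the QualityValue stored for the same base.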

And in any case, what do you make of the deletion probability? That's the prob that this basecall may have been followed (preceded?) by a missed base. That doesn't tell you anything about the validity of the basecall itself.

Perhaps some helpful PacBio person can shed a bit more light on all this.

--TS

**krittika.sasmal** · 01-26-2012, 08:40 PM

Quality scores in the pacbio .fastq files

Hi, I wanted to know what kind of quality scores are in a fastq file from PacBio. Are they Phred+33 or Phred+64? Or Sanger-type quality scores?

**SillyPoint** · 01-27-2012, 09:58 AM

AFAIK, any ascii-encoded Q scores in fastq or SAM files will be encoded Q+33.
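
Concretely, Q+33 (the Sanger convention) means each quality character is the Q value plus 33, written as an ASCII character. A small sketch, with illustrative function names:

```python
def decode_phred33(qual_string):
    """Convert a Q+33 quality string into a list of integer Q values."""
    return [ord(c) - 33 for c in qual_string]

def encode_phred33(scores):
    """Convert integer Q values back into a Q+33 quality string."""
    return "".join(chr(q + 33) for q in scores)

print(decode_phred33("!+5I"))  # [0, 10, 20, 40]
```

If decoding a file this way ever gives negative numbers or values that look far too high, the file is probably not Q+33.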

See last post for caveats about quality scores, however.

--TS

DeNovo assembly using pacBio data
