**SillyPoint** · 01-17-2012, 08:00 AM

There's a high-level how-to on pacbiodevnet, describing how they used a combination of consensus and long reads to do de novo assembly of an E.coli strain.

As I understand it, the consensus reads are generated during the primary analysis step, which occurs on the PacBio server. The main product of that step is the *.bas.h5 file, which includes basecalls, several quality scores, and limited kinetics info. At this point, adapters have been identified, and reads can be split into sub-reads.

The secondary analysis, running on your cluster, filters subreads based on productivity, high-quality region length and score, to produce a filtered_subreads.fasta file. It also removes control reads (although those are still present in filtered_subreads.fasta -- I think), and then aligns the reads to the reference provided in the protocol to produce a BAM file and various other outputs.

That's my view-from-40000-feet, anyway. Perhaps a helpful PacBio person will wander by and provide a bit more detail.

**krobison** · 01-17-2012, 09:29 AM

MIRA and Celera Assembler are two other assemblers which support de novo assembly using PacBio, and perhaps more importantly mixing PacBio with other technologies.

**jbingham** · 01-18-2012, 05:42 PM

PacBio's base caller outputs sequence data in HDF5 format, PacBio's native data format. The HDF files contain base calls for both long reads and circular consensus (if applicable, meaning the reads wrapped around the adapters), as well as quality scores and kinetic measurements. PacBio provides APIs in Python, R and Java for accessing the files. You can download them from www.pacbiodevnet.com.

When using PacBio's secondary analysis pipeline, you'll get alignments in SAM/BAM+BAI, coverage in BED and GFF, variant calls in GFF and VCF, as well as the FASTA/FASTQ for filtered subreads.

**krittika.sasmal** · 01-18-2012, 09:21 PM

Pipeline for Corrected Long read generation

Thank you for all your answers. Can anybody tell me the pipeline to follow to generate the error-corrected CLR reads? I have downloaded SMRT Pipe.
Moreover, how are the filtered reads generated? Please help me out with the SMRT analysis. I downloaded E. coli raw reads from DevNet. However, there seem to be several bas.h5 files. Do I combine them and proceed?
What parameters does BLASR take? Can anybody help?

**jbingham** · 01-18-2012, 09:36 PM

Your best bet will be to use the FASTQ files rather than the raw bas.h5 files. You can download them from the E coli page here:

http://www.pacbiodevnet.com/Share/Datasets/E-coli-Outbreak

For example, you could grab the two filtered subread files for C227-11: one is for CCS, the other for long reads.

Note that you could download the error-corrected version of the reads as a FASTQ from the same page. To do the error correction yourself, your best bet is pacBioToCA and the Celera Assembler. There are links to them both here:

http://www.pacbiodevnet.com/CodeShare_Project?id=a1q70000000GrT6AAK

Note that PacBio's own error correction pipeline will be incorporated in the next software release.

Also, Mike Schatz's presentation is really useful:

http://schatzlab.cshl.edu/presentations/2011-09-07.PacBio%20Users%20Meeting.pdf

**krittika.sasmal** · 01-19-2012, 04:51 AM

@jbingham - Thanks loads. I found that a pacBio.spec file is required. It is not there for any of the reads downloaded from PacBio DevNet. Is it always supplied with the data, as is written in the manual (in fact I doubt it)?
Can you shed some more light on it?

**GenoMax** · 01-19-2012, 12:53 PM

Krittika,

We discovered that the CLI for the PacBio SMRT analysis software is not fully supported by PacBio (at least that was our experience). We were trying to use the CLI and ran into problems that only the developers could answer. But we never received satisfactory answers. You also need to use some settings xml files that are difficult to reproduce by hand so I would advise staying away from the CLI for the current version of SMRT analysis.

That said, SMRTanalysis software does work through the SMRTPortal web interface they provide (which has its own problems since there is no good security model but if you are the only user then it may not be an issue). So your best bet may be to install that and move forward.

You can set up some of the hybrid assembly through the SMRTportal interface (we are in the process of trying it now). They do recommend having a cluster to run this on so I hope you have access to one and are planning to do this work there.

**jbingham** · 01-19-2012, 01:35 PM

The pacbio.spec file is specific to Celera Assembler. PacBio's pipeline doesn't generate it. Examples are available for SGE

http://www.cbcb.umd.edu/~sergek/PacBio/data/sampleData/pacbio.SGE.spec

and for high memory instances

http://www.cbcb.umd.edu/~sergek/PacBio/data/sampleData/pacbio.spec

Once you've got a working spec file, you should be able to use it for all analyses.

If you use the error corrected reads as your starting point, you can run the SMRT Portal GUI directly, as @GenoMax suggested. You cannot yet do error correction through the GUI. You'll have to do it from the command-line.

**jbingham** · 01-20-2012, 03:28 PM

One more tip: there's also a C++ API to read PacBio HDF files. It's located in the SMRT Analysis source download in

cpp/common/data/hdf/HDFBasReader.h

**rghan** · 01-25-2012, 04:27 AM

question regarding quality scores

Apologies if this is a rather naive question, but http://oelemento.wordpress.com/2011/...uence-dataset/ mentioned that PacBio fastq files contain quality scores (c) for each nucleotide in each read. We are not seeing any quality scores in our initial analysis. Any help or suggestions would be greatly appreciated.

**GenoMax** · 01-25-2012, 04:41 AM

The current default output of SMRT Analysis is FASTA format files, as you have noticed. FASTQ format sequence files will be produced by default in a future version of the SMRT Analysis package, but in the meantime you can get quality values from the *.bas.h5 files by using the script PacBio posted here: https://github.com/PacificBiosciences/pbh5tools/

Tom Skelly from Sanger recently posted a set of useful scripts for PacBio here: https://github.com/TomSkelly/PacBioEDA

**SillyPoint** · 01-26-2012, 10:35 AM

Actually, the answer to rghan's "question regarding quality scores" is tougher than it looks. First off, it depends on what you want the fastq file to contain.

If you're after the circular consensus reads, that exists as Analysis_Results/<MovieName>.ccs.fastq.

But if it's the individual raw reads you're after, what do you want to see? All the bases from all the reads? Probably not: you can't feed that to an aligner, for example. You probably want the raw reads to be split up into subreads of contiguous sequence, with the adapters removed. And you probably want only productivity-1 reads. I.e., you want the fastq equivalent of the filtered_subreads.fasta file produced by secondary analysis.
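To make the splitting step concrete, here is a minimal sketch in Python. It is not PacBio's implementation: the `RawRead` record, its field names, the half-open coordinate convention, the `min_length` default and the `name/start_end` naming are all assumptions made for illustration, and in practice the adapter and high-quality-region coordinates would have to be pulled out of the bas.h5 file first.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RawRead:
    """Hypothetical in-memory view of one raw read (not the bas.h5 layout)."""
    name: str
    bases: str
    adapters: List[Tuple[int, int]]  # half-open [start, end) adapter hits
    hq_region: Tuple[int, int]       # half-open [start, end) high-quality region
    productivity: int


def split_subreads(read: RawRead, min_length: int = 50) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs for the adapter-free pieces of a read.

    Only productivity-1 reads are kept, pieces are clipped to the
    high-quality region, and pieces shorter than min_length are dropped.
    """
    if read.productivity != 1:
        return []
    hq_start, hq_end = read.hq_region
    pieces = []
    cursor = hq_start
    for a_start, a_end in sorted(read.adapters):
        if a_end <= cursor:
            continue  # adapter lies entirely before the current position
        if a_start >= hq_end:
            break     # adapter lies past the high-quality region
        piece_end = min(a_start, hq_end)
        if piece_end - cursor >= min_length:
            pieces.append((cursor, piece_end))
        cursor = max(cursor, a_end)
    # Whatever remains between the last adapter and the end of the HQ region.
    if hq_end - cursor >= min_length:
        pieces.append((cursor, hq_end))
    return [(f"{read.name}/{s}_{e}", read.bases[s:e]) for s, e in pieces]
```

To get FASTQ rather than FASTA out of this, the per-base quality string would be sliced with the same coordinates as the bases.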

pbh5tools won't give you that, I'm afraid. (Nor will my package.) "bash5tools.py --outType fastq --readType Raw" produces a fastq file containing all the bases from all the reads, unfiltered and un-split.

You could extract a fastq file from aligned_reads.sam. But that gives you just what it says: only the sub-reads which secondary analysis managed to align.
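For anyone who wants to try the SAM route, a rough sketch in Python follows. It relies only on the standard SAM column layout and flag bits (0x4 unmapped, 0x10 reverse strand, 0x100 secondary, 0x800 supplementary); the function names are made up for this example, and it assumes the file actually stores base qualities in the QUAL column, so records with `*` there are skipped. Hard-clipped bases are not stored in SAM and cannot be recovered this way.

```python
# Translation table used to complement bases on reverse-strand alignments.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sam_to_fastq_records(sam_lines):
    """Yield FASTQ records (as strings) for the primary aligned reads in a SAM."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # malformed record
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue  # unmapped, secondary or supplementary
        if seq == "*" or qual == "*":
            continue  # no sequence or no qualities stored
        if flag & 0x10:
            # SAM stores reverse-strand reads reverse-complemented;
            # undo that to get the read as sequenced.
            seq = reverse_complement(seq)
            qual = qual[::-1]
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

Usage would be along the lines of opening aligned_reads.sam, passing the file object to `sam_to_fastq_records`, and writing each yielded record to an output file.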

The next question is: What do those Q scores mean, anyway?

The bas.h5 file includes 4 separate probability scores for each basecall: substitution, insertion, deletion Q-probabilities, and an overall "QualityValue". The first three are easy to understand, but I've never been clear on what the 4th one represents. That's the score you see in the SAM and pbh5tools files.

I've heard it said that QualityValue is the Q-encoded combination of the first three probabilities. But looking at data, that doesn't appear to be true. (Can't read the code: it's part of primary analysis, not released by PacBio.)
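If you want to test that claim on your own data, one possible reading of "combination" can be sketched as below. This is purely an assumption for comparison purposes, not PacBio's formula: it treats the three error types as independent and asks for the probability that at least one of them occurred. The function names are hypothetical.

```python
import math


def q_to_p(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def p_to_q(p):
    """Inverse of q_to_p; p must be greater than zero."""
    return -10 * math.log10(p)


def combined_q(sub_q, ins_q, del_q):
    """Q score for 'any of the three errors', assuming they are independent.

    This is one guess at how an overall score could be derived, offered
    only so it can be compared against the stored QualityValue.
    """
    p_no_error = (1 - q_to_p(sub_q)) * (1 - q_to_p(ins_q)) * (1 - q_to_p(del_q))
    return p_to_q(1 - p_no_error)
```

Comparing `combined_q` of the three per-base values against the stored QualityValue over a few thousand bases should show quickly whether this particular combination matches or not.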

And in any case, what do you make of the deletion probability? That's the prob that this basecall may have been followed (preceded?) by a missed base. That doesn't tell you anything about the validity of the basecall itself.

Perhaps some helpful PacBio person can shed a bit more light on all this.

--TS

**krittika.sasmal** · 01-26-2012, 08:40 PM

Quality scores in the pacbio .fastq files

Hi, I wanted to know what kind of quality scores are in a fastq file from PacBio. Phred+33 or Phred+64? Or are they Sanger-type quality scores?

**SillyPoint** · 01-27-2012, 09:58 AM

AFAIK, any ascii-encoded Q scores in fastq or SAM files will be encoded Q+33.
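In case the Q+33 convention is unfamiliar, here is a small Python sketch of decoding it: each quality character's ASCII code minus 33 is the Q score, and the standard Phred relation converts that to an error probability. The function names are just for illustration.

```python
def decode_phred33(quality_string):
    """Convert an ASCII quality string (Q+33) to a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = decode_phred33("!+5I")
print(scores)  # prints [0, 10, 20, 40]
```

A file encoded Q+64 would need 64 subtracted instead. Characters below "@" (ASCII 64) in the quality lines rule out Q+64, since they would decode to negative scores.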

See last post for caveats about quality scores, however.

--TS

DeNovo assembly using pacBio data