De novo assembly using PacBio data


krittika.sasmal
01-17-2012, 12:52 AM
Hi,
I am new to PacBio. I have visited the PacBio site, but I could not clearly understand which files are generated by primary analysis (base calling). Are they CLR or CCS files? How do we actually go about generating the filtered read files? If I want to use SMRT Pipe, what input does it take?

Also, can anybody please suggest a de novo assembler that works well with PacBio data?

SillyPoint
01-17-2012, 07:00 AM
There's a high-level how-to (http://www.pacbiodevnet.com/Learn/How-To/E-coli-O104-Assembly) on pacbiodevnet, describing how they used a combination of consensus and long reads to do de novo assembly of an E.coli strain.

As I understand it, the consensus reads are generated during the primary analysis step, which occurs on the PacBio server. The main product of that step is the *.bas.h5 file, which includes basecalls, several quality scores, and limited kinetics info. At this point, adapters have been identified, and reads can be split into sub-reads.

The secondary analysis, running on your cluster, filters subreads based on productivity, high-quality region length and score, to produce a filtered_subreads.fasta file. It also removes control reads (although those are still present in filtered_subreads.fasta -- I think), and then performs an alignment to the reference provided in the protocol to produce a BAM file and various other outputs.
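
If you ever need to redo a crude version of that length filter yourself, a minimal Biopython sketch follows -- this is not PacBio's actual filter, just an illustrative length cutoff, and the filenames and threshold are made up:

# Minimal sketch of a post-hoc length filter, NOT PacBio's real subread filter.
# Assumes Biopython is installed; filenames and cutoff are hypothetical.
from Bio import SeqIO

MIN_LEN = 50  # arbitrary cutoff for illustration
kept = (rec for rec in SeqIO.parse("subreads.fasta", "fasta") if len(rec.seq) >= MIN_LEN)
SeqIO.write(kept, "filtered_subreads.fasta", "fasta")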

That's my view-from-40000-feet, anyway. Perhaps a helpful PacBio person will wander by and provide a bit more detail.

krobison
01-17-2012, 08:29 AM
MIRA (http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page) and Celera Assembler (http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page) are two other assemblers which support de novo assembly using PacBio, and perhaps more importantly mixing PacBio with other technologies.

jbingham
01-18-2012, 04:42 PM
PacBio's base caller outputs sequence data in HDF5 format, PacBio's native data format. The HDF files contain base calls for both long reads and circular consensus (if applicable, meaning the reads wrapped around the adapters), as well as quality scores and kinetic measurements. PacBio provides APIs in Python, R and Java for accessing the files. You can download them from www.pacbiodevnet.com.
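
If you just want to poke around without the PacBio APIs, plain h5py works too. A minimal sketch -- the dataset paths below follow the usual bas.h5 layout, but verify them against your own file (e.g. with h5ls), and the filename is hypothetical:

# Sketch: peek inside a bas.h5 with h5py (assumed dataset paths; verify with h5ls).
import h5py

with h5py.File("movie.bas.h5", "r") as f:
    basecalls = f["/PulseData/BaseCalls/Basecall"][:]       # one byte per called base
    qvs       = f["/PulseData/BaseCalls/QualityValue"][:]   # overall per-base QV
    nevents   = f["/PulseData/BaseCalls/ZMW/NumEvent"][:]   # bases per ZMW, in order
    print("total bases:", len(basecalls), "ZMWs:", len(nevents))
    # Reads are concatenated; use the NumEvent offsets to slice out one ZMW's read.
    first_read = bytes(basecalls[:nevents[0]]).decode("ascii")
    print("first ZMW read:", first_read[:60])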

When using PacBio's secondary analysis pipeline, you'll get alignments in SAM/BAM+BAI, coverage in BED and GFF, variant calls in GFF and VCF, as well as the FASTA/FASTQ for filtered subreads.
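
As a quick sanity check on the alignment output, pysam can summarize the BAM; a sketch assuming a BAM named aligned_reads.bam (the filename is my assumption, and very old pysam releases spell AlignmentFile as Samfile):

# Sketch: count mapped vs unmapped records in the secondary-analysis BAM.
# Filename is hypothetical; older pysam releases use pysam.Samfile instead.
import pysam

with pysam.AlignmentFile("aligned_reads.bam", "rb") as bam:
    mapped = unmapped = 0
    for rec in bam.fetch(until_eof=True):
        if rec.is_unmapped:
            unmapped += 1
        else:
            mapped += 1
print("mapped:", mapped, "unmapped:", unmapped)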

krittika.sasmal
01-18-2012, 08:21 PM
Thank you for all your answers. Can anybody tell me the pipeline to be followed to generate the error-corrected CLR reads? I have downloaded SMRT Pipe.
Moreover, how are the filtered reads generated? Help me out with the SMRT analysis. I downloaded E. coli raw reads from DevNet; however, there seem to be several bas.h5 files. Do I combine them and proceed?
What parameters does BLASR take? Can anybody help?

jbingham
01-18-2012, 08:36 PM
Your best bet will be to use the FASTQ files rather than the raw bas.h5 files. You can download them from the E. coli page here:
http://www.pacbiodevnet.com/Share/Datasets/E-coli-Outbreak

For example, you could grab the two filtered subread files for C227-11: one is for CCS, the other for long reads.

Note that you could download the error-corrected version of the reads as a FASTQ from the same page. To do the error correction yourself, your best bet is pacBioToCA and the Celera Assembler. There are links to them both here:
http://www.pacbiodevnet.com/CodeShare_Project?id=a1q70000000GrT6AAK

One reason to go this route is that PacBio's own error correction pipeline won't be available until it is incorporated in the next software release.

Also, Mike Schatz's presentation is really useful:
http://schatzlab.cshl.edu/presentations/2011-09-07.PacBio%20Users%20Meeting.pdf
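
Whichever FASTQ you start from, it is worth checking the read-length distribution before assembly, since error correction feeds on the long tail. A small Biopython sketch (the filename is hypothetical):

# Sketch: read-length summary for a subreads FASTQ (filename hypothetical).
from Bio import SeqIO

lengths = sorted((len(rec) for rec in SeqIO.parse("filtered_subreads.fastq", "fastq")),
                 reverse=True)
total = sum(lengths)
print("reads:", len(lengths), "bases:", total, "mean:", total / float(len(lengths)))
acc = 0
for l in lengths:  # N50: length at which half the total bases are in reads this long or longer
    acc += l
    if acc >= total / 2.0:
        print("N50:", l)
        break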

krittika.sasmal
01-19-2012, 03:51 AM
@jbingham - Thanks loads. I found that a pacbio.spec file is required. It is not there for any of the reads downloaded from PacBio DevNet. Is it always supplied with the data, as the manual says (in fact I doubt it)?
Can you shed some more light on it?

GenoMax
01-19-2012, 11:53 AM
Krittika,

We discovered that the CLI for the PacBio SMRT Analysis software is not fully supported by PacBio (at least that was our experience). We were trying to use the CLI and ran into problems that only the developers could answer, but we never received satisfactory answers. You also need to use some settings XML files that are difficult to reproduce by hand, so I would advise staying away from the CLI for the current version of SMRT Analysis.

That said, the SMRT Analysis software does work through the SMRT Portal web interface they provide (which has its own problems, since there is no good security model, but if you are the only user that may not be an issue). So your best bet may be to install that and move forward.

You can set up some of the hybrid assembly through the SMRT Portal interface (we are in the process of trying it now). They do recommend having a cluster to run this on, so I hope you have access to one and are planning to do this work there.

jbingham
01-19-2012, 12:35 PM
The pacbio.spec file is specific to Celera Assembler; PacBio's pipeline doesn't generate it. Examples are available for SGE:

http://www.cbcb.umd.edu/~sergek/PacBio/data/sampleData/pacbio.SGE.spec

and for high-memory instances:

http://www.cbcb.umd.edu/~sergek/PacBio/data/sampleData/pacbio.spec

Once you've got a working spec file, you should be able to use it for all analyses.

If you use the error-corrected reads as your starting point, you can run the SMRT Portal GUI directly, as @GenoMax suggested. You cannot yet do error correction through the GUI; you'll have to do it from the command line.

jbingham
01-20-2012, 02:28 PM
One more tip: there's also a C++ API to read PacBio HDF files. It's located in the SMRT Analysis source download in

cpp/common/data/hdf/HDFBasReader.h

rghan
01-25-2012, 03:27 AM
Apologies if this is a rather naive question, but http://oelemento.wordpress.com/2011/01/03/a-closer-look-at-the-first-pacbio-sequence-dataset/ mentioned that PacBio fastq files contain quality scores for each nucleotide in each read. We are not seeing any quality scores in our initial analysis. Any help or suggestions would be greatly appreciated.

GenoMax
01-25-2012, 03:41 AM
The current default output of SMRT Analysis is fasta-format files, as you have noticed. Fastq-format sequence files will be produced by default in a future version of the SMRT Analysis package, but in the meantime you can get quality values from the *.bas.h5 files by using the script PacBio posted here: https://github.com/PacificBiosciences/pbh5tools/

Tom Skelly from Sanger recently posted a set of useful scripts for PacBio here: https://github.com/TomSkelly/PacBioEDA

SillyPoint
01-26-2012, 09:35 AM
Actually, the answer to rghan's question regarding quality scores is tougher than it looks. First off, it depends on what you want the fastq file to contain.

If you're after the circular consensus reads, that exists as Analysis_Results/<MovieName>.ccs.fastq.

But if it's the individual raw reads you're after, what do you want to see? All the bases from all the reads? Probably not: you can't feed that to an aligner, for example. You probably want the raw reads to be split up into subreads of contiguous sequence, with the adapters removed. And you probably want only productivity-1 reads. I.e., you want the fastq equivalent of the filtered_subreads.fasta file produced by secondary analysis.

pbh5tools won't give you that, I'm afraid (nor will my package). "bash5tools.py --outType fastq --readType Raw" produces a fastq file containing all the bases from all the reads, unfiltered and un-split.

You could extract a fastq file from aligned_reads.sam. But that gives you just what it says: only the sub-reads which secondary analysis managed to align.
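
One DIY workaround, if you have the full raw fastq from bash5tools plus a filtered_subreads.fasta from secondary analysis, is to intersect them by read ID. Caveat: this only works if the ID strings actually match between the two files, so check a few records by hand first. A sketch with Biopython, filenames hypothetical:

# Sketch: keep only fastq records whose IDs appear in filtered_subreads.fasta.
# Assumes identical ID strings in both files -- verify a few by hand first.
from Bio import SeqIO

wanted = set(rec.id for rec in SeqIO.parse("filtered_subreads.fasta", "fasta"))
kept = (rec for rec in SeqIO.parse("all_reads.fastq", "fastq") if rec.id in wanted)
n = SeqIO.write(kept, "filtered_subreads.fastq", "fastq")
print("wrote", n, "records")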

The next question is: What do those Q scores mean, anyway?

The bas.h5 file includes 4 separate probability scores for each basecall: substitution, insertion, deletion Q-probabilities, and an overall "QualityValue". The first three are easy to understand, but I've never been clear on what the 4th one represents. That's the score you see in the SAM and pbh5tools files.

I've heard it said that QualityValue is the Q-encoded combination of the first three probabilities, but looking at data, that doesn't appear to be true. (I can't read the code: it's part of primary analysis, not released by PacBio.)

And in any case, what do you make of the deletion probability? That's the prob that this basecall may have been followed (preceded?) by a missed base. That doesn't tell you anything about the validity of the basecall itself.
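
If anyone wants to test that "combination" theory on their own data, a back-of-envelope check is easy: convert each of the three QVs to an error probability, sum them, re-encode as phred, and compare with QualityValue. The dataset names below follow the usual bas.h5 layout (verify with h5ls); this is a rough check, not a statement of what primary analysis actually computes.

# Rough check: is QualityValue ~ phred(p_sub + p_ins + p_del)?
# Dataset names assumed from the standard bas.h5 layout -- verify before trusting.
import h5py
import numpy as np

with h5py.File("movie.bas.h5", "r") as f:
    g = f["/PulseData/BaseCalls"]
    sub  = g["SubstitutionQV"][:10000].astype(float)
    ins  = g["InsertionQV"][:10000].astype(float)
    dele = g["DeletionQV"][:10000].astype(float)
    qv   = g["QualityValue"][:10000].astype(float)

p = 10 ** (-sub / 10) + 10 ** (-ins / 10) + 10 ** (-dele / 10)
combined = -10 * np.log10(np.clip(p, 1e-10, 1.0))
print("mean |QualityValue - combined|:", np.abs(qv - combined).mean())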

Perhaps some helpful PacBio person can shed a bit more light on all this.

--TS

krittika.sasmal
01-26-2012, 07:40 PM
Hi, I wanted to know what kind of quality scores are in a fastq file from PacBio. Phred+33 or Phred+64? Or is it Sanger-type quality scores?

SillyPoint
01-27-2012, 08:58 AM
AFAIK, any ascii-encoded Q scores in fastq or SAM files will be encoded Q+33.
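
Decoding is the usual Sanger arithmetic, e.g. in Python:

# Decode Sanger/phred+33 quality characters to integer Q scores.
quals = "!#5I"                       # example ascii quality string
print([ord(c) - 33 for c in quals])  # -> [0, 2, 20, 40]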

See last post for caveats about quality scores, however.

--TS

jbingham
01-28-2012, 08:24 AM
The quality scores are phred-style. Insertion QVs give the likelihood that the given base is itself an insertion. Deletion QVs refer to the immediately preceding base, and the most likely deleted base is stored as the DeletionTag in the HDF file. In PacBio data, insertions are the most common error, so the QV in the FASTQ is dominated by the insertion QV.

rghan
01-30-2012, 11:58 PM
Replying to SillyPoint-

We were interested in employing the pacBioToCA script, but the pacBioToCA pipeline expects PacBio RS sequences in fastq format (with Sanger/Phred+33 quality values). We were only given an assembled.fasta and a filtered_subreads.fasta.

sergek
01-31-2012, 11:15 AM
I am the developer of pacBioToCA, happy to see interest in the pipeline. If you have only fasta files, the pacBioToCA wiki page includes a section on inputting PacBio RS sequences: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA#Inputting_PacBio_RS_Sequences

We provide a java utility to convert the fasta data to fastq with uniform quality values (http://www.cbcb.umd.edu/~sergek/PacBio/data/convertFastaAndQualToFastq.tar.gz). The instructions for using it are at the above link.
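
If you would rather stay in Python than use the Java utility, the same conversion is a few lines of Biopython; to be clear, this is my own sketch, not the official tool, and the uniform QV below is an arbitrary placeholder:

# Sketch: fasta -> fastq with a uniform placeholder quality value.
from Bio import SeqIO

UNIFORM_QV = 20  # arbitrary placeholder; pick to suit your downstream tool

def with_quals(records):
    for rec in records:
        rec.letter_annotations["phred_quality"] = [UNIFORM_QV] * len(rec)
        yield rec

n = SeqIO.write(with_quals(SeqIO.parse("filtered_subreads.fasta", "fasta")),
                "filtered_subreads.fastq", "fastq")
print("converted", n, "records")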

rghan
01-31-2012, 11:40 AM
Thanks for the information, sergek. I will try and take a look at this when I get to the lab in the morning.

krittika.sasmal
02-02-2012, 03:21 AM
Thanks for all your replies. I want to understand the SMRT Pipe workflow for running an assembly. I understand that BLASR has to be run to align the long reads and CCS reads, and then a consensus is made with make-consensus from AMOS. However, the input to the AMOS make-consensus is the TIG file. How do we go about generating that from the BLASR output? Please, somebody help me with the pipeline.

sagarutturkar
02-14-2013, 10:35 AM
Hi All,

We are trying to install the SMRT Analysis software from PacBio. We are getting the error below:

File "./smrtpipe.py", line 4, in <module>
import pkg_resources
File "/usr/local/lib64/python2.6/site-packages/distribute-0.6.24-py2.6.egg/pkg_resources.py", line 2707, in <module>
working_set.require(__requires__)
File "/usr/local/lib64/python2.6/site-packages/distribute-0.6.24-py2.6.egg/pkg_resources.py", line 686, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/local/lib64/python2.6/site-packages/distribute-0.6.24-py2.6.egg/pkg_resources.py", line 584, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: pbpy==0.1

Seems like something is wrong with Python. We updated to Python 2.7 and included the correct path in our .bashrc files, but we still get the same error.

Any suggestions from previous experience?

Thanks

GenoMax
02-14-2013, 11:00 AM
I think SMRT Analysis comes bundled with its own Python. The error above seems to be referring to a system python2.6 directory, though, so are you using the system Python?

For clarity you should have started a new thread instead of posting under this one.

sagarutturkar
02-14-2013, 11:33 AM

Hi,

Thanks for the quick reply. I have created new thread here:
http://seqanswers.com/forums/showthread.php?p=96421#post96421

and updated the query with more details. Please reply.

Thanks

juassis
02-21-2013, 07:39 AM
Hello

I don't understand the error message generated (de novo assembly using PacBio data); with the sample data (E. coli and lambda) everything is OK.

I ran the command 'smrtpipe.py --params=settings.xml xml:input.xml &>smrtpipe.err' and got the error log message below:

[INFO] 2013-02-20 15:57:08,552 [pbpy.smrtpipe.SmrtDataService writeTo 424] Writing 6 items to DataStore in {'smrt.data.xmlparam': <pbpy.io.MetaAnalysisXml.InputDataUrl object at 0x4295a50>, 'smrt.output.log': '/sto4data-2/zebu4/data/06022013_smrtpipe/teste1_gir/log', 'smrt.data.cmdline': <pbpy.smrtpipe.InputData.CompositeInputData object at 0x42959d0>, 'smrt.output.root': '/sto4data-2/zebu4/data/06022013_smrtpipe/teste1_gir', 'smrt.output.results': '/sto4data-2/zebu4/data/06022013_smrtpipe/teste1_gir/results', 'smrt.output.data': '/sto4data-2/zebu4/data/06022013_smrtpipe/teste1_gir/data'}
[INFO] 2013-02-20 15:57:08,555 [pbpy.smrtpipe.SmrtPipeMain _runTasks 267] Skipping PreWorkflow as it contains zero tasks
[INFO] 2013-02-20 15:57:08,558 [pbpy.smrtpipe.SmrtPipeMain _runTasks 270] Loading 10 tasks into Workflow
[INFO] 2013-02-20 15:57:09,275 [pbpy.smrtpipe.SmrtPipeMain _runTasks 279] Executing workflow Workflow
[INFO] 2013-02-20 15:57:09,649 [pbpy.smrtpipe.engine.SmrtPipeTasks run 622] Running task://Anonymous/P_Fetch/toFofn
[ERROR] 2013-02-20 15:57:14,702 [pbpy.smrtpipe.SmrtPipeMain run 648] time data 'Qua Fev 20 15:57:09 CST 2013' does not match format '%a %b %d %H:%M:%S %Z %Y'
Traceback (most recent call last):
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/SmrtPipeMain.py", line 608, in run
    self._runTasks(pModules)
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/SmrtPipeMain.py", line 281, in _runTasks
    workflow.execute()
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeWorkflow.py", line 607, in execute
    self._update(0)
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeWorkflow.py", line 574, in _update
    self._writeWorkflow()
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeWorkflow.py", line 554, in _writeWorkflow
    self._graph.toFile(path, format)
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtDAG.py", line 258, in toFile
    out.write(format2func[format](self))
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtDAG.py", line 255, in <lambda>
    'RDF': lambda g: g.toRDF().serialize(),
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtDAG.py", line 208, in toRDF
    for s, p, o in node.toRDF():
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtDAG.py", line 81, in toRDF
    Literal(str(self.obj.computeTime))))
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeTasks.py", line 834, in computeTime
    self._extractComputeTime(regexp)
  File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeTasks.py", line 821, in _extractComputeTime
    self._cachedExecTimes[regexp] = datetime.datetime.strptime(match.group(1), LOG_TIME_FORMAT)
  File "/opt/smrtanalysis-1.4.0/redist/python2.7/lib/python2.7/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data 'Qua Fev 20 15:57:09 CST 2013' does not match format '%a %b %d %H:%M:%S %Z %Y'
[ERROR] 2013-02-20 15:57:14,704 [pbpy.smrtpipe.SmrtPipeMain exit 760] time data 'Qua Fev 20 15:57:09 CST 2013' does not match format '%a %b %d %H:%M:%S %Z %Y'


I need help =)

rhall
02-21-2013, 08:46 AM
juassis,
Unfortunately it is a bug related to the system locale. As a fix, add the following two lines to $SEYMOUR_HOME/etc/setup.sh:
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
These set environment variables used by Python. PacBio is aware of the bug and it should be fixed in the next release.
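
The underlying issue is easy to reproduce: Python's strftime/strptime use the current locale's day and month names, so a timestamp written under a Portuguese locale ('Qua Fev ...') cannot be parsed back under an English one. A toy demonstration (assumes the pt_BR.UTF-8 locale is installed on the system):

# Demo of the locale-dependent strptime failure (pt_BR.UTF-8 must be installed).
import datetime
import locale

locale.setlocale(locale.LC_TIME, "pt_BR.UTF-8")
stamp = datetime.datetime(2013, 2, 20, 15, 57, 9).strftime("%a %b %d %H:%M:%S %Y")
print(stamp)  # e.g. 'Qua Fev 20 15:57:09 2013'

locale.setlocale(locale.LC_TIME, "en_US.UTF-8")
datetime.datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")  # raises ValueError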

juassis
02-21-2013, 03:18 PM
Hello!
Thanks for the information!
Worked properly! =)

Just one more question:
it worked in the first analysis; however, when I presented new data from another breed this error came up again. Will I have to apply the fix every time I want to run an analysis?

rhall
02-22-2013, 09:44 AM
If the lines are added to the $SEYMOUR_HOME/etc/setup.sh file then SMRT pipe should function correctly after the setup.sh file is sourced.

juassis
03-07-2013, 07:34 AM
Hello,
Thank you very much for your help and comments. It was possible to correct several samples.
Now some problems again: I was able to run the smrtpipe.py command without any errors. However, when I tried to run SMRT Pipe again, this error message appeared:

. /opt/smrtanalysis/etc/setup.sh
$ smrtpipe.py --params=gir_params.xml xml:gir_input.xml

Bus error (core dumped)

--
I did the memory test, and everything is ok.

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 4133745
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Filesystem Size Used Avail Use% Mounted on
/dev/sdg 12T 900G 11T 9% /


Many thanks for your help. =)

rhall
03-07-2013, 07:51 AM
Do you get any output at all? Anything in the ./log/ directory?

What distribution are you running on? This could possibly be a result of a mismatch between the system it is running on and the build system (either Ubuntu 10.04 or CentOS 5.6).

rhall
03-07-2013, 07:55 AM
Also try running with the --debug option.

juassis
03-07-2013, 08:07 AM
No log was generated; I can't even run smrtpipe.py --help.
Sometimes the pipeline worked.

distribution:

Red Hat Enterprise Linux Server release 6.4 (Santiago).

thanks =)

rhall
03-07-2013, 08:13 AM
Unfortunately this is due to the system/build-system mismatch, and SMRT Analysis cannot be easily rebuilt. My only suggestion would be to run a virtual machine using something like VirtualBox (https://www.virtualbox.org/), installing an Ubuntu 10.04 system and then installing SMRT Analysis on top of that.

juassis
03-07-2013, 08:34 AM
rhall,
do you work for PacBio? Is there a support contact?

It's not possible for me to work on another machine.

Many thanks for your help. =)

rhall
03-07-2013, 08:45 AM
I work for PacBio, but not in tech support. If the installation is tied to a machine then go through your FAS for SMRT Analysis install support.

SMRT Analysis is not supported on Red Hat 6.4. You do not have to work on another machine; running VirtualBox on the Red Hat system will allow you to install a virtual Ubuntu 10.04 system for running SMRT Analysis.
For a discussion of the options, including using an Amazon AMI, see:
http://seqanswers.com/forums/showthread.php?p=96421#post96421

juassis
03-07-2013, 08:51 AM
unhh, ok!
I'll try the running virtual box!

thanks

GenoMax
03-07-2013, 08:51 AM
I work for PacBio, but not in tech support. If the installation is tied to a machine then go through your FAS for SMRT Analysis install support.


That solves that mystery. You had very specific info that normal users do not have (and we have had a PacBio for a while).

Juassis: PacBio has a customer portal at: http://www.pacbioportal.com/. I am not sure if you need to have an instrument to get access. As rhall suggested you may need to go through your FAS.

yaximik
03-28-2013, 05:14 AM
After browsing through the PacBio site I got the impression that the system is now self-contained, that is, both primary and secondary analysis are done locally. Is that correct, or are raw data still uploaded to PacBio's server for primary analysis and then sent back for secondary analysis done locally? That would make it quite dependent on network reliability and transfer rate, which vary a lot on my local network.

rhall
03-28-2013, 08:26 AM
Primary analysis is done on the machine, the data is then transferred off the machine for secondary.

yaximik
03-28-2013, 08:57 AM
Primary analysis is done on the machine, the data is then transferred off the machine for secondary.

I guess I was not clear. My understanding was that, since the beginning, primary processing (base calling, etc.) was done somewhere at PacBio using their proprietary software. Correct me if I am wrong.

Now it seems it is all done locally, I mean on the instrument (primary) and off-instrument, on some separate data rig (secondary).

It is mentioned that a blade cluster is part of the instrument, I guess inside their big box, intended for primary processing. Then a separate big cluster is needed for secondary analysis, which is not part of the package, correct?

I sent an info request to PacBio; I hope their sales people get back to me and explain all that.

rhall
03-28-2013, 09:16 AM
Primary analysis, base calling, has always been done on the blade cluster, which is part of the instrument.
A separate cluster is needed for secondary analysis tasks; that is not part of the package that PacBio sells. Its size depends on what you want to do and can range from a high-end workstation to a 100-node cluster.

GenoMax
03-28-2013, 11:43 AM
Then a separate big cluster is needed for secondary analysis, which is not the part of the package, correct?


Rhall has already explained the basics. You can download the secondary analysis software (both command line/web front end) here: http://pacbiodevnet.com/.

Note: the secondary analysis software is only supported on CentOS and Ubuntu (note the specific versions therein). Stepping outside those will likely prove an exercise in futility (i.e., do not try it; in most cases it does not work). If you are running an OS version that is not one of the two supported, consider using a virtual server.

rhall
04-17-2013, 12:14 PM
For anyone still having issues with SMRT Analysis installations, I'm experimenting with a simpler, somewhat limited, install method using Vagrant and VirtualBox.
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/SMRT-Analysis-2.0-Virtual-Machine-Install-(Experimental)

juassis
05-21-2013, 05:01 AM
I work for PacBio, but not in tech support. If the installation is tied to a machine then go through your FAS for SMRT Analysis install support.

Cheers for your answer, it was very helpful.
But I don't understand the error message generated with SOLiD data.

SOLiD - mate-pair library - reads = 2,159,990,294 - coverage 37x (151 GB)

error log message as below:

[ERROR] 2013-05-20 14:24:19,051 [pbpy.smrtpipe.engine.SmrtPipeWorkflow execute 602] task://Anonymous/P_Filter/filter_001of001 Failed
Traceback (most recent call last):
File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeWorkflow.py", line 600, in execute
raise WorkflowError(task.error)
WorkflowError: task://Anonymous/P_Filter/filter_001of001 Failed

[ERROR] 2013-05-20 14:24:21,218 [pbpy.smrtpipe.SmrtPipeMain run 648] SmrtExit task://Anonymous/P_Filter/filter_001of001 Failed
Traceback (most recent call last):
File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/SmrtPipeMain.py", line 608, in run
self._runTasks(pModules)
File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/SmrtPipeMain.py", line 281, in _runTasks
workflow.execute()
File "/opt/smrtanalysis-1.4.0/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/pbpy/smrtpipe/engine/SmrtPipeWorkflow.py", line 605, in execute
raise SmrtExit(str(e))
SmrtExit: SmrtExit task://Anonymous/P_Filter/filter_001of001 Failed
[ERROR] 2013-05-20 14:24:21,238 [pbpy.smrtpipe.SmrtPipeMain exit 760] SmrtExit task://Anonymous/P_Filter/filter_001of001 Failed

---

thanks

rhall
05-24-2013, 03:10 PM
Correct me if I'm wrong, but are you trying to use SMRT Analysis with SOLiD sequence data? While some parts of SMRT Analysis may be able to handle SOLiD data, the workflow is really designed for PacBio data.
Trying to run the command in ./workflow/P_Filter/filter_001of001.sh should give you a more descriptive error.

sagarutturkar
06-04-2013, 10:51 AM
Hi,

I recently upgraded to version 2.0 of SMRT Portal. When I go to design a new job, I see 3 SMRT cells available with different IDs. Is there any way I could delete these SMRT cells and re-import them fresh?

My role is 'Scientist' and I am assigned to group 'all'.

Also, when I tried to run this with the test data mentioned at:

https://github.com/PacificBiosciences/SMRT-Analysis/wiki/SMRT-Analysis-Software-Installation-v2.0

I got the error given in the log file.

Please help with this.

Thanks
Sagar

sagarutturkar
06-04-2013, 11:24 AM
Attached log file here.

GenoMax
06-04-2013, 11:29 AM
rhall provided an important pointer about SMRT Analysis v2.0 which got our install of SMRT Portal going with the HGAP protocol.

The critical pointer was to make sure that the cluster nodes were also set up as submit hosts. I do not think that is explicitly mentioned in the install materials (at least it was not before).

rhall
06-04-2013, 11:35 AM
Sagar, can you try posting the log file again?
Thanks.

sagarutturkar
06-05-2013, 08:29 AM
Hi rhall,

Thanks for your reply and for providing the data. I am able to run SMRT Portal 2.0 now, and I was able to run HGAP and AHA. For our small microbial genomes, I have 1 SMRT cell sequenced, which is around 25x coverage.

Genome 1 - CF80
The AHA and HGAP runs failed; logs attached. I was able to run AHA through the command line (v1.4) using the same files (I used the FASTX-Toolkit to convert the filtered_subreads.fastq file to fasta format and used the existing assembly as the reference).

Genome 2 - BT
I was able to run HGAP, but the expected genome size was 11 Mb and my assembly is much smaller. What can be done to improve this?

Logs attached.

Thanks
Sagar

rhall
06-05-2013, 09:56 AM
CF80: how old is this data? The long_ and strobe_ bas.h5 files do not contain any data and, as far as I know, were never generated by the PacBio RS. Try removing these files and running again.

BT: HGAP requires a minimum of ~70x raw read coverage to assemble a complete genome. As a rough calculation, I would estimate that you would need ~8 cells of data (with a long insert size) to get a good assembly. On the new RS II this would be fewer, but it is still dependent on the library size.
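
The arithmetic behind that estimate, as a sketch -- the per-cell yield is inferred from the numbers in this thread (one cell gave ~25x on a small microbial genome), not a spec, so plug in your own figures:

# Back-of-envelope: SMRT cells needed for a target raw coverage.
# yield_per_cell_mb is an assumption inferred from this thread, not a spec.
genome_size_mb = 11.0      # expected genome size (Mb)
target_coverage = 70.0     # ~70x raw coverage recommended for HGAP
yield_per_cell_mb = 100.0  # assumed raw yield per SMRT cell (Mb)

cells = genome_size_mb * target_coverage / yield_per_cell_mb
print("cells needed: %.1f" % cells)  # ~7.7, i.e. about 8 cells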

Also note that you set an expected genome size of 5 Mb in the HGAP run.
SMRT Portal has a limit of 10 Mb for HGAP assemblies; it is possible to assemble up to 130 Mb using the command-line SMRT Pipe.