SEQanswers

Old 03-13-2015, 06:30 PM   #1
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default Importing Illumina data onto SMRT portal

Greetings, I am trying to run the RS_CeleraAssembler protocol in SMRT Portal. I have set this up the "easy" way, using Amazon cloud computing. I am able to move the PacBio data onto the cloud, but I cannot get the Illumina data in, and I am not sure where it should go. I have tried placing it with the reference sequence files (although the reads are not "reference genomes"), but this fails. Any suggestions would be welcome.
Old 03-14-2015, 05:05 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Do you have original data for the SMRTcell(s) you are trying to import?

Here is a list of minimal files you need (*.metadata.xml file and all *.bax.h5 and *.bas.h5): https://github.com/PacificBioscience...to-SMRT-Portal

BTW: See this if you are specifically looking for CeleraAssembler. http://seqanswers.com/forums/showthread.php?t=49846
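
The file list above can be checked mechanically before an import is attempted. The following is a minimal sketch, assuming only the three file patterns named in this post; the function names are hypothetical and are not part of SMRT Analysis.

```python
from fnmatch import fnmatch
from pathlib import Path

# Patterns named in the post above; each should match at least one file.
REQUIRED_PATTERNS = ("*.metadata.xml", "*.bas.h5", "*.bax.h5")


def missing_patterns(file_names, patterns=REQUIRED_PATTERNS):
    """Return the patterns that no name in file_names matches."""
    names = list(file_names)
    return [p for p in patterns if not any(fnmatch(n, p) for n in names)]


def list_file_names(top_dir):
    """Collect bare file names under top_dir, including subdirectories."""
    return [p.name for p in Path(top_dir).rglob("*") if p.is_file()]


# Example call on a run directory (the path is a placeholder):
# print(missing_patterns(list_file_names("/path/to/smrtcell_dir")))
```

An empty list means every pattern matched at least one file; it does not prove that the files are complete or uncorrupted.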
Old 03-14-2015, 03:01 PM   #3
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Yes, the xml, .bas.h5, and .bax.h5 files are all there. It's the short-read Illumina data that I'm trying to place, or rather figure out where to place, within SMRT Portal. I can run HGAP on the SMRT Cells, but we really want to do a hybrid assembly. I can navigate to the Manage and Import section, select managed protocols, and select RS_CeleraAssembler. I then copy this and save it as RS_CeleraAssemblerModified. In the Protocol Preset Details, the "FASTQ to Correct with" field is empty. I have tried entering the name of the FASTQ file that I uploaded to /opt/smrtanalysis/common/references_dropbox, but it does not recognize this file, so I am guessing that it is looking in a different path. I'll look back at the GitHub post again. Thanks.
Old 03-14-2015, 07:39 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Here is a post from Dr. Hall, who works at PacBio, in the thread I linked above: http://seqanswers.com/forums/showpos...46&postcount=5

Hybrid assembly is no longer supported in SMRT Portal v2.3. You may have to install an older version of SMRT Portal if you want to use hybrid assembly.
Old 03-16-2015, 08:07 AM   #5
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

To add: the hybrid assembly support in SMRT Analysis was never great, and I would not recommend going back to a previous version. Hybrid assembly as implemented in PBcR, ECTools, or dbg2olc is going to be much more straightforward and give better results.
Old 03-18-2015, 12:25 PM   #6
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Greetings, I have been trying to run the Celera Assembler using STAR-cluster on Amazon EC2 with some PacBio long reads (1 SMRT Cell) and paired-end Illumina data (100 bp PE). The genomes we are trying to assemble are ~1.6-2.0 Mbp. I am 99% sure I have installed the assembler correctly, as I was able to run one of the example/tutorial assemblies of a small virus (A006, 35 kbp mock genome). I have been able to produce the .frg files for the Illumina data, and I have filtered_subreads.fastq from the sequencing center. However, when I run ./pacBioToCA I keep getting errors that I believe have something to do with the SGE settings (the full command and output are attached as command_line_output):

qsub: illegal -p value
qsub: illegal -c value ""

I was under the impression that these values would be defined in the pacbio.spec file (see attached), but they are not, and I am not sure how to set them. I'm pretty new to the Celera Assembler and to running SGE jobs on STAR, so any comments, suggestions, or hints are welcome.
Attached Files
File Type: txt pacbio.spec.txt (1.1 KB, 2 views)
File Type: txt command_line_output.txt (4.1 KB, 13 views)
Old 03-18-2015, 12:31 PM   #7
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 68
Default

Quote:
Originally Posted by tonybert View Post
qsub: illegal -p value
qsub: illegal -c value ""
Hi Tony,

I'm not familiar with STAR cluster or with which job scheduler it is configured for. However, I can tell you that this particular error message appears because Celera Assembler is configured to run on SGE but, for whatever reason, the qsub binaries in your path belong to PBS.

This is a confusing issue because SGE and PBS share some binary names, which can leave the user unsure which job scheduler is actually installed.
Old 03-18-2015, 12:35 PM   #8
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 68
Default

To illustrate what I mean, see this example:

### SGE is configured as default here (this should fail because I don't actually have your script on hand)
-bash-3.2$ qsub -A assembly -pe threads 4 -cwd -N "pBcR_asm" -j y -o /home/ubuntu/wgs-8.3rc1/Linux-amd64/bin//tempec_pacbio
Unable to read script file because of error: no input read from stdin

### When I add the PBS job scheduler binaries to my PATH (ahead of our SGE binaries) you see the message that you referenced:
-bash-3.2$ export PATH=/opt/pbs/bin:$PATH
-bash-3.2$ qsub -A assembly -pe threads 4 -cwd -N "pBcR_asm" -j y -o /home/ubuntu/wgs-8.3rc1/Linux-amd64/bin//tempec_pacbio
qsub: illegal -p value
qsub: illegal -c value ""
usage: qsub [-a date_time] [-A account_string] [-b secs]
[-c [ none | { enabled | periodic | shutdown |
depth=<int> | dir=<path> | interval=<minutes>}... ]
[-C directive_prefix] [-d path] [-D path]
[-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]
[-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]
[-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path
[-W additional_attributes] [-v variable_list] [-V ] [-x] [-X] [-z] [script]
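
The mismatch shown in the transcript can also be checked by comparing the flags in the failing command against the usage text that qsub prints. This is a minimal sketch, assuming the usage text has already been captured as a string; the function name is hypothetical, and the usage text below is an abridged copy of the PBS output above.

```python
import re


def unsupported_flags(usage_text, flags):
    """Return the flags that the given qsub usage text does not list."""
    missing = []
    for flag in flags:
        # Match the flag as a whole token, so "-p" does not satisfy "-pe".
        pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
        if not re.search(pattern, usage_text):
            missing.append(flag)
    return missing


# Abridged usage text from the PBS qsub shown in the transcript above.
PBS_USAGE = """usage: qsub [-a date_time] [-A account_string] [-b secs]
[-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]
[-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]"""

# Flags taken from the qsub command line in the transcript.
print(unsupported_flags(PBS_USAGE, ["-A", "-pe", "-cwd", "-N", "-j", "-o"]))
```

For this usage text the two flags reported as missing are -pe and -cwd, which is consistent with the "illegal -p value" and "illegal -c value" messages: the command was written for one scheduler and parsed by another.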
Old 03-18-2015, 12:42 PM   #9
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

Out of interest, why are you trying to use Illumina reads for a hybrid assembly? For such a small genome, 1 SMRT Cell of data should be more than enough to assemble the PacBio data de novo.
Old 03-18-2015, 12:44 PM   #10
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Thanks for the prompt response. So it's more of an environment issue?
Old 03-18-2015, 12:51 PM   #11
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Hi rhall, we tried HGAP with the SMRT Cell for the 3 genomes. We were expecting a single contiguous contig after HGAP assembly, but this was not the case. I have to look back at my notes, but I believe we had 6-9 contigs per genome. Additionally, we were under the impression that error-correcting with Illumina reads would lead to higher-quality, higher-accuracy draft genomes. That said, my initial HGAP assembly was run using SMRT Portal (v1.3, I believe) a year or more ago, and I didn't put much effort into tuning the assembly parameters.
Old 03-18-2015, 12:55 PM   #12
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 68
Default

Quote:
Originally Posted by tonybert View Post
Thanks for the prompt response. So it's more of an environment issue?
For the error message that you indicated, yes, it is an environment thing, and from what I can tell you have PBS configured in your environment as the job scheduler instead of SGE.
Celera Assembler (last time I checked: http://wgs-assembler.sourceforge.net...ndex.php/RunCA) is only able to run on SGE or LSF, so PBS would pose a problem.

I would like to echo rhall's sentiment, though: if your genome size is only 1.8-2.0 Mbp, why bother with hybrid assembly at all? With 1 SMRT Cell of data you should have PLENTY of coverage to run HGAP_3 and get a MUCH better result than any hybrid Illumina-PacBio strategy.

Assemble with PacBio and use the Illumina short reads to validate the assembly.

Last edited by gconcepcion; 03-18-2015 at 12:56 PM. Reason: accidentally a word
Old 03-18-2015, 12:58 PM   #13
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

It is extremely unlikely that a hybrid assembly will give better results than an all-PacBio assembly, particularly if the PacBio assembly is not coverage limited (>40x per genome).
http://www.biomedcentral.com/content...-14-9-r101.pdf
Old 03-18-2015, 03:18 PM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

I am in the "PacBio should be enough" camp but, to be fair, the PacBio data tonybert has needs to be of good quality. The "sweet spot" for getting a good assembly appears to be library specific in our hands.

@tonybert: Can you post stats from a "RS_subreads" run for your SMRTcell?
Old 03-18-2015, 07:37 PM   #15
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

I totally agree, the point I wanted to make was that if you do have plenty of coverage of PacBio (likely given the size of the genome), regardless of subread length, a hybrid approach will not improve the assembly as it does not add any long range information. It is the long range information that helps complete assemblies. A hybrid approach only helps when you have limited coverage.
Old 03-19-2015, 12:21 PM   #16
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Hi GenoMax, the output from our PacBio sequencing facility consists of a directory that contains another directory (Analysis Results), filtered_subreads.fastq.gz, DS_Store, and a file with a .xml extension. Analysis Results has all the .bax.h5 and .bas.h5 files. Would RS_subreads be something that comes out of SMRT Portal?

Quote:
Originally Posted by GenoMax View Post
I am in the "PacBio should be enough" camp but to be fair PacBio data tonybert has needs to be of good quality. The "sweet spot" for getting a good assembly appears to be library specific in our hands.

@tonybert: Can you post stats from a "RS_subreads" run for your SMRTcell?
Old 03-19-2015, 12:30 PM   #17
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Hi GenoMax, this is from SMRT Portal:
Mean Subread Length: 2,677
N50: 3,055
Total Number of Bases: 447,142,427
Number of Reads: 166,982

I was trying to run HGAP.2, but the run never completed; it failed.
Old 03-19-2015, 12:41 PM   #18
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322
Default

The subread length will be the limiting factor, but your coverage is sufficient for PacBio only assembly. You should be able to run HGAP.3 on the cloud instance of SMRT Analysis without using Star cluster, although it will require one of the larger computational instances.
For the HGAP.2 run, was that on a cloud instance? Do you have any logs, specific errors?
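
The coverage claim can be checked with simple arithmetic on the statistics posted above. This is a rough sketch, assuming that all bases from the SMRT Cell belong to a single genome of 1.6-2.0 Mbp (the size range given earlier in the thread); if the cell were shared between several genomes, the per-genome figure would be proportionally lower.

```python
def fold_coverage(total_bases, genome_size_bp):
    """Average depth: sequenced bases divided by genome length."""
    return total_bases / genome_size_bp


# Values posted from SMRT Portal earlier in the thread.
TOTAL_BASES = 447_142_427
NUMBER_OF_READS = 166_982

for genome_mbp in (1.6, 2.0):
    coverage = fold_coverage(TOTAL_BASES, genome_mbp * 1_000_000)
    print(f"{genome_mbp} Mbp genome: about {coverage:.0f}x")

# Cross-check against the reported mean subread length.
print(f"bases per read: {TOTAL_BASES / NUMBER_OF_READS:.1f}")
```

Under that assumption the depth comes out at well over 200x, far above the >40x threshold mentioned earlier in the thread, and the bases-per-read figure agrees with the reported mean subread length.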
Old 03-19-2015, 12:48 PM   #19
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

Yes, see attached; this is from the SMRT Analysis portal.
Attached Files
File Type: gz smrtHGAP.2_output.txt.gz (5.2 KB, 2 views)
Old 03-19-2015, 12:56 PM   #20
tonybert
Member
 
Location: seattle

Join Date: Aug 2012
Posts: 38
Default

If the subread length is ~2500, then I should probably decrease the minimum seed length, correct? The default is 6000. Could this be why it's failing?
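
One way to approach the seed-length question is to measure how many bases sit in reads at or above a given cutoff, and to find the largest cutoff that still leaves a chosen depth of seed reads. The sketch below assumes a plain 4-line FASTQ such as filtered_subreads.fastq.gz and a 30x seed-coverage target; the target and the function names are illustrative assumptions, not values or calls taken from HGAP. It cannot show on its own whether the seed cutoff is what made the HGAP.2 run fail.

```python
import gzip


def read_lengths_from_fastq(path):
    """Return the sequence length of every record in a 4-line FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = []
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                lengths.append(len(line.strip()))
    return lengths


def seed_bases(lengths, cutoff):
    """Total bases contained in reads at least `cutoff` long."""
    return sum(n for n in lengths if n >= cutoff)


def seed_cutoff(lengths, genome_size, target_coverage=30):
    """Largest cutoff whose seed reads still reach the target coverage.

    Returns None if even the full read set falls short of the target.
    """
    needed = genome_size * target_coverage
    total = 0
    for n in sorted(lengths, reverse=True):
        total += n
        if total >= needed:
            return n
    return None


# Example use (file name from the thread, genome size assumed to be 2 Mbp):
# lengths = read_lengths_from_fastq("filtered_subreads.fastq.gz")
# print(seed_bases(lengths, 6000) / 2_000_000)   # seed depth at the default
# print(seed_cutoff(lengths, 2_000_000))         # cutoff giving about 30x
```

If the seed depth at the 6000 default turns out to be only a few fold, lowering the cutoff toward the value returned by seed_cutoff would be a reasonable thing to try.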