  • Importing Illumina data onto SMRT portal

    Greetings, I am trying to run the RS_CeleraAssembler protocol on the SMRT Portal. I set this up the "easy" way using Amazon cloud computing. I am able to move the PacBio data onto the cloud, but I cannot get the Illumina data on, and I am not sure where it should go. I have tried placing it with the reference sequence files (although they are not "reference genomes"), but this fails. Any suggestions would be useful.

  • #2
    Do you have original data for the SMRTcell(s) you are trying to import?

    Here is a list of minimal files you need (*.metadata.xml file and all *.bax.h5 and *.bas.h5): https://github.com/PacificBioscience...to-SMRT-Portal

    BTW: See this if you are specifically looking for CeleraAssembler. http://seqanswers.com/forums/showthread.php?t=49846
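
A quick way to confirm that a SMRT Cell directory contains those file types before importing is sketched below. The function name `check_smrtcell_files` and the example path are made up for illustration; only the three file patterns come from the list above. The search is recursive because no particular directory layout is assumed.

```shell
# Hypothetical helper (not part of SMRT Analysis): report which of the
# required SMRT Cell file types are missing anywhere under a directory.
# Prints one "missing: PATTERN" line per absent type and returns non-zero
# if at least one type is absent.
check_smrtcell_files() {
    local dir=$1 pattern missing=0
    for pattern in '*.metadata.xml' '*.bas.h5' '*.bax.h5'; do
        # Only the existence of a match matters, so keep the first hit only.
        if [ -z "$(find "$dir" -type f -name "$pattern" | head -n 1)" ]; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage would look like `check_smrtcell_files /path/to/smrtcell_dir && echo "all three file types found"`. This only checks that the files exist, not that they are intact or belong to the same SMRT Cell.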

    • #3
      Yes, the xml, .bas.h5, and .bax.h5 files are all there. It's the short Illumina data that I'm trying to place, or rather figure out where to place, within the SMRT Portal. I can run HGAP on the SMRT Cells, but we really want to do a hybrid assembly. I can navigate to the Manage and Import section, select managed protocols, and select RS_CeleraAssembler. I then copy this and save it as RS_CeleraAssemblerModified. In the Protocol Preset Details, the "FASTQ to Correct with" field is empty. I have tried entering the name of the FASTQ file that I uploaded to /opt/smrtanalysis/common/references_dropbox, but it will not recognize this file. So I am guessing that the path it is looking for is different from this. I'll look back at the GitHub post again. Thanks.

      • #4
        Here is a post from Dr. Hall who works at PacBio from the thread I linked above: http://seqanswers.com/forums/showpos...46&postcount=5

        Hybrid assembly is no longer supported in SMRTportal v.2.3. You may have to install/use an older version of SMRTportal if you want to use hybrid assembly.

        • #5
          To add: the hybrid assembly support in SMRT Analysis was never great, and I would not recommend going back to a previous version. Hybrid assembly as implemented in PBcR, ECTools, or dbg2olc is going to be much more straightforward and give better results.

          • #6
            Greetings, I have been trying to run the Celera Assembler using STAR-cluster on Amazon EC2 on some PacBio long reads (1 SMRT Cell) and paired-end Illumina data (100 bp PE). The genomes we are trying to assemble are ~1.6-2.0 Mbp. I am 99% sure I have installed the assembler correctly, as I was able to perform one of the example/tutorial assemblies of a small virus (A006, 35 kbp mock genome). I have been able to produce the .frg files for the Illumina data, and I have filtered_subreads.fastq from the sequencing center. However, when I run ./pacBioToCA I keep getting errors that I believe have something to do with SGE conditions (the full command/output is attached, command_line_output):

            qsub: illegal -p value
            qsub: illegal -c value ""

            I was under the impression that these values would be defined in the pacbio.spec file (see attached), but they are not, and I am not sure how to modify them. I'm pretty new to the Celera Assembler and to running SGE jobs on STAR, so any comments/suggestions/hints are welcome.

            • #7
              Originally posted by tonybert:
              qsub: illegal -p value
              qsub: illegal -c value ""
              Hi Tony,

              I'm not familiar with STAR cluster, or with which job schedulers it is configured for. However, I can tell you that this particular error message appears because CeleraAssembler is configured to run on SGE but, for whatever reason, the binaries that are in your path are for PBS.

              This is a confusing issue because SGE and PBS share some binary names, which can leave the user unsure which job scheduler is actually installed.

              • #8
                To illustrate what I mean, see this example:

                ### SGE is configured as default here (this should fail because I don't actually have your script on hand)
                -bash-3.2$ qsub -A assembly -pe threads 4 -cwd -N "pBcR_asm" -j y -o /home/ubuntu/wgs-8.3rc1/Linux-amd64/bin//tempec_pacbio
                Unable to read script file because of error: no input read from stdin

                ### When I add the PBS job scheduler binaries to my PATH (ahead of our SGE binaries) you see the message that you referenced:
                -bash-3.2$ export PATH=/opt/pbs/bin:$PATH
                -bash-3.2$ qsub -A assembly -pe threads 4 -cwd -N "pBcR_asm" -j y -o /home/ubuntu/wgs-8.3rc1/Linux-amd64/bin//tempec_pacbio
                qsub: illegal -p value
                qsub: illegal -c value ""
                usage: qsub [-a date_time] [-A account_string] [-b secs]
                [-c [ none | { enabled | periodic | shutdown |
                depth=<int> | dir=<path> | interval=<minutes>}... ]
                [-C directive_prefix] [-d path] [-D path]
                [-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]
                [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]
                [-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path
                [-W additional_attributes] [-v variable_list] [-V ] [-x] [-X] [-z] [script]
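
One rough way to see which scheduler each `qsub` on your PATH belongs to is to look for companion tools installed next to it: `qconf` ships with SGE, and `pbsnodes` ships with PBS/Torque. The sketch below assumes those tools sit in the same directory as `qsub`, which is common but not guaranteed, so treat the result as a hint. The function names are made up for illustration.

```shell
# Heuristic: guess the scheduler family for the tools in one directory.
#   qconf    -> SGE (Grid Engine)
#   pbsnodes -> PBS / Torque
guess_scheduler() {
    local dir=$1
    if [ -x "$dir/qconf" ]; then
        echo sge
    elif [ -x "$dir/pbsnodes" ]; then
        echo pbs
    else
        echo unknown
    fi
}

# List every qsub on PATH in lookup order, with the guessed scheduler.
# The first line printed is the qsub that actually runs.
list_qsubs() {
    local dir
    local IFS=:
    for dir in $PATH; do
        if [ -n "$dir" ] && [ -x "$dir/qsub" ]; then
            printf '%s\t%s\n' "$dir/qsub" "$(guess_scheduler "$dir")"
        fi
    done
    return 0
}
```

Running `list_qsubs` after sourcing these functions prints one line per `qsub` found. If a PBS directory is listed ahead of the SGE one, reordering PATH so the SGE directory comes first should make the SGE `qsub` the one that gets picked up.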

                • #9
                  Out of interest, why are you trying to use Illumina reads for a hybrid assembly? For such a small genome, 1 SMRT Cell of data should be more than enough to assemble the PacBio data de novo.

                  • #10
                    Thanks for the prompt response. So it's more of an environment issue?

                    • #11
                      Hi rhall, we tried HGAP with the SMRT Cell for the 3 genomes. We were expecting to get a single contiguous contig after HGAP assembly. This was not the case. I have to look back at my notes, but I believe we had between 6 and 9 contigs per genome. Additionally, we were under the impression that error-correcting with Illumina would lead to higher-quality, higher-accuracy draft genomes. That said, my initial HGAP assembly was run using the SMRT Portal (v1.3, I believe) a year or more ago, and I didn't really put much effort into tuning the assembly parameters.

                      • #12
                        Originally posted by tonybert:
                        Thanks for the prompt response. So it's more of an environment issue?
                        For the error message that you indicated, yes, it is an environment thing, and from what I can tell you have PBS configured in your environment as the job scheduler instead of SGE.
                        Celera Assembler (last time I checked: http://wgs-assembler.sourceforge.net...ndex.php/RunCA) is only able to run on SGE or LSF, so PBS would pose a problem.

                        I would like to echo rhall's sentiment, though: if your genome size is only 1.8-2.0 Mbp, why are you even bothering with hybrid assembly? If you have 1 SMRT Cell of data, you should have PLENTY of excess coverage to run HGAP_3 and get a MUCH better result than any hybrid Illumina-PacBio strategy.

                        Assemble with PacBio and use the Illumina short reads to validate the assembly.
                        Last edited by gconcepcion; 03-18-2015, 12:56 PM. Reason: accidentally a word

                        • #13
                          It is extremely unlikely that a hybrid assembly will give better results than an all-PacBio assembly, particularly if the PacBio assembly is not coverage limited (>40x per genome).
                          http://www.biomedcentral.com/content...-14-9-r101.pdf
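
The coverage figure mentioned above is total sequenced bases divided by genome size. A minimal sketch for estimating it from a FASTQ file follows, assuming exactly 4 lines per record (no wrapped sequence lines) and an integer genome size in bp. `fastq_coverage` is a hypothetical helper, not part of any PacBio tool.

```shell
# Estimate fold coverage from a 4-line-per-record FASTQ file.
# usage: fastq_coverage FILE GENOME_SIZE_BP
fastq_coverage() {
    # Reject a missing, non-integer, or non-positive genome size.
    if ! [ "$2" -gt 0 ] 2>/dev/null; then
        echo "genome size must be a positive integer" >&2
        return 1
    fi
    # Line 2 of every 4-line record is the sequence; sum its length.
    LC_ALL=C awk -v g="$2" '
        NR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }
    ' "$1"
}
```

For the data in this thread, that would be something like `fastq_coverage filtered_subreads.fastq 2000000`, using the upper genome size estimate of 2.0 Mbp. This counts all subread bases, so it says nothing about read quality or length distribution.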

                          • #14
                            I am in the "PacBio should be enough" camp but, to be fair, the PacBio data tonybert has needs to be of good quality. In our hands, the "sweet spot" for getting a good assembly appears to be library specific.

                            @tonybert: Can you post stats from a "RS_subreads" run for your SMRTcell?

                            • #15
                              I totally agree. The point I wanted to make was that if you do have plenty of PacBio coverage (likely given the size of the genome), then regardless of subread length, a hybrid approach will not improve the assembly because it does not add any long-range information. It is the long-range information that helps complete assemblies. A hybrid approach only helps when you have limited coverage.
