  • Error correction of PacBio using Illumina? When and how? No reference genome

    Hi all,

    We've sequenced the transcriptome of a species (genome size ~3 Gb; no reference genome is available) using PacBio and Illumina.

    For PacBio, I've got 6 zip files, each with 3 .bax.h5 files inside. (This means I've got 6 SMRT cells, right?)
    library size <1 kb: 1 file
    1–2 kb: 2 files
    2–3 kb: 2 files
    >3 kb: 1 file

    I'm planning to get full-length transcripts from the PacBio data and conduct expression analysis using the Illumina data.

    Following the RS_IsoSeq tutorial, I will take these steps:
    step 1) get reads of insert using ConsensusTools.sh for each zip file
    step 2) get full-length reads using pbtranscript.py classify in RS_IsoSeq
    step 3) merge all output from step (2), and do ICE and Quiver (QV) using pbtranscript.py cluster in RS_IsoSeq (this may take a long time)
    step 4) remove redundant transcripts using the tofu scripts.

    I learned from the literature that subreads are often corrected with NGS short reads, but this error-correction step requires huge compute resources and time.

    Is it necessary to do NGS-based error correction for PacBio data if I have already done ICE and QV? And if I do go for it, at which step should I correct: on the reads of insert (after step 1), on the full-length reads (after step 2), or after ICE and QV (step 3)?
    What would you recommend for error correction? I know of some tools, like LSC, proovread, and PacBioToCA, but I don't know which one runs fastest and needs the least compute.

    When I get full-length transcripts from the PacBio data, is it sufficient to do expression analysis by simply mapping the NGS reads to these transcripts (since no reference genome is available)? Or should I assemble the PacBio and NGS data together? But it seems I cannot do an assembly if I only have transcriptome data, right?

    Thank you very much!

  • #2
    Originally posted by xhuister:
    For PacBio, I've got 6 zip files, each with 3 .bax.h5 files inside. (This means I've got 6 SMRT cells, right?)
    Yes

    Originally posted by xhuister:
    Following the RS_IsoSeq tutorial, I will take these steps:
    step 1) get reads of insert using ConsensusTools.sh for each zip file
    step 2) get full-length reads using pbtranscript.py classify in RS_IsoSeq
    step 3) merge all output from step (2), and do ICE and Quiver (QV) using pbtranscript.py cluster in RS_IsoSeq (this may take a long time)
    step 4) remove redundant transcripts using the tofu scripts.
    1–3: Correct
    4: Since you don't have a genome, tofu won't be able to remove redundant transcripts, because it does so using genomic coordinates. You can use other tools like CD-HIT instead.
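    As a concrete sketch of that CD-HIT step (the input file name and the 99% identity cutoff are my assumptions, not values from this thread; tune -c to the residual error rate you expect after polishing):

    ```shell
#!/bin/sh
# Hedged sketch: collapse near-identical polished transcripts with
# CD-HIT when no genome is available for coordinate-based collapsing.
# "polished_hq.fasta" is a placeholder input name.
INPUT=polished_hq.fasta
OUTPUT=transcripts_nr.fasta

if command -v cd-hit-est >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # -c 0.99: cluster at 99% identity; -T: threads; -M: memory limit (MB)
    cd-hit-est -i "$INPUT" -o "$OUTPUT" -c 0.99 -T 8 -M 16000
    STATUS=ran
else
    STATUS=skipped   # CD-HIT or the input file is missing on this machine
fi
echo "cd-hit step: $STATUS"
    ```

    CD-HIT also writes a `.clstr` file next to the output, listing which input transcripts collapsed into each representative.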


    Originally posted by xhuister:
    I learned from the literature that subreads are often corrected with NGS short reads, but this error-correction step requires huge compute resources and time.
    Exactly

    Originally posted by xhuister:
    Is it necessary to do NGS-based error correction for PacBio data if I have already done ICE and QV?
    No, it won't be necessary.

    Originally posted by xhuister:
    When I get full-length transcripts from the PacBio data, is it sufficient to do expression analysis by simply mapping the NGS reads to these transcripts (since no reference genome is available)? Or should I assemble the PacBio and NGS data together? But it seems I cannot do an assembly if I only have transcriptome data, right?
    You can directly map the Illumina reads to the PacBio transcripts using Bowtie2 (with "-a") and then do quantification with tools like eXpress.
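    A minimal sketch of that mapping + quantification pipeline (all file names are placeholders; `-a` makes Bowtie2 report all valid alignments, which eXpress needs to resolve multi-mapping reads):

    ```shell
#!/bin/sh
# Hedged sketch of Bowtie2 -a + eXpress quantification against the
# PacBio transcript set. File names below are placeholders.
TRANSCRIPTS=final_transcripts.fasta
R1=illumina_R1.fastq.gz
R2=illumina_R2.fastq.gz

if command -v bowtie2 >/dev/null 2>&1 && command -v express >/dev/null 2>&1 \
   && [ -f "$TRANSCRIPTS" ]; then
    bowtie2-build "$TRANSCRIPTS" transcripts_idx
    # -a: report all valid alignments, not just the best one
    bowtie2 -a -x transcripts_idx -1 "$R1" -2 "$R2" -S hits.sam
    express "$TRANSCRIPTS" hits.sam   # writes results.xprs (counts/FPKM)
    STATUS=ran
else
    STATUS=skipped   # tools or inputs missing on this machine
fi
echo "quantification step: $STATUS"
    ```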



    • #3
      Thanks, bowhan, for your detailed answer!

      I've encountered a problem using Quiver; I'm wondering whether I should post it here or start a new thread.

      I'm running ICE and Quiver using pbtranscript.py cluster; I've tried many ways, but all failed.

      (1)
      Code:
      pbtranscript.py cluster "isoseq_flnc.fasta" "final.consensus.fa" \
        --nfl_fa "isoseq_nfl.fasta" -d cluster --ccs_fofn "reads_of_insert.fofn" \
        --bas_fofn "ccs.input.fofn" --cDNA_size between1k2k --quiver \
        --blasr_nproc 24 --quiver_nproc 8
      Here, ccs.input.fofn lists the full paths of the bax.h5 files.
      An error occurred when Quiver started:
      Failure: This .cmp.h5 file lacks some of the QV data tracks that are required for optimal performance of the Quiver algorithm.
      I checked the log and found the command stopped on c42to449.ref.fa (which is one of the split chunks, I guess).

      (2) I googled and found that the cmp.h5 file may be wrong.
      I used the bas.h5 file and pbalign to regenerate the cmp.h5 file.
      Code:
      pbalign --forQuiver /Analysis_Results/xx.bas.h5 \
      cluster/quivered/c42to449.ref.fa \
      cluster/quivered/c42to449_1.cmp.h5
      Then I ran quiver according to the command in the log file:
      Code:
      quiver /cluster/quivered/c42to449.cmp.h5 -v -j8 -r \
      /cluster/quivered/c42to449.ref.fa -o \
      /cluster/quivered/c42to449.quivered.fq
      It does generate c42to449.quivered.fq. But it's odd that the cmp.h5 file from pbalign is 400 MB, while the one from step (1) (and the .sam file) is only 40 MB. I don't know which one is correct.

      (3) So I tried ice_polish.py, following the tutorial on GitHub.
      I ran this command from the ICE output dir (/cluster/):
      Code:
      ice_polish.py --ccs_fofn "/reads_of_insert.fofn" --bas_fofn "/css.input.fofn" \
      --quiver_nproc 8 /cluster/ "/isoseq_nfl.fasta"
      Using the .bas.h5 file directly for the --bas_fofn option fails (though pbalign accepts it):
      error: argument --bas_fofn: invalid validate_fofn value: '....bas.h5'

      If I used a fofn file listing full paths of the .bax.h5 files, it also failed:
      RuntimeError: CMD exited with a non-zero code: bash cluster/quivered/c42to449.sh 1>cluster/log/quivered/c42to449.sh.olog 2>cluster/log/quivered/c42to449.sh.elog,
      Failed to run Quiver
      Error log: cluster/log/quivered/c42to449.sh.elog
      Out log: cluster/log/quivered/c42to449.sh.olog

      If I used a fofn file listing the full path of the .bas.h5 file, it also failed:
      ERROR, file type '.h5' is not understood to be one of pls.h5, fasta, nor fastq.
      Failure: This .cmp.h5 file lacks some of the QV data tracks that are required for optimal performance of the Quiver algorithm.
      For optimal results use the ResequencingQVs workflow in SMRTPortal with bas.h5 files from an instrument using software version 1.3.1 or later, or the --forQuiver option to pbalign.

      In short, I'd like to do ICE and Quiver, following the steps in the tutorial, but the Quiver step does not work.
      1) The ICE step works:
      Code:
      pbtranscript.py cluster "/ccs_output/isoseq_flnc.fasta" \
      "/ccs_output/final.consensus1.fa" \
       -d css.input/ccs_output/clusterOut --cDNA_size between1k2k --blasr_nproc 24
      2) The Quiver step errors out; how should I set the --bas_fofn option?
      Code:
      ice_polish.py --ccs_fofn "/ccs_output/reads_of_insert.fofn" \
      --bas_fofn "/ccs_output/ccs.input.fofn" \
      --quiver_nproc 8  css.input/ccs_output/cluster/ \
      "/ccs_output/isoseq_nfl.fasta"



      • #4
        Are all of your scripts from smrtanalysis? Have you manually installed any components from GitHub? Did you invoke `smrtshell` before running anything?

        If you want to manually re-run alignment + quiver to see where things go wrong, you can find the alignment command in the header section of the sam file (with `samtools view -SH`), and the rest in the script `/quivered/c42to449.sh`, which runs a series of commands: "samtoh5", "loadPulses", "cmph5tools.py", "loadChemistry.py", and "quiver". Among them, "loadPulses" greatly increases the size of the h5 file, so that is probably where things go wrong. I suggest running them individually to see if you can reproduce the error, and using the "which" command to make sure the commands come from smrtanalysis.
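        To make that "which" check quick, a small loop over the pipeline's tools (the list is taken from the chunk script described above) prints where each one resolves from, so a stray copy outside smrtanalysis is easy to spot:

        ```shell
#!/bin/sh
# Report the resolved path of each tool invoked by the quivered/*.sh
# chunk scripts; every path should live under the smrtanalysis install.
count=0
for cmd in samtoh5 loadPulses cmph5tools.py loadChemistry.py quiver; do
    path=$(command -v "$cmd" 2>/dev/null) || path="NOT FOUND"
    echo "$cmd -> $path"
    count=$((count + 1))
done
echo "checked $count commands"
        ```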



        • #5
          Originally posted by bowhan:
          Are all of your scripts from smrtanalysis? Have you manually installed any components from GitHub? Did you invoke `smrtshell` before running anything?
          No, I did not install any other scripts from GitHub. I only installed:
          smrtanalysis_2.3.0.140936.run
          smrtanalysis-patch_2.3.0.140936.p5.run

          Sorry, I don't know what "invoke `smrtshell`" means. I installed SMRT Analysis on an LSF computer cluster and ran the one command "pbtranscript.py cluster..." by submitting a bsub job.


          Originally posted by bowhan:
          If you want to manually re-run alignment + quiver to see where things go wrong, you can find the alignment command in the header section of the sam file (with `samtools view -SH`), and the rest in the script `/quivered/c42to449.sh`, which runs a series of commands: "samtoh5", "loadPulses", "cmph5tools.py", "loadChemistry.py", and "quiver". Among them, "loadPulses" greatly increases the size of the h5 file, so that is probably where things go wrong. I suggest running them individually to see if you can reproduce the error, and using the "which" command to make sure the commands come from smrtanalysis.
          In c42to449.sh:
          Code:
          samtoh5 /cluster/quivered/c42to449.sam /cluster/quivered/c42to449.ref.fa /cluster/quivered/c42to449.cmp.h5 -smrtTitle
          
          gzip /cluster/quivered/c42to449.sam
          
          loadPulses /ccs.input.fofn "/cluster/quivered/c42to449.cmp.h5" -byread -metrics QualityValue,InsertionQV,MergeQV,DeletionQV,DeletionTag,SubstitutionTag,SubstitutionQV
          
          cmph5tools.py sort /cluster/quivered/c42to449.cmp.h5
          
          samtools faidx /cluster/quivered/c42to449.ref.fa
          
          loadChemistry.py /ccs.input.fofn /cluster/quivered/c42to449.cmp.h5
          
          quiver /cluster/quivered/c42to449.cmp.h5 -v -j8 -r /cluster/quivered/c42to449.ref.fa -o /cluster/quivered/c42to449.quivered.fq
          loadPulses reported:
          [INFO] 2016-03-27T08:59:44 [loadPulses] started.
          ' is not understood to be one of pls.h5, fasta, nor fastq.

          When I ran loadPulses using the .bas.h5 file instead of "ccs.input.fofn" (which stored full paths of the .bax.h5 files):
          [INFO] 2016-03-27T09:02:30 [loadPulses] started.
          loading 20714 alignments for movie 1
          loading 20714 alignments for movie 1
          loading 20714 alignments for movie 1
          [INFO] 2016-03-27T09:08:13 [loadPulses] ended.

          Then I got a "c42to449.cmp.h5" file of 268 MB from loadPulses. I'm really confused: the original cmp.h5 file is 40 MB, while the one from "pbalign --forQuiver" is 400 MB.

          The Quiver output fq files from loadPulses's cmp.h5 file and pbalign's cmp.h5 file are not the same. The sequences seem identical if you ignore upper/lower case, but the quality values differ.
          For example:
          One Quiver sequence from the loadPulses cmp.h5:
          @c51|quiver
          caaaccGGGCACTACGGTGAGACGTGAAAACA....
          +
          !!!!!!0?F=HEINPLQQRQQQQQRQHSSRRP....

          The same sequence from the pbalign cmp.h5:
          @c51|quiver
          CAAACCGGGCACTACGGTGAGACGTGAAAAC....
          +
          "QQQPRQRQQPQRSPQQQTSQRRQQPRTTUS....

          How can I decide which cmp.h5 file is correct and should be used?

          Another problem: even though I can manually run the commands in c42to449.sh successfully, c42to449.sh is only one split chunk from the "pbtranscript.py cluster" or "ice_polish.py" command. I cannot run all the other chunks by hand (and the .sh files for the other chunks are not available).

          I guess the problem is how to set the "--bas_fofn" option for ice_polish.py or "pbtranscript.py cluster". In my case, neither the .bas.h5 file nor the "ccs.input.fofn" file works.



          • #6
            smrtshell erases your own environment variables and loads only the ones related to smrtanalysis, thus ensuring the correct scripts are called. It's located in the smrtanalysis directory, under the folder "smrtcmds/bin".

            Can you please try re-running a new job in a clean directory with the command:
            Code:
            /path/to/smrtshell -c "pbtranscript.py rest_of_args..."
            and see if there are any errors.



            • #7
              Thank you, bowhan.

              I happened to find that the problem may be the ccs.input.fofn file, because I wrote this file under Windows, and the error message from pbtranscript.py was:
              ERROR, file type '
              .h5' is not understood to be one of pls.h5, fasta, nor fastq.


              Sorry that in my post at #3, I deleted the whitespace after the ', so the error message became ".h5", which may have misled you. I have now used "ls *.bax.h5 > ccs.input.fofn" to rewrite the fofn file and am testing the script again (it may take a few hours). If this still does not work, I'll try your suggestion of re-running a new job to see what happens.



              • #8
                Thanks for the update.
                Please be sure to include the full path of each file, e.g. with `ls -1 $PWD/*bax.h5` or `ls *bax.h5 | xargs readlink -f`.
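                A throw-away check that the resulting fofn really contains absolute paths to existing files (the .bax.h5 names here are dummies created just for the demo):

                ```shell
#!/bin/sh
# Sketch: build a fofn of absolute paths and verify every line is an
# absolute path to an existing file. Dummy files stand in for real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch m1.1.bax.h5 m1.2.bax.h5 m1.3.bax.h5   # stand-ins for real .bax.h5

ls -1 "$PWD"/*.bax.h5 > ccs.input.fofn

bad=0
while IFS= read -r f; do
    case "$f" in
        /*) [ -f "$f" ] || bad=1 ;;   # absolute but file missing
        *)  bad=1 ;;                  # relative path: the tools will choke
    esac
done < ccs.input.fofn
nlines=$(wc -l < ccs.input.fofn)
echo "fofn lines: $nlines, bad entries: $bad"
                ```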



                • #9
                  Hi bowhan, just to let you know: I finally got the Quiver results. It was my mistake to use the wrong fofn file; it must not be created under Windows, and blank lines are not allowed. (It may be OK to use dos2unix xx.fofn to convert the format; I haven't tested that yet.)
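                  For anyone hitting the same failure: CRLF endings and trailing blank lines in a fofn can be detected and stripped with plain shell tools even without dos2unix. A small self-contained demo (the paths are fake):

                  ```shell
#!/bin/sh
# Demo: simulate a fofn written under Windows (CRLF line endings plus a
# trailing blank line), then clean it the way dos2unix would.
cd "$(mktemp -d)" || exit 1
printf '/data/m1.1.bax.h5\r\n/data/m1.2.bax.h5\r\n\r\n' > win.fofn

# strip carriage returns, then drop empty lines
tr -d '\r' < win.fofn | sed '/^$/d' > clean.fofn

crs=$(tr -cd '\r' < clean.fofn | wc -c)   # leftover CR bytes (want 0)
nlines=$(wc -l < clean.fofn)
echo "CR bytes: $crs, lines: $nlines"
                  ```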

