  • Error correction of PacBio using Illumina? When and how? No reference genome

    Hi all,

    We've sequenced the transcriptome of a species (genome size ~3 Gb; no reference genome is available) using PacBio and Illumina.

    For PacBio, I've got 6 zip files, each with 3 .bax.h5 files inside. (This means I've got 6 SMRT cells, right?)
    library size <1 kb: 1 file
    1–2 kb: 2 files
    2–3 kb: 2 files
    >3 kb: 1 file

    I'm planning to get full-length transcripts from the PacBio data and conduct expression analysis using the Illumina data.

    Following the RS_IsoSeq tutorial, I will take these steps:
    step 1) get reads of insert using ConsensusTools.sh for each zip file
    step 2) get full-length reads using pbtranscript.py classify in RS_IsoSeq
    step 3) merge all output from step (2), and do ICE and Quiver (QV) using pbtranscript.py cluster in RS_IsoSeq (this may take a long time)
    step 4) remove redundant transcripts using the tofu scripts.

    I learned from the literature that subreads are often corrected with NGS short reads, but this error-correction step requires huge compute resources and time.

    Is it necessary to do NGS-based error correction for PacBio data if I have already done ICE and QV? And if I do go for it, at which step should I correct: on the reads of insert (after step 1), on the full-length reads (after step 2), or after ICE and QV (step 3)?
    What would you recommend for error correction? I know of some tools, like LSC, proovread, and PacBioToCA, but I don't know which one runs fastest and needs the least compute.

    When I get full-length transcripts from the PacBio data, is it sufficient to do expression analysis by simply mapping the NGS reads to these transcripts (since no reference genome is available)? Or should I assemble the PacBio and NGS data together? But it seems I cannot do an assembly if I only have transcriptome data, right?

    Thank you very much!

  • #2
    Originally posted by xhuister:
    For PacBio, I've got 6 zip files, each with 3 .bax.h5 files inside. (This means I've got 6 SMRT cells, right?)
    Yes

    Originally posted by xhuister:
    Following the RS_IsoSeq tutorial, I will take these steps:
    step 1) get reads of insert using ConsensusTools.sh for each zip file
    step 2) get full-length reads using pbtranscript.py classify in RS_IsoSeq
    step 3) merge all output from step (2), and do ICE and Quiver (QV) using pbtranscript.py cluster in RS_IsoSeq (this may take a long time)
    step 4) remove redundant transcripts using the tofu scripts.
    1–3: Correct
    4: Since you don't have a genome, tofu won't be able to remove redundant transcripts, because it does so using genomic coordinates. You can use other tools like CD-HIT instead.
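    As a concrete sketch of that CD-HIT step (the input file name and the 99% identity cutoff are my assumptions, not values from this thread; tune -c to the residual error rate you expect after polishing):

    ```shell
#!/bin/sh
# Hedged sketch: collapse near-identical polished transcripts with
# CD-HIT when no genome is available for coordinate-based collapsing.
# "polished_hq.fasta" is a placeholder input name.
INPUT=polished_hq.fasta
OUTPUT=transcripts_nr.fasta

if command -v cd-hit-est >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # -c 0.99: cluster at 99% identity; -T: threads; -M: memory limit (MB)
    cd-hit-est -i "$INPUT" -o "$OUTPUT" -c 0.99 -T 8 -M 16000
    STATUS=ran
else
    STATUS=skipped   # CD-HIT or the input file is missing on this machine
fi
echo "cd-hit step: $STATUS"
    ```

    CD-HIT also writes a `.clstr` file next to the output, listing which input transcripts collapsed into each representative.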


    Originally posted by xhuister:
    I learned from the literature that subreads are often corrected with NGS short reads, but this error-correction step requires huge compute resources and time.
    Exactly

    Originally posted by xhuister:
    Is it necessary to do NGS-based error correction for PacBio data if I have already done ICE and QV?
    No, it won't be necessary.

    Originally posted by xhuister:
    When I get full-length transcripts from the PacBio data, is it sufficient to do expression analysis by simply mapping the NGS reads to these transcripts (since no reference genome is available)? Or should I assemble the PacBio and NGS data together? But it seems I cannot do an assembly if I only have transcriptome data, right?
    You can directly map the Illumina reads to the PacBio transcripts using Bowtie2 (with "-a") and then do quantification with tools like eXpress.
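    A minimal sketch of that mapping + quantification pipeline (all file names are placeholders; `-a` makes Bowtie2 report all valid alignments, which eXpress needs to resolve multi-mapping reads):

    ```shell
#!/bin/sh
# Hedged sketch of Bowtie2 -a + eXpress quantification against the
# PacBio transcript set. File names below are placeholders.
TRANSCRIPTS=final_transcripts.fasta
R1=illumina_R1.fastq.gz
R2=illumina_R2.fastq.gz

if command -v bowtie2 >/dev/null 2>&1 && command -v express >/dev/null 2>&1 \
   && [ -f "$TRANSCRIPTS" ]; then
    bowtie2-build "$TRANSCRIPTS" transcripts_idx
    # -a: report all valid alignments, not just the best one
    bowtie2 -a -x transcripts_idx -1 "$R1" -2 "$R2" -S hits.sam
    express "$TRANSCRIPTS" hits.sam   # writes results.xprs (counts/FPKM)
    STATUS=ran
else
    STATUS=skipped   # tools or inputs missing on this machine
fi
echo "quantification step: $STATUS"
    ```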



    • #3
      Thanks, bowhan, for your detailed answer!

      I've encountered a problem using Quiver; I'm wondering whether I should post it here or start a new thread.

      I'm running ICE and Quiver using pbtranscript.py cluster; I've tried many ways, but all failed.

      (1)
      Code:
      pbtranscript.py cluster "isoseq_flnc.fasta" "final.consensus.fa" \
        --nfl_fa "isoseq_nfl.fasta" -d cluster --ccs_fofn "reads_of_insert.fofn" \
        --bas_fofn "ccs.input.fofn" --cDNA_size between1k2k --quiver \
        --blasr_nproc 24 --quiver_nproc 8
      Here, ccs.input.fofn lists the full paths of the bax.h5 files.
      An error occurred when Quiver started:
      Failure: This .cmp.h5 file lacks some of the QV data tracks that are required for optimal performance of the Quiver algorithm.
      I checked the log and found the command stopped on c42to449.ref.fa (which is one of the split chunks, I guess).

      (2) I googled and found that the cmp.h5 file may be wrong.
      I used the bas.h5 file and pbalign to regenerate the cmp.h5 file.
      Code:
      pbalign --forQuiver /Analysis_Results/xx.bas.h5 \
      cluster/quivered/c42to449.ref.fa \
      cluster/quivered/c42to449_1.cmp.h5
      Then I ran quiver according to the command in the log file:
      Code:
      quiver /cluster/quivered/c42to449.cmp.h5 -v -j8 -r \
      /cluster/quivered/c42to449.ref.fa -o \
      /cluster/quivered/c42to449.quivered.fq
      It does generate c42to449.quivered.fq. But it's odd that the cmp.h5 file from pbalign is 400 MB, while the one from step (1) (and the .sam file) is only 40 MB. I don't know which one is correct.

      (3) So I tried ice_polish.py, following the tutorial on GitHub.
      I ran this command from the ICE output dir (/cluster/):
      Code:
      ice_polish.py --ccs_fofn "/reads_of_insert.fofn" --bas_fofn "/css.input.fofn" \
      --quiver_nproc 8 /cluster/ "/isoseq_nfl.fasta"
      Using the .bas.h5 file directly for the --bas_fofn option fails (though pbalign accepts it):
      error: argument --bas_fofn: invalid validate_fofn value: '....bas.h5'

      If I used a fofn file listing full paths of the .bax.h5 files, it also failed:
      RuntimeError: CMD exited with a non-zero code: bash cluster/quivered/c42to449.sh 1>cluster/log/quivered/c42to449.sh.olog 2>cluster/log/quivered/c42to449.sh.elog,
      Failed to run Quiver
      Error log: cluster/log/quivered/c42to449.sh.elog
      Out log: cluster/log/quivered/c42to449.sh.olog

      If I used a fofn file listing the full path of the .bas.h5 file, it also failed:
      ERROR, file type '.h5' is not understood to be one of pls.h5, fasta, nor fastq.
      Failure: This .cmp.h5 file lacks some of the QV data tracks that are required for optimal performance of the Quiver algorithm.
      For optimal results use the ResequencingQVs workflow in SMRTPortal with bas.h5 files from an instrument using software version 1.3.1 or later, or the --forQuiver option to pbalign.

      In short, I'd like to do ICE and Quiver, following the steps in the tutorial, but the Quiver step does not work.
      1) The ICE step works:
      Code:
      pbtranscript.py cluster "/ccs_output/isoseq_flnc.fasta" \
      "/ccs_output/final.consensus1.fa" \
       -d css.input/ccs_output/clusterOut --cDNA_size between1k2k --blasr_nproc 24
      2) The Quiver step errors out; how should I set the --bas_fofn option?
      Code:
      ice_polish.py --ccs_fofn "/ccs_output/reads_of_insert.fofn" \
      --bas_fofn "/ccs_output/ccs.input.fofn" \
      --quiver_nproc 8  css.input/ccs_output/cluster/ \
      "/ccs_output/isoseq_nfl.fasta"



      • #4
        Are all of your scripts from smrtanalysis? Have you manually installed any components from GitHub? Did you invoke `smrtshell` before running anything?

        If you want to manually re-run alignment + quiver to see where things go wrong, you can find the alignment command in the header section of the sam file (with `samtools view -SH`), and the rest in the script `/quivered/c42to449.sh`, which runs a series of commands: "samtoh5", "loadPulses", "cmph5tools.py", "loadChemistry.py", and "quiver". Among them, "loadPulses" greatly increases the size of the h5 file, so that is probably where things go wrong. I suggest running them individually to see if you can reproduce the error, and using the "which" command to make sure the commands come from smrtanalysis.
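        To make that "which" check quick, a small loop over the pipeline's tools (the list is taken from the chunk script described above) prints where each one resolves from, so a stray copy outside smrtanalysis is easy to spot:

        ```shell
#!/bin/sh
# Report the resolved path of each tool invoked by the quivered/*.sh
# chunk scripts; every path should live under the smrtanalysis install.
count=0
for cmd in samtoh5 loadPulses cmph5tools.py loadChemistry.py quiver; do
    path=$(command -v "$cmd" 2>/dev/null) || path="NOT FOUND"
    echo "$cmd -> $path"
    count=$((count + 1))
done
echo "checked $count commands"
        ```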



        • #5
          Originally posted by bowhan:
          Are all of your scripts from smrtanalysis? Have you manually installed any components from GitHub? Did you invoke `smrtshell` before running anything?
          No, I did not install any other scripts from GitHub. I only installed:
          smrtanalysis_2.3.0.140936.run
          smrtanalysis-patch_2.3.0.140936.p5.run

          Sorry, I don't know what "invoke `smrtshell`" means. I installed SMRT Analysis on an LSF computer cluster and ran the one command "pbtranscript.py cluster..." by submitting a bsub job.


          Originally posted by bowhan:
          If you want to manually re-run alignment + quiver to see where things go wrong, you can find the alignment command in the header section of the sam file (with `samtools view -SH`), and the rest in the script `/quivered/c42to449.sh`, which runs a series of commands: "samtoh5", "loadPulses", "cmph5tools.py", "loadChemistry.py", and "quiver". Among them, "loadPulses" greatly increases the size of the h5 file, so that is probably where things go wrong. I suggest running them individually to see if you can reproduce the error, and using the "which" command to make sure the commands come from smrtanalysis.
          In c42to449.sh:
          Code:
          samtoh5 /cluster/quivered/c42to449.sam /cluster/quivered/c42to449.ref.fa /cluster/quivered/c42to449.cmp.h5 -smrtTitle
          
          gzip /cluster/quivered/c42to449.sam
          
          loadPulses /ccs.input.fofn "/cluster/quivered/c42to449.cmp.h5" -byread -metrics QualityValue,InsertionQV,MergeQV,DeletionQV,DeletionTag,SubstitutionTag,SubstitutionQV
          
          cmph5tools.py sort /cluster/quivered/c42to449.cmp.h5
          
          samtools faidx /cluster/quivered/c42to449.ref.fa
          
          loadChemistry.py /ccs.input.fofn /cluster/quivered/c42to449.cmp.h5
          
          quiver /cluster/quivered/c42to449.cmp.h5 -v -j8 -r /cluster/quivered/c42to449.ref.fa -o /cluster/quivered/c42to449.quivered.fq
          loadPulses reported:
          [INFO] 2016-03-27T08:59:44 [loadPulses] started.
          ' is not understood to be one of pls.h5, fasta, nor fastq.

          When I ran loadPulses using the .bas.h5 file instead of "ccs.input.fofn" (which stored full paths of the .bax.h5 files):
          [INFO] 2016-03-27T09:02:30 [loadPulses] started.
          loading 20714 alignments for movie 1
          loading 20714 alignments for movie 1
          loading 20714 alignments for movie 1
          [INFO] 2016-03-27T09:08:13 [loadPulses] ended.

          Then I got a "c42to449.cmp.h5" file of 268 MB from loadPulses. I'm really confused: the original cmp.h5 file is 40 MB, while the one from "pbalign --forQuiver" is 400 MB.

          The Quiver output fq files from loadPulses's cmp.h5 file and pbalign's cmp.h5 file are not the same. The sequences seem identical if you ignore upper/lower case, but the quality values differ.
          For example:
          One Quiver sequence from the loadPulses cmp.h5:
          @c51|quiver
          caaaccGGGCACTACGGTGAGACGTGAAAACA....
          +
          !!!!!!0?F=HEINPLQQRQQQQQRQHSSRRP....

          The same sequence from the pbalign cmp.h5:
          @c51|quiver
          CAAACCGGGCACTACGGTGAGACGTGAAAAC....
          +
          "QQQPRQRQQPQRSPQQQTSQRRQQPRTTUS....

          How can I decide which cmp.h5 file is correct and should be used?

          Another problem: even though I can manually run the commands in c42to449.sh successfully, c42to449.sh is only one split chunk from the "pbtranscript.py cluster" or "ice_polish.py" command. I cannot run all the other chunks by hand (and the .sh files for the other chunks are not available).

          I guess the problem is how to set the "--bas_fofn" option for ice_polish.py or "pbtranscript.py cluster". In my case, neither the .bas.h5 file nor the "ccs.input.fofn" file works.



          • #6
            smrtshell erases your own environment variables and loads only the ones related to smrtanalysis, thus ensuring the correct scripts are called. It's located in the smrtanalysis directory, under the folder "smrtcmds/bin".

            Can you please try re-running a new job in a clean directory with the command:
            Code:
            /path/to/smrtshell -c "pbtranscript.py rest_of_args..."
            and see if there are any errors.



            • #7
              Thank you, bowhan.

              I happened to find that the problem may be the ccs.input.fofn file, because I wrote this file under Windows, and the error message from pbtranscript.py was:
              ERROR, file type '
              .h5' is not understood to be one of pls.h5, fasta, nor fastq.


              Sorry that in my post at #3, I deleted the whitespace after the ', so the error message became ".h5", which may have misled you. I have now used "ls *.bax.h5 > ccs.input.fofn" to rewrite the fofn file and am testing the script again (it may take a few hours). If this still does not work, I'll try your suggestion of re-running a new job to see what happens.



              • #8
                Thanks for the update.
                Please be sure to include the full path of each file, e.g. with `ls -1 $PWD/*bax.h5` or `ls *bax.h5 | xargs readlink -f`.
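                A throw-away check that the resulting fofn really contains absolute paths to existing files (the .bax.h5 names here are dummies created just for the demo):

                ```shell
#!/bin/sh
# Sketch: build a fofn of absolute paths and verify every line is an
# absolute path to an existing file. Dummy files stand in for real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch m1.1.bax.h5 m1.2.bax.h5 m1.3.bax.h5   # stand-ins for real .bax.h5

ls -1 "$PWD"/*.bax.h5 > ccs.input.fofn

bad=0
while IFS= read -r f; do
    case "$f" in
        /*) [ -f "$f" ] || bad=1 ;;   # absolute but file missing
        *)  bad=1 ;;                  # relative path: the tools will choke
    esac
done < ccs.input.fofn
nlines=$(wc -l < ccs.input.fofn)
echo "fofn lines: $nlines, bad entries: $bad"
                ```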



                • #9
                  Hi bowhan, just to let you know: I finally got the Quiver results. It was my mistake to use the wrong fofn file; it must not be created under Windows, and blank lines are not allowed. (It may be OK to use dos2unix xx.fofn to convert the format; I haven't tested that yet.)
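                  For anyone hitting the same failure: CRLF endings and trailing blank lines in a fofn can be detected and stripped with plain shell tools even without dos2unix. A small self-contained demo (the paths are fake):

                  ```shell
#!/bin/sh
# Demo: simulate a fofn written under Windows (CRLF line endings plus a
# trailing blank line), then clean it the way dos2unix would.
cd "$(mktemp -d)" || exit 1
printf '/data/m1.1.bax.h5\r\n/data/m1.2.bax.h5\r\n\r\n' > win.fofn

# strip carriage returns, then drop empty lines
tr -d '\r' < win.fofn | sed '/^$/d' > clean.fofn

crs=$(tr -cd '\r' < clean.fofn | wc -c)   # leftover CR bytes (want 0)
nlines=$(wc -l < clean.fofn)
echo "CR bytes: $crs, lines: $nlines"
                  ```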

