Hi all,
We've sequenced the transcriptome of a species (genome size is about 3 Gb) using PacBio and Illumina, but no reference genome is available.
For PacBio, I've got 6 zip files, each with 3 .bax.h5 files in it. (This means I've got 6 SMRT cells, right?)
Library size <1 kb: 1 file
1-2 kb: 2 files
2-3 kb: 2 files
>3 kb: 1 file
I'm planning to get the full-length transcripts from the PacBio data and do expression analysis using the Illumina data.
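To sanity-check the SMRT cell count, I could group the .bax.h5 files by movie name. This is a minimal sketch, assuming the usual RS II layout where each cell produces three files named MOVIE.1.bax.h5 to MOVIE.3.bax.h5. The file names and the helper `group_by_movie` are made up for illustration.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: MOVIE.1.bax.h5, MOVIE.2.bax.h5, MOVIE.3.bax.h5
BAX_PATTERN = re.compile(r"^(?P<movie>.+)\.(?P<part>[123])\.bax\.h5$")

def group_by_movie(paths):
    """Group .bax.h5 paths by movie prefix; one movie should equal one SMRT cell."""
    cells = defaultdict(list)
    for path in paths:
        match = BAX_PATTERN.match(os.path.basename(path))
        if match:  # silently skip anything that is not a .bax.h5 file
            cells[match.group("movie")].append(int(match.group("part")))
    return {movie: sorted(parts) for movie, parts in cells.items()}

# Hypothetical file names, not my real data
example_files = [
    "cell1/movieA_s1_p0.1.bax.h5",
    "cell1/movieA_s1_p0.2.bax.h5",
    "cell1/movieA_s1_p0.3.bax.h5",
    "cell2/movieB_s1_p0.1.bax.h5",
    "cell2/movieB_s1_p0.2.bax.h5",
    "cell2/movieB_s1_p0.3.bax.h5",
]

cells = group_by_movie(example_files)
for movie, parts in cells.items():
    status = "complete" if parts == [1, 2, 3] else "incomplete"
    print(movie, parts, status)
print("SMRT cells found:", len(cells))
```

If every movie name shows parts 1, 2 and 3, the number of movie names should equal the number of SMRT cells.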
Following the RS_IsoSeq tutorial, I plan to run these steps:
step 1) get reads of insert using ConsensusTools.sh for each zip file
step 2) get full-length reads using pbtranscript.py classify in RS_IsoSeq
step 3) merge all output from step 2, then run ICE and QV using pbtranscript.py cluster in RS_IsoSeq (this may take a long time)
step 4) remove redundant transcripts using tofu-scripts.
I've read in many papers that subreads were corrected with NGS short reads. But this error-correction step would require huge computing resources and time.
Is it necessary to do NGS-based error correction on the PacBio data if I have already done ICE and QV? And if I go for it, at which step should I do the correction: on the reads of insert (after step 1), on the full-length reads (after step 2), or after ICE and QV (step 3)?
What would you recommend for error correction? I know there are some tools, like LSC, proovread and PacBioToCA, but I don't know which one runs fastest and needs the least computing resources.
Once I have full-length transcripts from the PacBio data, is it sufficient to do expression analysis by simply mapping the NGS reads to these transcripts (since no reference genome is available)? Or should I assemble the PacBio and NGS data together? It seems that I cannot do a genome assembly if I only have transcriptome data, right?
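For the mapping-only option, what I have in mind is roughly the following: count the Illumina reads assigned to each PacBio transcript and normalise by transcript length to get TPM. This is only a sketch. The transcript IDs, counts, lengths and the `tpm` helper are hypothetical, and it ignores multi-mapping reads, which a dedicated quantification tool would handle.

```python
def tpm(counts, lengths):
    """Convert per-transcript read counts to TPM (transcripts per million).

    counts  -- dict of transcript ID -> number of mapped reads
    lengths -- dict of transcript ID -> transcript length in bases
    """
    # Reads per base for each transcript
    rates = {tid: counts[tid] / lengths[tid] for tid in counts}
    total = sum(rates.values())
    if total == 0:
        return {tid: 0.0 for tid in counts}
    # Scale so that the values sum to one million
    return {tid: rate / total * 1e6 for tid, rate in rates.items()}

# Hypothetical numbers, for illustration only
example_counts = {"transcript_1": 100, "transcript_2": 100, "transcript_3": 0}
example_lengths = {"transcript_1": 1000, "transcript_2": 2000, "transcript_3": 1500}

example_tpm = tpm(example_counts, example_lengths)
for tid, value in sorted(example_tpm.items()):
    print(tid, round(value, 1))
```

With equal read counts, the shorter transcript ends up with the higher TPM, which is why I would not compare raw counts directly.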
Thank you very much!