**Brian Bushnell** · 05-03-2017, 12:23 PM

I would ALWAYS run consensus and use only reads of insert reads rather than filtered subreads, for any purpose.

**rhall** · 05-04-2017, 12:53 AM

There is no need to run circular consensus for de novo genome assembly or scaffolding: the power in the data is in the read length, not in the base accuracy, and calculating CCS drastically reduces the yield of long reads. For confident gap filling, the gap should be spanned by multiple subreads from different ZMWs; if all the subreads spanning a gap are from the same ZMW (molecule), then there is a chance it is a biological chimera and not a real closure.

**Brian Bushnell** · 05-04-2017, 09:22 AM

Self-consensus will make the mapping faster and more accurate, and simplify the problem of detecting subreads that came from the same molecule (since they won't exist anymore). Long reads will not be lost when the Reads Of Insert protocol is set to require no minimum number of passes.

**cstack** · 05-04-2017, 02:28 PM

Originally posted by Brian Bushnell

I would ALWAYS run consensus and use only reads of insert reads rather than filtered subreads, for any purpose.

Thanks much for the reply. When you run CircularConsensus, what quality threshold (or parameters) do you typically use?

**cstack** · 05-04-2017, 02:48 PM

Originally posted by rhall

There is no need to run circular consensus for de novo genome assembly or scaffolding: the power in the data is in the read length, not in the base accuracy, and calculating CCS drastically reduces the yield of long reads.

Thanks much for the reply. I tried running CircularConsensus on a single bax.h5 file, and I see what you mean about reduced read lengths.

Originally posted by rhall

For confident gap filling, the gap should be spanned by multiple subreads from different ZMWs; if all the subreads spanning a gap are from the same ZMW (molecule), then there is a chance it is a biological chimera and not a real closure.

I'll have a look through the PBjelly files and check how many ZMW/subreads support each gap-filling.

As a follow-up, if a gap is filled in part by a particular subread from a particular ZMW should I expect to see all other subreads from that ZMW also fill in the gap?

It seems reasonable to expect that they would, at least in theory. But I was looking into a case where one of my gaps was filled by only 2 of the 7 full-length subreads (9 total subreads) from a single ZMW. For this particular ZMW, the subreads had a relatively low quality (RQ=0.78), and some of the full-length subreads had substantially different lengths (median 7000 bp, min 3000 bp, max 9000 bp). Are there other factors that you'd recommend I check when I evaluate the subreads filling a gap?
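For the ZMW-support check, a minimal Python sketch may help. It assumes the subread names follow the usual PacBio RS `movie/holeNumber/start_end` convention, and that the names of the reads supporting each gap have already been pulled out of the PBJelly output by some other means. The `gap_support` dictionary and its read names are made-up example data, not PBJelly's own format.

```python
from collections import defaultdict

def zmw_of(subread_name):
    """Return (movie, hole_number) for a name like 'movie/1234/0_5000'."""
    movie, hole, _span = subread_name.split("/")[:3]
    return movie, int(hole)

def zmw_support(gap_support):
    """Map each gap id to {zmw: number of supporting subreads}."""
    result = {}
    for gap, names in gap_support.items():
        per_zmw = defaultdict(int)
        for name in names:
            per_zmw[zmw_of(name)] += 1
        result[gap] = dict(per_zmw)
    return result

def single_zmw_gaps(gap_support):
    """Gaps whose supporting subreads all come from one ZMW (one molecule)."""
    support = zmw_support(gap_support)
    return sorted(gap for gap, zmws in support.items() if len(zmws) == 1)

# Hypothetical example data: gap id -> names of subreads that fill it.
gap_support = {
    "gap_1": ["m1/101/0_7000", "m1/101/7050_14000"],
    "gap_2": ["m1/101/0_7000", "m1/2048/0_9000", "m2/77/300_8300"],
}
print(single_zmw_gaps(gap_support))
```

Gaps reported by `single_zmw_gaps` are the ones worth a closer look, since every supporting subread traces back to the same molecule.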

**Brian Bushnell** · 05-04-2017, 03:12 PM

I don't personally run any PacBio software, I just deal with the files downstream, so I don't know the exact flags. But basically, I suggest you run with 0 required passes and a minimum quality of 0, so you don't discard anything. Do the longest reads still disappear in that scenario? According to what I have been told by PacBio, they shouldn't...

**cstack** · 05-05-2017, 09:02 AM

Originally posted by Brian Bushnell

I don't personally run any PacBio software, I just deal with the files downstream, so I don't know the exact flags. But basically, I suggest you run with 0 required passes and a minimum quality of 0, so you don't discard anything. Do the longest reads still disappear in that scenario? According to what I have been told by PacBio, they shouldn't...

I ran CircularConsensus (ConsensusTools v2.3.0.149240) using these options:

Code:

ConsensusTools.sh CircularConsensus -n 16 \
--logFile=test_ccs.log \
--minFullPasses 0 --minPredictedAccuracy 0 \
m160611_100724_42219_c101002732550000001823227509161692_s1_p0.1.bax.h5

And after it finished, this is what was in the log file:

Code:

# 01:00:18 [CircularConsensus] Result Report for the 54494 Zmws processed
# Zmw Result                                            #-Zmws     %-Zmws
# Successful - Quiver consensus found                   8916       16.36 %
# Successful - But only 1 region, no true consensus     16500      30.28 %
# Failed - Exception thrown                             0          0.00 %
# Failed - ZMW was not productive                       28289      51.91 %
# Failed - Outside of SNR ranges                        753        1.38 %
# Failed - No insert regions found                      0          0.00 %
# Failed - Not enough full passes                       0          0.00 %
# Failed - Insert length too small                      0          0.00 %
# Failed - Post POA requirements not met                0          0.00 %
# Failed - CCS Read below predicted accuracy            0          0.00 %
# Failed - CCS Read was palindrome                      36         0.07 %
# Failed - CCS Read below SNR threshold                 0          0.00 %
# Failed - CCS Read too short or long                   0          0.00 %

Looking at the distribution of sequence lengths of the subreads vs the ccs reads, it seems like all of the largest sequences were preserved (see pics attached). So, thanks for the suggestion!
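A comparison like this can be roughly reproduced without plotting. The sketch below assumes both read sets have been exported to plain 4-line FASTQ; the helper names are illustrative and not part of any PacBio tool. It summarizes each set by read count, maximum length, N50 and the few longest reads, which is enough to see whether the longest sequences survived.

```python
def read_lengths_fastq(path):
    """Yield sequence lengths from a FASTQ file with 4 lines per record."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the second line of each record
                yield len(line.strip())

def n50(lengths):
    """Length L such that reads of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0

def summarize(lengths, top=5):
    """Small summary used to compare two read sets side by side."""
    ordered = sorted(lengths, reverse=True)
    return {
        "n": len(ordered),
        "max": ordered[0] if ordered else 0,
        "n50": n50(ordered),
        "longest": ordered[:top],
    }

# Toy lengths standing in for real data (not taken from this run).
subread_lengths = [9000, 7000, 7000, 3000, 12000]
ccs_lengths = [7000, 12000]
print(summarize(subread_lengths))
print(summarize(ccs_lengths))
```

With real files, the two lists would come from something like `list(read_lengths_fastq("subreads.fastq"))`, where the file name is a placeholder.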

The only odd thing from this small experiment was that the number of ZMWs represented in the CCS file (N=25338) was smaller than what is indicated by the output log (N_successful = 8916 + 16500 = 25416), a difference of 78 ZMWs. So, there might be some additional filtering step (maybe based on length?) before the reads are actually output to file.
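The bookkeeping can be double-checked with a few lines of Python. This is only a sketch: it assumes the report lines look exactly like the excerpt above (label, count, percentage), and the regular expression is a guess at that layout rather than anything documented for ConsensusTools. Only the non-zero rows of the report are included.

```python
import re

REPORT = """\
# Successful - Quiver consensus found                   8916       16.36 %
# Successful - But only 1 region, no true consensus     16500      30.28 %
# Failed - ZMW was not productive                       28289      51.91 %
# Failed - Outside of SNR ranges                        753        1.38 %
# Failed - CCS Read was palindrome                      36         0.07 %
"""

# Assumed layout: "# <Successful|Failed> - <label>   <count>   <percent> %"
LINE = re.compile(r"^#\s+(Successful|Failed) - (.*?)\s+(\d+)\s+[\d.]+ %$")

def tally(report):
    """Sum the ZMW counts for the Successful and Failed categories."""
    totals = {"Successful": 0, "Failed": 0}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            totals[match.group(1)] += int(match.group(3))
    return totals

totals = tally(REPORT)
zmws_in_ccs_file = 25338  # count reported for the CCS output file
print(totals)
print(totals["Successful"] - zmws_in_ccs_file)
```

The successful and failed counts together account for all 54494 ZMWs processed, so the shortfall is between the log and the output file, not within the log itself.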

**rhall** · 05-08-2017, 09:02 AM

Originally posted by cstack

As a follow-up, if a gap is filled in part by a particular subread from a particular ZMW should I expect to see all other subreads from that ZMW also fill in the gap?

There will be significant noise in this analysis and I wouldn't expect all subreads to perform the same with regards to the alignment.

Running ROI with 0 passes and minimum quality of 0 is not a recommended protocol and I'm unsure of any use case for this kind of data. CCS data should only be used when a high single molecule accuracy is needed (minor variant detection, 16S, pseudo-gene differentiation). In these cases I also would not recommend using the old ROI pipeline, the new CCS2 algorithm will give much better results.

**Brian Bushnell** · 05-08-2017, 10:14 AM

Originally posted by rhall

Running ROI with 0 passes and minimum quality of 0 is not a recommended protocol and I'm unsure of any use case for this kind of data.

It reduces your volume of data to give fewer, higher-quality short sequences, while keeping the long sequences. This reduces computational requirements and increases accuracy of alignment for any given read, while removing complications due to copies of chimeric molecules being presented as independent. What's not to like? I'd prefer that data for any use-case.

Originally posted by rhall

In these cases I also would not recommend using the old ROI pipeline, the new CCS2 algorithm will give much better results.

I'm not aware of that; I'll have to look into it.

**cstack** · 05-10-2017, 10:45 AM

Originally posted by rhall

There will be significant noise in this analysis and I wouldn't expect all subreads to perform the same with regards to the alignment.

Running ROI with 0 passes and minimum quality of 0 is not a recommended protocol and I'm unsure of any use case for this kind of data. CCS data should only be used when a high single molecule accuracy is needed (minor variant detection, 16S, pseudo-gene differentiation). In these cases I also would not recommend using the old ROI pipeline, the new CCS2 algorithm will give much better results.

Thanks for the info. I'll check out the new consensus algorithm for future work (presumably this is the pbccs / unanimity module).

I do find it appealing to use CCS reads rather than the raw subreads, mostly because the HPC I have access to is relatively old and slow. Also, after I scaffolded and gap-filled with the raw subreads, I realized I was left with a genome in which the spans filled using PacBio data have a higher error rate than the surrounding regions. Perhaps I should have done some pre-correction of the subreads beforehand, but I didn't, and rerunning the process using CCS reads was one option I'd been exploring.

Again, many thanks for your explanations.

Is there any reason not to run RS_ReadsOfInsert?
