SEQanswers > Pacific Biosciences

05-03-2017, 12:31 PM   #1
cstack (Member, Florida, US)

Is there any reason not to run RS_ReadsOfInsert?

I've included some background info after the questions, which come first in case of TL;DR.

Questions:
1. Is it generally best practice to run CircularConsensus on SMRT cell DNA-seq data before doing an analysis such as scaffolding or genome assembly?

2. Under what circumstances would you not run CircularConsensus?


Background:
Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequence data from an RS II (P6-C4 chemistry, I think). We estimate this represents ~20x coverage of the plant's genome. All of the initial processing was done by the collaborators. Their last step was filtering of the subreads, which have a post-filter N50 of around 8,000 bp.

My Goal:
I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.

Results:
I ran PBjelly on the uncorrected subreads, providing the different Analysis_results directories for each SMRT cell. The results seem very good: about half of the gaps were filled and the scaffold N50 increased by 20%.

But I suspect that some of the filled gaps, especially those in repetitive regions, are not correct. Looking at the subread placement over each gap (produced by PBjelly), I noticed that some gaps were filled, for example, by a minority (N=2) of the total subreads (N=9) from a single ZMW. There were a few instances like this. It occurred to me that maybe it was a mistake to use the subreads rather than consensus sequences.
05-03-2017, 01:23 PM   #2
Brian Bushnell (Super Moderator, Walnut Creek, CA)

I would ALWAYS run consensus and use Reads of Insert rather than filtered subreads, for any purpose.
05-04-2017, 01:53 AM   #3
rhall (Senior Member, San Francisco)

There is no need to run circular consensus for de novo genome assembly or scaffolding; the power in the data is in the read length, not the base accuracy, and calculating CCS drastically reduces the yield of long reads. For confident gap filling, the gap should be spanned by multiple subreads from different ZMWs; if all the subreads spanning a gap come from the same ZMW (molecule), there is a chance it is a biological chimera and not a real closure.
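
As a rough illustration, one way to tally this after mapping the subreads back to the draft assembly is to count distinct ZMWs among the reads crossing a filled gap. This is only a sketch: the BAM name and gap coordinates are placeholders, and it assumes an indexed BAM and the standard movie/zmw/start_end subread naming.

Code:
# Count distinct ZMWs among subreads aligned across one filled gap.
# subreads_vs_draft.bam and the region are placeholders.
samtools view subreads_vs_draft.bam scaffold_12:153000-154500 \
    | cut -f1 \
    | awk -F'/' '{print $1 "/" $2}' \
    | sort -u \
    | wc -l
A gap supported by only one distinct ZMW would be a candidate for the chimera scenario described above.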
05-04-2017, 10:22 AM   #4
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Self-consensus will make the mapping faster and more accurate, and it simplifies the problem of detecting subreads that came from the same molecule (since they won't exist anymore). Long reads will not be lost if the Reads of Insert protocol is set to require no minimum number of passes.
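
A minimal way to see that collapsing effect, assuming FASTA output and the usual movie/zmw/... read naming (both file names below are placeholders):

Code:
# Reads per ZMW before consensus (multi-pass molecules show counts > 1)...
grep '^>' filtered_subreads.fasta | sed 's/^>//' | awk -F'/' '{print $1 "/" $2}' | sort | uniq -c | sort -rn | head
# ...and after consensus, where each ZMW should contribute at most one read.
grep '^>' reads_of_insert.fasta | sed 's/^>//' | awk -F'/' '{print $1 "/" $2}' | sort | uniq -c | sort -rn | head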
05-04-2017, 03:28 PM   #5
cstack (Member, Florida, US)

Quote:
Originally Posted by Brian Bushnell
I would ALWAYS run consensus and use Reads of Insert rather than filtered subreads, for any purpose.
Thanks much for the reply. When you run CircularConsensus, what quality threshold (or parameters) do you typically use?
05-04-2017, 03:48 PM   #6
cstack (Member, Florida, US)

Quote:
Originally Posted by rhall
There is no need to run circular consensus for de novo genome assembly or scaffolding; the power in the data is in the read length, not the base accuracy, and calculating CCS drastically reduces the yield of long reads.
Thanks much for the reply. I tried running CircularConsensus on a single bax.h5 file, and I see what you mean about reduced read lengths.

Quote:
Originally Posted by rhall
For confident gap filling, the gap should be spanned by multiple subreads from different ZMWs; if all the subreads spanning a gap come from the same ZMW (molecule), there is a chance it is a biological chimera and not a real closure.
I'll have a look through the PBjelly files and check how many ZMW/subreads support each gap-filling.

As a follow-up, if a gap is filled in part by a particular subread from a particular ZMW, should I expect to see all other subreads from that ZMW also fill in the gap?

It seems reasonable to expect that they would, at least in theory. But I was looking into a case where one of my gaps was filled by 2 of the 7 full-length subreads (9 subreads in total) from a single ZMW. For this particular ZMW, the subreads had a relatively low read quality (RQ = 0.78), and some of the full-length subreads had substantially different lengths (median 7,000 bp, min 3,000 bp, max 9,000 bp). Are there other factors you'd recommend I check when I evaluate the subreads filling a gap?
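
For reference, here is roughly how per-ZMW subread lengths can be pulled straight from the read names, since the last field of a subread name is its start_end coordinates in the polymerase read (the file name and ZMW number below are just placeholders):

Code:
# Print subread lengths for a single ZMW, derived from the start_end field.
# filtered_subreads.fasta and ZMW 54321 are placeholders.
grep '^>' filtered_subreads.fasta \
    | grep '/54321/' \
    | awk -F'/' '{split($3, c, "_"); print c[2] - c[1]}'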
05-04-2017, 04:12 PM   #7
Brian Bushnell (Super Moderator, Walnut Creek, CA)

I don't personally run any PacBio software, I just deal with the files downstream, so I don't know the exact flags. But basically, I suggest you run with 0 required passes and a minimum quality of 0, so you don't discard anything. Do the longest reads still disappear in that scenario? According to what I have been told by PacBio, they shouldn't...
05-05-2017, 10:02 AM   #8
cstack (Member, Florida, US)

Quote:
Originally Posted by Brian Bushnell
I don't personally run any PacBio software, I just deal with the files downstream, so I don't know the exact flags. But basically, I suggest you run with 0 required passes and a minimum quality of 0, so you don't discard anything. Do the longest reads still disappear in that scenario? According to what I have been told by PacBio, they shouldn't...
I ran CircularConsensus (ConsensusTools v2.3.0.149240) using these options:

Code:
# Run CircularConsensus with no minimum number of full passes and no
# predicted-accuracy cutoff, so that no reads are discarded.
ConsensusTools.sh CircularConsensus -n 16 \
    --logFile=test_ccs.log \
    --minFullPasses 0 --minPredictedAccuracy 0 \
    m160611_100724_42219_c101002732550000001823227509161692_s1_p0.1.bax.h5
And after it finished, this is what was in the log file:

Code:
# 01:00:18 [CircularConsensus] Result Report for the 54494 Zmws processed
# Zmw Result                                            #-Zmws     %-Zmws
# Successful - Quiver consensus found                   8916       16.36 %
# Successful - But only 1 region, no true consensus     16500      30.28 %
# Failed - Exception thrown                             0          0.00 %
# Failed - ZMW was not productive                       28289      51.91 %
# Failed - Outside of SNR ranges                        753        1.38 %
# Failed - No insert regions found                      0          0.00 %
# Failed - Not enough full passes                       0          0.00 %
# Failed - Insert length too small                      0          0.00 %
# Failed - Post POA requirements not met                0          0.00 %
# Failed - CCS Read below predicted accuracy            0          0.00 %
# Failed - CCS Read was palindrome                      36         0.07 %
# Failed - CCS Read below SNR threshold                 0          0.00 %
# Failed - CCS Read too short or long                   0          0.00 %
Looking at the distribution of sequence lengths of the subreads vs. the CCS reads, it seems like all of the longest sequences were preserved (see attached images). So, thanks for the suggestion!

The only odd thing from this small experiment was that the number of ZMWs represented in the CCS file (N=25,338) was smaller than what is indicated by the output log (N_successful = 8,916 + 16,500 = 25,416). So, there might be some additional filtering step (maybe based on length?) before the reads are actually written to file.
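
A simple way to recount, assuming FASTA output and the usual movie/zmw/... naming (the file name below is a placeholder):

Code:
# Count distinct ZMWs represented in the consensus output.
# test_ccs.fasta is a placeholder for the actual output file.
grep '^>' test_ccs.fasta | sed 's/^>//' | awk -F'/' '{print $1 "/" $2}' | sort -u | wc -l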
Attached images: subreads.png and ccs.png (read-length distributions of the subreads and CCS reads).
05-08-2017, 10:02 AM   #9
rhall (Senior Member, San Francisco)

Quote:
As a follow-up, if a gap is filled in part by a particular subread from a particular ZMW, should I expect to see all other subreads from that ZMW also fill in the gap?
There will be significant noise in this analysis, and I wouldn't expect all subreads to perform the same with regard to the alignment.

Running ROI with 0 passes and a minimum quality of 0 is not a recommended protocol, and I'm unsure of any use case for this kind of data. CCS data should only be used when high single-molecule accuracy is needed (minor variant detection, 16S, pseudogene differentiation). In those cases I also would not recommend using the old ROI pipeline; the new CCS2 algorithm will give much better results.
05-08-2017, 11:14 AM   #10
Brian Bushnell (Super Moderator, Walnut Creek, CA)

Quote:
Originally Posted by rhall
Running ROI with 0 passes and a minimum quality of 0 is not a recommended protocol, and I'm unsure of any use case for this kind of data.
It reduces your volume of data, giving fewer, higher-quality short sequences while keeping the long sequences. This reduces computational requirements and increases alignment accuracy for any given read, while removing complications from copies of chimeric molecules being presented as independent. What's not to like? I'd prefer that data for any use case.

Quote:
In those cases I also would not recommend using the old ROI pipeline; the new CCS2 algorithm will give much better results.
I'm not aware of that; I'll have to look into it.