SEQanswers

Old 09-05-2014, 01:32 AM   #1
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
How to create CCS from subreads without smrtcell data?

Hi everyone,

I have a small problem at the moment.

From our collaborators I got
- 1 assembled genome
- the related PacBio subreads

I did NOT get the smrtcell data (I could ask, but I'd rather not if there's an easy way around it).

For my genome submission I now need to calculate the coverage of my genome. Mapping the subreads to the genome directly will not give an accurate result, so I'd like to create CCS reads from the subreads.
I assume the "RS_subreads.1" protocol in SMRT Portal would do something like this, but I cannot even test that: without the related smrtcell data I cannot import the subreads.

Does anyone have any idea how I could solve this without handling the smrtcell data?
Old 09-05-2014, 03:22 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053

https://github.com/PacificBiosciences/pbdagcon may be able to do this, but you will have to generate blasr alignments for your reads.

It should be a (relatively) trivial task for the sequence provider to run the "RS_ReadsOfInsert" protocol on the SMRTcells and generate the data you need. Try asking them.
Old 09-05-2014, 07:33 AM   #3
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98

Thanks !

The collaborators are a bit complicated to deal with, so I'd like to work around them.

BLASR alignments are not a problem.
Installing pbdagcon is, though.
The organization of the folders seems to be messed up: the build doesn't find the header/cpp files where it expects them (most likely because some folders are not included in the default git download, or in the clone), and it just doesn't compile. I've moved the files around for a while and edited in another compiler flag (it was complaining about some conversion), but I'm not getting anywhere.

Maybe I'm doing something wrong, though.
Running make in the top-level directory of the download doesn't do anything, and in the cpp directory the problems begin.

Has anyone tested if the download compiles on another machine?
Old 09-05-2014, 12:26 PM   #4
mjhsieh
Member
 
Location: USA

Join Date: Jan 2013
Posts: 10

Quote:
Originally Posted by bastianwur View Post
Has anyone tested if the download compiles on another machine?
You can try downloading it again from https://github.com/PacificBiosciences/pbdagcon since it just got updated.
Old 09-05-2014, 12:49 PM   #5
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 68

Quote:
Originally Posted by bastianwur View Post
Hi everyone,

I have a small problem at the moment.

From our collaborators I got
- 1 assembled genome
- the related PacBio subreads

I did NOT get the smrtcell data (I could ask, but I'd rather not if there's an easy way around it).

For my genome submission I now need to calculate the coverage of my genome. Mapping the subreads to the genome directly will not give an accurate result, so I'd like to create CCS reads from the subreads.
I assume the "RS_subreads.1" protocol in SMRT Portal would do something like this, but I cannot even test that: without the related smrtcell data I cannot import the subreads.

Does anyone have any idea how I could solve this without handling the smrtcell data?
If you want high-quality CCS, you need to start from the SMRTCell data. Without it, you are running the consensus-calling algorithm Quiver without the quality value data (InsertionQV, DeletionQV, SubstitutionQV, and MergeQV) it needs to generate highly accurate consensus calls.
Old 09-05-2014, 01:30 PM   #6
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 322

If I am correct in understanding that you want the coverage of single molecules (inserts rather than subread coverage), why not just select the longest subread from each read and map those against the genome? The accuracy gain from computing a consensus of the subreads from one insert (with either pbdagcon or CCS) will not make a significant difference to the mapping, and the consensus is best calculated from all the subreads using Quiver anyway.
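As a quick sanity check on the arithmetic: once one representative subread per molecule is mapped, mean coverage is just total mapped bases divided by genome length. A minimal sketch (the numbers are made up for illustration):

```python
# Minimal sketch: mean per-base coverage from mapped subread lengths.
# coverage = total mapped bases / genome size
def mean_coverage(mapped_lengths, genome_size):
    return sum(mapped_lengths) / float(genome_size)

# e.g. 20,000 molecules averaging 6 kb mapped on a 4 Mb genome:
print(mean_coverage([6000] * 20000, 4000000))  # -> 30.0
```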
Old 09-07-2014, 11:38 PM   #7
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98

Quote:
Originally Posted by mjhsieh View Post
You can try downloading it again from https://github.com/PacificBiosciences/pbdagcon since it just got updated.
Thanks, it builds now.

Quote:
Originally Posted by gconcepcion View Post
If you want high quality CCS, you need to start from SMRTCell data. Without the SMRTCell data, you are running the consensus calling algorithm quiver without the necessary quality value data (InsertionQV, DeletionQV, SubstitutionQV, and MergeQV) to generate highly accurate consensus calls.
Mmhh...okay, I'll consider that if I don't get good enough results.

Quote:
Originally Posted by rhall View Post
If I am correct in understanding that you want the coverage of single molecules (inserts rather than subread coverage), why not just select the longest subread from each read and map those against the genome? The accuracy gain from computing a consensus of the subreads from one insert (with either pbdagcon or CCS) will not make a significant difference to the mapping, and the consensus is best calculated from all the subreads using Quiver anyway.
That...actually makes sense, thanks.
Maybe I'll see if it makes a difference.
Old 09-07-2014, 11:41 PM   #8
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415

Quote:
Originally Posted by rhall View Post
If I am correct in understanding that you want the coverage of single molecules (inserts rather than subread coverage), why not just select the longest subread from each read and map those against the genome?
And the read names will tell you which reads are subreads from the same ZMW ('well'). See https://github.com/PacificBioscience...#readexplained and scroll down a bit to the part that says

Code:
<movieName>/<ZMW number>/<subread start>_<subread end>
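For instance, grouping subreads by molecule from this naming scheme could look like the following sketch (the read names in the example are made up):

```python
# Sketch: parse '<movieName>/<ZMW number>/<start>_<end>' read names
# and group subreads that belong to the same ZMW (i.e. the same molecule).
from collections import defaultdict

def parse_subread_name(name):
    movie, zmw, span = name.split("/")
    start, end = span.split("_")
    return movie, int(zmw), int(start), int(end)

def group_by_zmw(names):
    groups = defaultdict(list)
    for name in names:
        movie, zmw, start, end = parse_subread_name(name)
        groups[(movie, zmw)].append((start, end))
    return dict(groups)

# Two subreads from ZMW 7, one from ZMW 9 (hypothetical movie name):
names = ["m140905_x/7/0_5000", "m140905_x/7/5100_10200", "m140905_x/9/0_3000"]
print(group_by_zmw(names))
```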
Old 09-14-2014, 11:49 PM   #9
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98

Thanks, I've already dug through the PacBio website.

Quote:
Originally Posted by GenoMax View Post
https://github.com/PacificBiosciences/pbdagcon may be able to do this but you will have to generate blasr alignments for your reads.
Quote:
Originally Posted by bastianwur View Post
BLASR alignments are not a problem.
I might have been too fast with this ^^.
What exactly do I need to map to what?
Right now it seems I'd need a separate alignment file for every subread...or am I wrong? I can do that, but I'd rather get around it.
(computer scientists are lazy people, right? ^^)


Unrelated: there is a tremendous difference between bowtie2 and blasr alignments for the longest subreads.
The first maps 5%, the second maps 50%.
(The library is highly contaminated with E. coli + vectors, roughly up to 50%, so that fits.)
Old 09-15-2014, 08:33 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Just make a hash table keyed by read name and store a representative subread of each read in it. Then dump all of them to a single fasta/fastq file and map that, so you get one sam file from which you can calculate coverage.
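A minimal sketch of that approach, assuming the subreads are available as (name, sequence) pairs with PacBio-style names; the helper names here are made up, not from any PacBio tool:

```python
# Keep the longest subread per molecule in a dict keyed by movie/ZMW,
# then write one representative sequence per read to a single FASTA.
def longest_subreads(records):
    """records: iterable of (name, seq) pairs where name looks like
    '<movie>/<zmw>/<start>_<end>'. Returns {(movie, zmw): (name, seq)}."""
    best = {}
    for name, seq in records:
        movie, zmw, _span = name.split("/")
        key = (movie, zmw)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (name, seq)
    return best

def write_fasta(best, handle):
    """Dump the representatives to a single FASTA stream."""
    for name, seq in best.values():
        handle.write(">{}\n{}\n".format(name, seq))
```

Mapping the resulting file once then yields a single SAM to calculate coverage from.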

bowtie2 is not designed for high error rates; there's no point in using it with raw PacBio data.
Old 09-22-2014, 01:37 AM   #11
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98

Quote:
Originally Posted by Brian Bushnell View Post
Just make a hash table for each read's name, and store a representative subread of each read in it. Then dump all of it to a single fasta/fastq file, and map that, so you get one sam file from which you can calculate coverage.
Did that to get the values above.

But yeah, I guess I'll stick with that, since I'm running out of time.

Thanks.