
  • How to create CCS from subreads without smrtcell data?

    Hi everyone,

    I have a small problem at the moment.

    From our collaborators I got
    - 1 assembled genome
    - the related PacBio subreads

    I did NOT get the smrtcell data (I could ask, but well...not if I can get around it in an easy way).

    For my genome submission, I now need to calculate the coverage of my genome. If I map the subreads to the genome, this will not give an accurate result. Therefore I'd like to create the CCS from the subreads.
    I'd assume that the protocol "RS_subreads.1" in smrtportal would maybe do something like this. But I cannot even test that, because I cannot import the subreads, because I don't have the related smrtcell data.

    Does anyone have maybe any idea how I could solve this without handling the smrtcell data?

  • #2
    https://github.com/PacificBiosciences/pbdagcon may be able to do this but you will have to generate blasr alignments for your reads.

    It should be a trivial (relatively) task for the sequence provider to run the "RS_ReadsOfInsert" protocol on SMRTcells and generate the data you need. Try asking them.


    • #3
      Thanks!

      The collaborators are a bit complicated, therefore I'd try to get around them.

      BLASR alignments are not a problem.
      Installing pbdagcon is, though.
      Somehow the folder layout seems to be messed up: the build doesn't find the right header/cpp files in the right locations (most likely because some folders are not included in the default git download or in the clone), and it just doesn't compile. I've messed around with the file locations for some time and added another compiler flag (because it was complaining about some conversion), but...no...I don't get there.

      Maybe I'm doing something wrong though.
      Running just "make" in the top-level directory of the download doesn't do anything, and in the cpp directory the problems begin.

      Has anyone tested if the download compiles on another machine?


      • #4
        Originally posted by bastianwur
        Has anyone tested if the download compiles on another machine?
        You can try downloading https://github.com/PacificBiosciences/pbdagcon again, since it just got updated.


        • #5
          Originally posted by bastianwur
          Hi everyone,

          I have a small problem at the moment.

          From our collaborators I got
          - 1 assembled genome
          - the related PacBio subreads

          I did NOT get the smrtcell data (I could ask, but well...not if I can get around it in an easy way).

          For my genome submission, I now need to calculate the coverage of my genome. If I map the subreads to the genome, this will not give an accurate result. Therefore I'd like to create the CCS from the subreads.
          I'd assume that the protocol "RS_subreads.1" in smrtportal would maybe do something like this. But I cannot even test that, because I cannot import the subreads, because I don't have the related smrtcell data.

          Does anyone have maybe any idea how I could solve this without handling the smrtcell data?
          If you want high quality CCS, you need to start from SMRTCell data. Without the SMRTCell data, you are running the consensus calling algorithm quiver without the necessary quality value data (InsertionQV, DeletionQV, SubstitutionQV, and MergeQV) to generate highly accurate consensus calls.


          • #6
            If I am correct in understanding that you want the coverage of single molecules (inserts rather than the subread coverage), why don't you just select the longest subread from each read and map those against the genome. The accuracy gain from calculating consensus of the subreads from one insert (either using pbdagcon or CCS) will not result in any significant difference in the mapping, and the consensus is best calculated from all the subreads using quiver.
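
            Once the longest subreads are mapped, a rough mean coverage can be estimated directly from the SAM output. The following is only a minimal sketch under stated assumptions: the function names are made up for illustration, unmapped and secondary records are skipped, and deletions are counted as covered reference bases (a depth tool may count them differently).

```python
import re
from typing import Iterable

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Assumption: M, =, X and D count as covered reference bases.
COVERING_OPS = set("MD=X")


def aligned_reference_bases(sam_lines: Iterable[str]) -> int:
    """Sum the reference bases spanned by primary/supplementary alignments."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        flag, cigar = int(fields[1]), fields[5]
        # Skip unmapped (0x4) and secondary (0x100) records.
        if flag & 0x104 or cigar == "*":
            continue
        total += sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                     if op in COVERING_OPS)
    return total


def mean_coverage(sam_lines: Iterable[str], genome_length: int) -> float:
    """Mean coverage = aligned reference bases / genome length."""
    return aligned_reference_bases(sam_lines) / genome_length
```

            The genome length would come from the assembly itself; passing an open SAM file handle as sam_lines should work, since the function only iterates over lines.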


            • #7
              Originally posted by mjhsieh
              You can try downloading https://github.com/PacificBiosciences/pbdagcon again, since it just got updated.
              Thanks, it builds now.

              Originally posted by gconcepcion
              If you want high quality CCS, you need to start from SMRTCell data. Without the SMRTCell data, you are running the consensus calling algorithm quiver without the necessary quality value data (InsertionQV, DeletionQV, SubstitutionQV, and MergeQV) to generate highly accurate consensus calls.
              mmhh....okay, will consider that, if I don't get good enough results.

              Originally posted by rhall
              If I am correct in understanding that you want the coverage of single molecules (inserts rather than the subread coverage), why don't you just select the longest subread from each read and map those against the genome. The accuracy gain from calculating consensus of the subreads from one insert (either using pbdagcon or CCS) will not result in any significant difference in the mapping, and the consensus is best calculated from all the subreads using quiver.
              That...actually makes sense, thanks.
              Maybe I'll see if it makes a difference.


              • #8
                Originally posted by rhall
                If I am correct in understanding that you want the coverage of single molecules (inserts rather than the subread coverage), why don't you just select the longest subread from each read and map those against the genome.
                And, the read names will tell you which reads are subreads of the same ZMW ('well'). See https://github.com/PacificBioscience...#readexplained, scroll a bit down to the part that says

                Code:
                <movieName>/<ZMW number>/<subread start>_<subread end>
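
                To illustrate how such a read name can be split into its parts, here is a small sketch. It assumes names of the form movie/ZMW/start_end, possibly preceded by ">" or "@" and followed by extra whitespace-separated fields; SubreadName and parse_subread_name are hypothetical helper names, not part of any PacBio tool.

```python
from typing import NamedTuple


class SubreadName(NamedTuple):
    movie: str  # movie name
    zmw: int    # ZMW ('well') number
    start: int  # subread start within the polymerase read
    end: int    # subread end within the polymerase read


def parse_subread_name(name: str) -> SubreadName:
    """Split '<movie>/<zmw>/<start>_<end>' into its parts."""
    # Drop a leading '>' or '@' and anything after the first whitespace.
    token = name.lstrip(">@").split()[0]
    movie, zmw, span = token.split("/")[:3]
    start, end = span.split("_")
    return SubreadName(movie, int(zmw), int(start), int(end))
```

                Subreads that share the same movie and ZMW number come from the same well, so grouping on those two fields identifies the subreads of one insert.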


                • #9
                  Thanks, I've already dug through the PacBio website.

                  Originally posted by GenoMax
                  https://github.com/PacificBiosciences/pbdagcon may be able to do this but you will have to generate blasr alignments for your reads.
                  Originally posted by bastianwur
                  BLASR alignments are not a problem.
                  I might have been too fast with this ^^.
                  What exactly do I need to map to what?
                  Right now it seems that I'd need a separate alignment file for every subread...or am I wrong? I can do that, but would rather get around it.
                  (computer scientists are lazy people, right ^^?)


                  Unrelated: there is a tremendous difference between the bowtie2 and blasr alignments of the longest subreads.
                  bowtie2 maps 5%, blasr maps 50%.
                  (The library is highly contaminated with E. coli and vectors, roughly up to 50%, so that fits.)


                  • #10
                    Just make a hash table for each read's name, and store a representative subread of each read in it. Then dump all of it to a single fasta/fastq file, and map that, so you get one sam file from which you can calculate coverage.
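
                    A minimal sketch of that idea, assuming the subreads are in a FASTA file with names of the form movie/ZMW/start_end and taking the longest subread as the representative. The function names and file names are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_subread_per_zmw(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map '<movie>/<zmw>' to the (header, sequence) of its longest subread."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        # Key on movie and ZMW; the trailing '<start>_<end>' differs per subread.
        key = "/".join(header.split()[0].split("/")[:2])
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return best


def write_fasta(best: Dict[str, Tuple[str, str]], out: TextIO) -> None:
    """Dump the selected subreads to a single FASTA file."""
    for header, seq in best.values():
        out.write(f">{header}\n{seq}\n")


# Hypothetical usage (file names are placeholders):
# with open("subreads.fasta") as fin, open("longest_subreads.fasta", "w") as fout:
#     write_fasta(longest_subread_per_zmw(read_fasta(fin)), fout)
```

                    The resulting single file can then be mapped in one run, which gives one sam file for the coverage calculation.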

                    bowtie2 is not designed for high error rates; no point in using that with raw PacBio data.


                    • #11
                      Originally posted by Brian Bushnell
                      Just make a hash table for each read's name, and store a representative subread of each read in it. Then dump all of it to a single fasta/fastq file, and map that, so you get one sam file from which you can calculate coverage.
                      Did that to get the above values.

                      But yeah, I guess I'll stay with that, since I'm running out of time.

                      Thanks.
