  • PBcR hybrid assembly of 6.5MB genome with PacBio and MiSeq

    Hi all,

    I'm using PBcR to assemble PacBio reads, error-correcting them with PE250 Illumina data. Before obtaining the Illumina data, our collaborators assembled the PacBio data by itself with HGAP3 into one contig at ~6.7MB. But with PacBio's high error rate, we wanted to correct with our Illumina data.

    We have 2056898 PE250 read pairs, which we had to quality-trim aggressively because the run was low quality.

    We have 117765 PacBio reads with an average read length of ~4.5kb.
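
As a rough sanity check on these numbers, expected fold coverage can be estimated as total sequenced bases divided by genome size. The sketch below assumes the 6.7 Mb genome size passed to PBcR in this post and treats every Illumina read as a full 250 bp, so the Illumina figure is only an upper bound (aggressive trimming will have lowered it). The function name is illustrative, not part of any tool.

```python
def fold_coverage(num_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Expected fold coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size


GENOME_SIZE = 6_700_000  # genomeSize value used in the PBcR command in this post

# 117,765 PacBio reads at an average of ~4.5 kb
pacbio_cov = fold_coverage(117_765, 4_500, GENOME_SIZE)

# 2,056,898 pairs x 2 mates x 250 bp; an upper bound, because trimming shortens reads
illumina_cov_max = fold_coverage(2 * 2_056_898, 250, GENOME_SIZE)

print(f"PacBio: ~{pacbio_cov:.0f}x")
print(f"Illumina (untrimmed upper bound): ~{illumina_cov_max:.0f}x")
```

Under these assumptions the PacBio data alone comes to roughly 79x, and the Illumina data to at most about 150x.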

    When I use PBcR to correct and assemble, we end up with 310 contigs and a 4.5MB genome.

    The commands I ran were based on the website documentation.

    Code:
    ~/tools/wgs-8.3rc1/Linux-amd64/bin/PBcR -length 500 -partitions 200 -l NAME -s pacbio.spec -fastq PacBio.fastq genomeSize=6700000 illumina.frg
    And here is the reference code from the documentation:

    Code:
    % cd sampleData/
    % <wgs>/<Linux-amd64>/bin/fastqToCA -libraryname illumina -technology illumina -type sanger -innie -reads illumina.fastq > illumina.frg
    % <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambdaIll -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 illumina.frg

    Does anyone have an idea of what's going on here and how to improve this? Thanks in advance.

  • #2
    By your own admission the Illumina data is not that good, but what happens when you align the Illumina reads to the original PacBio assembly? Can you provide some stats?


    • #3
      We did quality-trim the reads aggressively, so I was hoping the Illumina data wouldn't bring down the quality of the PacBio-only assembly this much. PBcR uses the Illumina data only for PacBio error correction, correct? Would this indicate that the original PacBio-only assembly was not completely accurate? Or should the Illumina data, even aggressively quality-trimmed, just be ignored?

      Here are alignment results against the PacBio assembly:

      Code:
      2056898 reads; of these:
        2056898 (100.00%) were paired; of these:
          155844 (7.58%) aligned concordantly 0 times
          1826141 (88.78%) aligned concordantly exactly 1 time
          74913 (3.64%) aligned concordantly >1 times
          ----
          155844 pairs aligned concordantly 0 times; of these:
            32818 (21.06%) aligned discordantly 1 time
          ----
          123026 pairs aligned 0 times concordantly or discordantly; of these:
            246052 mates make up the pairs; of these:
              137416 (55.85%) aligned 0 times
              91214 (37.07%) aligned exactly 1 time
              17422 (7.08%) aligned >1 times
      96.66% overall alignment rate
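
The summary above looks like the paired-end report format that Bowtie 2 prints, although the post does not name the aligner. Assuming that format, the overall rate is counted per mate rather than per pair, and it can be reproduced from the other lines as a consistency check. The function and parameter names below are illustrative only.

```python
def overall_alignment_rate(total_pairs, conc_once, conc_multi,
                           discordant, mates_once, mates_multi):
    """Fraction of individual mates that aligned in any way.

    Pairs aligned concordantly or discordantly contribute two mates each;
    the remaining pairs contribute only the mates that aligned on their own.
    """
    aligned_mates = 2 * (conc_once + conc_multi + discordant) + mates_once + mates_multi
    return aligned_mates / (2 * total_pairs)


# Counts copied from the alignment summary in this post
rate = overall_alignment_rate(
    total_pairs=2_056_898,
    conc_once=1_826_141,
    conc_multi=74_913,
    discordant=32_818,
    mates_once=91_214,
    mates_multi=17_422,
)
print(f"{rate:.2%}")
```

This gives the same 96.66% as the report, so the counts are internally consistent.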


      • #4
        Originally posted by DrSpace
        We did quality-trim the reads aggressively, so I was hoping the Illumina data wouldn't bring down the quality of the PacBio-only assembly this much. PBcR uses the Illumina data only for PacBio error correction, correct? Would this indicate that the original PacBio-only assembly was not completely accurate? Or should the Illumina data, even aggressively quality-trimmed, just be ignored?
        I was thinking that your PacBio assembly is of reasonably good quality. I wonder if the Illumina data is actually causing a problem.


        • #5
          I thought that, in general, one should not use an assembly built from PacBio data alone because of its higher error rate. Maybe the self-correction steps used in most PacBio assemblies make up for this more than I imagine. Forgive my ignorance on the topic; I primarily work with Illumina data. I suppose I should also try running PBcR on the PacBio data by itself and see what happens.


          • #6
            HGAP handles all of that internally: https://github.com/PacificBioscience...-SMRT-Analysis

            I am not sure if PBcR would improve things but if you have the time you could try it.


            • #7
              Originally posted by DrSpace
              Before obtaining the Illumina data, our collaborators assembled the PacBio data by itself with HGAP3 into one contig at ~6.7MB.
              (this answers some of the previous posts).

              Originally posted by DrSpace
              But with PacBio's high error rate, we wanted to correct with our Illumina data.
              You already seem to have a very good assembly. I recommend trying Pilon (http://www.broadinstitute.org/software/pilon/) to polish your assembly and remove any leftover errors.
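
For anyone who wants to try this, here is a minimal sketch of what a polishing run might look like: align the Illumina pairs to the PacBio assembly, sort and index the BAM, then hand both to Pilon. It only builds and prints the commands. The choice of Bowtie 2 as the aligner, all file names, and the `pilon.jar` path are assumptions for illustration and are not taken from this thread.

```python
import shlex


def pilon_polish_commands(assembly, reads_1, reads_2,
                          prefix="polished", pilon_jar="pilon.jar"):
    """Return the command lines for one round of Pilon polishing (nothing is executed)."""
    index = f"{prefix}.bt2index"    # hypothetical Bowtie 2 index prefix
    sam = f"{prefix}.illumina.sam"  # hypothetical intermediate file names
    bam = f"{prefix}.illumina.sorted.bam"
    return [
        ["bowtie2-build", assembly, index],
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", sam],
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # --changes asks Pilon to write a .changes file listing every edit it made
        ["java", "-jar", pilon_jar, "--genome", assembly, "--frags", bam,
         "--output", prefix, "--changes"],
    ]


# Hypothetical input names; substitute your own assembly and trimmed reads
for cmd in pilon_polish_commands("hgap3_assembly.fasta", "trimmed_R1.fastq", "trimmed_R2.fastq"):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running anything. The `.changes` output is useful for inspecting what Pilon actually altered.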


              • #8
                Originally posted by flxlex
                (this answers some of the previous posts).

                You already seem to have a very good assembly. I recommend trying Pilon (http://www.broadinstitute.org/software/pilon/) to polish your assembly and remove any leftover errors.
                Thanks for the tip, I'll give that a try.


                • #9
                  We recently came across this same issue. The "PacBio only" assembly turned out to be far superior to any hybrid scheme (using either SPAdes or PBcR).

                  We tried using Pilon to polish the PacBio assemblies and got some interesting results. Looking at the Pilon *.changes files, there are lots of G/C single insertions (from the MiSeq perspective relative to PacBio). Is this a known behavior? Forgive me, this is my first experience w/ PacBio data.
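
To put a number on an observation like this, the single-base indels in a `.changes` file can be tallied by base. The sketch below assumes each line has four whitespace-separated fields (original coordinate, polished coordinate, original sequence, new sequence) with `.` standing for an empty side, so an insertion relative to the PacBio assembly has `.` in the third field. The example lines are made up for illustration and are not from the attached files.

```python
from collections import Counter


def tally_single_base_indels(lines):
    """Count single-base insertions and deletions by base in Pilon-style change lines."""
    insertions, deletions = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        _, _, old, new = fields
        if old == "." and len(new) == 1:
            insertions[new.upper()] += 1  # base added relative to the original assembly
        elif new == "." and len(old) == 1:
            deletions[old.upper()] += 1   # base removed from the original assembly
    return insertions, deletions


# Made-up example lines in the assumed format
example = """\
contig1:1042 contig1_pilon:1042 . G
contig1:2310 contig1_pilon:2311 . C
contig1:5120 contig1_pilon:5122 A .
contig1:7001 contig1_pilon:7002 T C
""".splitlines()

ins, dels = tally_single_base_indels(example)
print("insertions:", dict(ins))
print("deletions:", dict(dels))
```

On a real file, pass an open file handle instead of `example`. Substitutions and multi-base changes are ignored by design.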

                  Thanks,
                  Fan

                  • #10
                    PacBio gives some excellent assemblies, as you have all seen. Actually, I just use PacBio data alone in PBcR and it gives excellent results: better than HGAP3 in many cases, in terms of fewer contigs.

                    You could try rerunning Quiver on the assembly to correct any further errors; there are good docs on this on the PacBio site.

                    If you insist on using the Illumina data (for correction, not for assembly), why not just align classically and call SNP and indel differences? Then eyeball the differences.

                    Pilon is also a good choice.

                    If this is a bacterium you can also do a draft annotation and check for frequent frameshifts caused by indels.
