  • k-gun12
    Member
    • Feb 2010
    • 56

    Systemic problem with PacBio data and chimeric contigs

    I've got ~30x coverage of a small (<100 Mb) algal genome from a PacBio RSII. I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I BLASTed the genome against itself and found dozens of repeated DNA regions, some longer than 50 kbp, that occur in multiple contigs, usually at the ends but not always. The Canu developers helped tweak my run a bit, but the problem persisted. Recently I used the same workflow with a different alga and saw the exact same problem, and I have since spoken to another lab (working on corals) with identical issues using SMRTmake (not sure if it was HGAP.3 or not). It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this? My runs were all done on different instruments with different extraction protocols. Is the RSII creating chimeric reads? Thanks in advance.
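For anyone wanting to reproduce the self-comparison described above, here is a minimal sketch that pulls long, high-identity hits between different contigs out of BLAST tabular output (`-outfmt 6`). The function names, the 10 kbp length cutoff and the 95% identity cutoff are assumptions chosen for illustration, not values from this thread.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query subject identity length")

def parse_self_blast(lines, min_len=10_000, min_identity=95.0):
    """Return long, high-identity hits between two *different* contigs.

    Expects BLAST tabular rows (outfmt 6): qseqid, sseqid, pident, length, ...
    The thresholds are illustrative assumptions.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if query == subject:
            continue  # a contig matching itself is not informative here
        if length >= min_len and identity >= min_identity:
            hits.append(Hit(query, subject, identity, length))
    return hits

def shared_pairs(hits):
    """Collapse reciprocal hits into unordered contig pairs (longest hit kept)."""
    pairs = {}
    for hit in hits:
        key = tuple(sorted((hit.query, hit.subject)))
        pairs[key] = max(pairs.get(key, 0), hit.length)
    return pairs

# Made-up rows, only to show the expected input shape.
example_rows = [
    "ctg1\tctg1\t100.0\t500000\t0\t0\t1\t500000\t1\t500000\t0.0\t900000",
    "ctg1\tctg2\t99.2\t52000\t400\t10\t448001\t500000\t1\t52000\t0.0\t90000",
    "ctg2\tctg1\t99.2\t52000\t400\t10\t1\t52000\t448001\t500000\t0.0\t90000",
    "ctg1\tctg3\t88.0\t60000\t7000\t50\t1\t60000\t1\t60000\t0.0\t50000",
    "ctg3\tctg4\t99.9\t800\t1\t0\t1\t800\t1\t800\t0.0\t1400",
]
example_pairs = shared_pairs(parse_self_blast(example_rows))
```

With real data the rows would come from a file, e.g. `parse_self_blast(open("self_vs_self.tsv"))`, where the file name is hypothetical.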
  • gconcepcion
    Member
    • Dec 2010
    • 68

    #2
    One way to circumvent this is with the overlap_filtering_setting option in FALCON. It lets you filter out "chimeric contigs" because overlap coverage differs across such a contig: coverage in repetitive regions will be much higher than everywhere else.

    I'm not aware of a similar setting in Canu.
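To make the idea concrete, here is a rough sketch of coverage-based overlap filtering: a read is kept only if the number of overlaps at each of its ends is neither too low nor too high, and the two ends are reasonably balanced. This is an illustration of the principle only, not FALCON's implementation; the function names and threshold values are assumptions.

```python
def keep_read(left_overlaps, right_overlaps, min_cov=2, max_cov=60, max_diff=40):
    """Decide whether a read looks like it comes from normal-coverage sequence.

    left_overlaps / right_overlaps: number of other reads overlapping each end.
    All thresholds are illustrative assumptions and depend on sequencing depth.
    """
    if min(left_overlaps, right_overlaps) < min_cov:
        return False  # too little support at one end
    if max(left_overlaps, right_overlaps) > max_cov:
        return False  # pile-up typical of repeats
    # Very unbalanced ends suggest a read running from unique sequence into a repeat.
    return abs(left_overlaps - right_overlaps) <= max_diff

def filter_reads(overlap_counts, **thresholds):
    """Return the sorted names of reads that pass keep_read()."""
    return sorted(
        name
        for name, (left, right) in overlap_counts.items()
        if keep_read(left, right, **thresholds)
    )

# Made-up overlap counts for a ~30x data set.
example_counts = {
    "read_a": (28, 31),   # balanced, normal coverage
    "read_b": (30, 240),  # one end sits in a repeat pile-up
    "read_c": (0, 25),    # one end has no support
    "read_d": (55, 10),   # ends differ by more than max_diff
}
kept = filter_reads(example_counts)
```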
    Last edited by gconcepcion; 11-15-2016, 11:58 AM. Reason: clarity


    • rhall
      Senior Member
      • Aug 2012
      • 324

      #3
      There is always a non-zero chance of creating biological chimeras in sample prep. Adapters are blunt-end ligated to the sheared DNA, so it is always possible that fragments ligate to one another before adapters are attached. Obviously the adapter concentration is optimized to minimize this, and in general biological chimeras are extremely rare, but mistakes in sample prep can result in much higher numbers.
      Even if biological chimeras do occur, they are random, so they should not have support from other reads, i.e. the first step of assembly corrects them. But with bad sample prep it is possible that chimeras, due to their large number, pass correction and result in misassemblies. As pointed out in the post above, preassembly can be parameterized to better handle high levels of biological chimeras, for example by requiring higher coverage for correction or by not using multiple subreads from the same molecule (not using -a in FALCON), but this will depend on the extent of the problem and the assembler being used.
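The "no support from other reads" argument can be sketched as a simple depth check: if other reads are aligned to a candidate read, a random chimeric junction should show up as an internal stretch that almost no alignment spans. The function name, the minimum support of 2 and the 500 bp end margin below are assumptions for illustration; real correction pipelines work differently in detail.

```python
def unsupported_regions(read_len, alignments, min_support=2, end_margin=500):
    """Return internal (start, end) intervals covered by < min_support alignments.

    alignments: (start, end) half-open coordinates, on the candidate read,
    of other reads aligned to it. Positions within end_margin of either read
    end are ignored because depth naturally drops there.
    """
    # Difference array: +1 where an alignment starts, -1 where it ends.
    delta = [0] * (read_len + 1)
    for start, end in alignments:
        start, end = max(0, start), min(read_len, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    regions, depth, region_start = [], 0, None
    for pos in range(read_len):
        depth += delta[pos]
        inside = end_margin <= pos < read_len - end_margin
        low = inside and depth < min_support
        if low and region_start is None:
            region_start = pos
        elif not low and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, read_len))
    return regions

# Made-up example: no alignment crosses the junction near position 5000.
chimeric = unsupported_regions(
    10_000, [(0, 4900), (200, 5000), (5100, 10_000), (5050, 9800)]
)
# Made-up example: every internal position is spanned by at least two reads.
clean = unsupported_regions(10_000, [(0, 6000), (0, 10_000), (3000, 10_000)])
```

If many chimeras share the same junction, as could happen with a systematic sample prep problem, this kind of check would not flag them, which is the failure mode described above.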


      • k-gun12
        Member
        • Feb 2010
        • 56

        #4
        Thanks. I have not yet tried FALCON. Maybe it's worth a shot. I think heterozygosity is a real problem for PacBio and I'm wondering if it is causing some of my issues. My samples are multi-isolates and have not spent years in culture, which would breed out variation. I dug up this thread:

        Hi everyone ! I'm trying to use Canu in order to assemble the D. suzukii genome. As flies genome are genes dense (genes are very close to each others), and as the D. suzukii species contains a lot ...


        That seems to mirror my issues as well. When I noticed this problem, my first thought was "this can't apply only to me", since it was present in every assembly we've made using RSII data regardless of coverage, but perhaps most other folks are using clonal lines or inbred populations.


        • gconcepcion
          Member
          • Dec 2010
          • 68

          #5
          Heterozygosity is a real issue for assembling data from any technology, not just PacBio. This is likely to be an issue with any multi-isolate algal culture. The best approach for algae is to do single-cell isolates and then grow them into a clonal culture. I spent a lot of time as an undergrad and postdoc doing single-cell algal isolates. Not difficult, just tedious. Serial dilutions are key...


          • k-gun12
            Member
            • Feb 2010
            • 56

            #6
            I agree, but Illumina sequencing of these same cultures would not exhibit this problem. Granted, the assembly was in thousands and thousands of contigs, but there was no redundancy and the gene predictions could be trusted. Right now, I'd rather have a fragmented assembly that accurately reflects copy number instead of what outwardly appears to be very large and duplicated gene families. I suppose it depends on where your priorities are.


            • rhall
              Senior Member
              • Aug 2012
              • 324

              #7
              It's always going to be difficult to assemble something that is highly heterozygous. If you have Illumina data you may want to try MaSuRCA (http://www.genome.umd.edu/masurca.html); there is some evidence that this approach better maintains the separation of haplotypes before overlap assembly.


              • rhall
                Senior Member
                • Aug 2012
                • 324

                #8
                I'm having a problem understanding why an Illumina assembly wouldn't show the same problem. Is the assumption that areas of high heterozygosity simply get broken in the de Bruijn graph? At some point, even with Illumina data, you will assemble out different haplotypes, particularly in highly heterozygous regions.
                Why not just filter the PacBio contigs for consistent expected coverage of raw reads?
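A minimal sketch of that coverage filter, assuming you have already mapped the raw reads back to the contigs and computed a mean depth per contig with whatever tool you prefer. The function name and the 0.6x / 1.6x cutoffs relative to the median are assumptions for illustration.

```python
from statistics import median

def classify_by_coverage(mean_depth, expected=None, low=0.6, high=1.6):
    """Label each contig by its raw-read depth relative to the expected depth.

    mean_depth: dict of contig name -> mean depth of raw reads mapped back.
    expected:   expected single-copy depth; defaults to the median over contigs.
    low / high: ratio cutoffs (illustrative assumptions, tune per data set).
    """
    if expected is None:
        expected = median(mean_depth.values())
    labels = {}
    for contig, depth in mean_depth.items():
        ratio = depth / expected
        if ratio < low:
            labels[contig] = "low"       # might be a separated haplotype
        elif ratio > high:
            labels[contig] = "high"      # might be organelle or collapsed repeat
        else:
            labels[contig] = "expected"
    return labels

# Made-up depths for a ~30x data set.
example_depth = {"ctg1": 30.0, "ctg2": 29.0, "ctg3": 14.0, "ctg4": 310.0, "ctg5": 31.0}
example_labels = classify_by_coverage(example_depth)
```

Contigs labelled "low" or "high" are candidates for closer inspection rather than automatic removal, since a per-contig mean hides local coverage changes within a contig.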


                • cstack
                  Member
                  • May 2017
                  • 16

                  #9
                  Originally posted by k-gun12
                  I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I blasted the genome into itself and found dozens and dozens of repeated DNA regions up to and gt 50kbp that occur in multiple contigs - usually at the ends but not always.
                  Could these be true repetitive sequences? They might occur at the ends of contigs because it is difficult to assemble through long stretches of repeats.

                  Originally posted by k-gun12
                  It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this?
                  I had the same thing happen recently when I used PBJelly to fill in the gaps of a plant genome assembly using ~20x PacBio coverage. A large (~40kbp) fragment that seems to belong to the chloroplast was placed in the middle of a very large 10Gbp scaffold. The fragment was nested in a region with a lot of repetitive sequence, and it might have represented an LTR transposon, based on some quick scans with RepeatMasker.

                  I assumed that PBJelly was misplacing an LTR transposon or other repetitive sequence.

                  How did you work this out in the end?

