Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • PacBio assembly evaluation, how to chose the 'best' one

    I've assembled several microbial genomes with the PacBio smrtportal 2.2.0 using various settings and assemblers (HGAP2 , HGAP3 on the same data) with coverages of about 90x up to 185x. I've been following recommendations on finishing that are on the github wiki to check for assembly artefacts.

    Although I get nicely closed contigs with these methods I do see about 50-100 differences between the various final assemblies, mainly indels without a preference for a particular type of assembly strategy.

    Some of these indels do have an effect on gene prediction and annotation, particularly frameshifts or truncation of proteins, again for as far as I've been able to see without a preference for a certain strategy. I've some examples where the 'correct' gene is predicted on a default hgap3 assembly but the hgap2 +RS_resequencing consensus contained frameshifts compared to highly similar genes in related organisms and vice versa.

    What would be a good strategy to assess the minute differences between these different assemblies in a structural way?

  • #2
    Two things you can do:

    - rerun quiver (the final step in the pipeline, RS_Resequencing I believe), this may help
    - polish with Illumina data, e.g. 50-100x MiSeq (different tools can do this, e.g. Pilon http://www.broadinstitute.org/software/pilon/)

    Comment


    • #3
      Hi VidJa,

      try QUAST (http://bioinf.spbau.ru/quast). It's a pretty handy tool for doing both statistical and biological evaluation of assemblies. It has a web interface but alternatively you can also run it locally from the command line.

      cheers

      Micha

      Comment


      • #4
        +1 for re-runing quiver, particularly if you have run any kind of circularization on the genome. I would also check the QV of the consensus calls that are problematic, note a lowercase call in the consensus is an indication that quiver was unable to call consensus for that region, e.g. no uniquely mapped reads.

        Comment


        • #5
          Another good (and quick) check can be the coverage.
          Just re-map all your reads onto the obtained genome and look at the coverage (should be already in your final results as well).
          If you look at the obtained coverage it should be very uniformly distributed among the entire genome. If you have sudden increase or
          decrease of coverage it often points to miss-assemblies.

          Comment


          • #6
            You need to be careful with that.
            I had some spikes in my Illumina PE mapping, which looked weird, and I thought about collapsed repeats. But later I got PacBio data, and just for "fun" I mapped my PE data to them, and the same spikes appeared.

            Comment


            • #7
              Testing circular genomes

              Hi Rhall,

              Can you please help with genome circularization analysis with Pacbio

              1. I have 2 contigs assembly. C1 - 4.3 Mb C2 - 13KB (average coverage above 40X)
              2. The C2 completely match with C1 -
              C2_chunk1 - first 1-6243 at location between 3.45-3.46 MB
              C2_chunk2 - Next 6216-13244 maps somewhere 2.49 MB
              There is 29KB gap between C2_chunk1 and C2_chunk2
              There are some mismatches in alignment.

              Now, I have following questions:
              1. Can C2 be a spurious contig as it is very small and completely contained withing C1? PacBio recommended to discard contigs below coverage 20X.
              2. Also, I tried to run Gepard with C1 to test circularization. Can you provide some link/guide to some example dot plots of circular genomes.
              3. Any other tools to check circularization? (Currently, I am trying to break the Contig1 and mapping the reads against it to check if reads support the overlap).

              Thanks
              sagar

              Comment


              • #8
                1. It is possible that contig C2 actually represents a sub-population of the sample being sequenced. This can be difficult to validate, one option is to try running the bridgemapper protocol (after removing the small contig from the reference) in SMRT Analysis and looking at the region to which it maps to see is there is evidence in the raw reads that the sample is not clonal in the region. It can also be useful to just blast the extra contig, or region to which it maps, and see if it hits any phage, or common insertion elements.
                2. http://files.pacb.com/Training/Circu...ard/story.html
                https://github.com/PacificBioscience...g-and-trimming
                http://omicsomics.blogspot.com/2013/...from-more.html
                3. Again, the Bridgemapper protocol can be used to check for circularity, as the reads at the end of the contig will show a bridged mapping.

                Comment


                • #9
                  Originally posted by sagarutturkar View Post
                  Hi Rhall,

                  Can you please help with genome circularization analysis with Pacbio

                  1. I have 2 contigs assembly. C1 - 4.3 Mb C2 - 13KB (average coverage above 40X)
                  2. The C2 completely match with C1 -
                  C2_chunk1 - first 1-6243 at location between 3.45-3.46 MB
                  C2_chunk2 - Next 6216-13244 maps somewhere 2.49 MB
                  There is 29KB gap between C2_chunk1 and C2_chunk2
                  There are some mismatches in alignment.

                  Now, I have following questions:
                  1. Can C2 be a spurious contig as it is very small and completely contained withing C1? PacBio recommended to discard contigs below coverage 20X.
                  2. Also, I tried to run Gepard with C1 to test circularization. Can you provide some link/guide to some example dot plots of circular genomes.
                  3. Any other tools to check circularization? (Currently, I am trying to break the Contig1 and mapping the reads against it to check if reads support the overlap).

                  Thanks
                  sagar
                  Hi,

                  In gepard, check the top right and bottom left corner of your alignment to see if there is a diagonal line which indicates overlapping ends to find out if your contig is circular, just align your contig together on both x and y axis.

                  Besides gepard, there is also a perl script (check_circularity.pl) in a long reads error correction package named SPRAI (http://zombie.cb.k.u-tokyo.ac.jp/sprai/index.html) (you can easily download it by using local package manager (LPM), which claimed to be able to identify circular contigs and remove the redundant ends, you might like to give it a try although when I try it using my gepard-validated circular contig it keeps showing that my circular contigs are linear contigs.

                  I am also trying to find out a tool which can help to trim the redundant ends of a circular contig and I find the AMOS packages indicated in the circularizing and trimming tutorial to be hard to install, any suggestion or page which can give a better guideline in terms of installation (for beginner in bioinformatics).

                  Thanks.

                  Comment


                  • #10
                    @EulnayM I see the same thing with Gepard and sprai's check_circularity.pl script. Gepard shows clear ~6kbp overlap (after zooming in on the 5'-end to 3'-end region of the dot plot) with ~100% identity, but check_circularity.pl reports it as linear. Has anyone debugged this script, or found a better replacement out there?

                    Comment


                    • #11
                      So long as you are comftable with using the command line, and have SMRT Analysis installed, https://github.com/PacificBioscience...-circularizemk can be used to check for circular contigs.
                      AMOS packages indicated in the circularizing and trimming tutorial to be hard to install
                      The executables that are needed for AMOS are now part of SMRT Analysis, if you have 2.3 installed.

                      Comment


                      • #12
                        @rhall Interesting ... I haven't seen this smrtmake stuff before. I'd note that the default OVLSIZE (20000) for circularize.mk seems large; I'm looking at two circular bacterial genomes (~5Mbp size) produced by HGAP.3 that each have overlapping ends in the 6-8kbp range. Also the overlapping regions aren't identical - they've got some indels in them, and could be as low a 99 or 98% identical.

                        So it seems like there are two options. (1) if the assembler has left overlapping ends, detect those, trim one of them, maybe permute the sequence to start at some commonly used gene (origin of replication), then check that reads still align across the implied join that's now somewhere within the fasta sequence, maybe using QUIVER to also do some final polishing. (2) if there aren't overlapping ends, use bridgemapper to see if the ends are reasonable close to each other? Then, the scaffolding protocol? If there's some good reason (like some incredibly long repeats, or low coverage) that the assembly failed at the ends in the first place, this seems like it's bound to fail.

                        The problem is, all this stuff takes eyes-on time for each sequence, chromosome, plasmid, whatever. What we really need are some assemblers or scaffolders that are aware of circularity.

                        Comment


                        • #13
                          The overlap size is a maximum (overlap is a function of the size of the preassembled reads, overlaps approaching 20kb are not uncommon given really good libraries), overlaps in the range 6-8kb should be found, the default minimum identity is 98%.
                          To robustly detect circularity I would look at the assembly overlap graph stage. Using the Falcon toolbox, it should be relatively simple to label circular contigs when constructing them from the graph. It should also be possible to detect circularity at the same stage in Celera assembler, but this would likely be more difficult to implement.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          18 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          22 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          16 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          47 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X