Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Best methods for merging heterozygous contigs?

    Greetings!

    I'm dealing with an assembly from a diploid organism and am in the process of reducing it to a haploid assembly as much as I can. When I map my illumina reads back to my contigs using BBmap I get a very distinct bimodal distribution.

    Click image for larger version

Name:	contig_coverage.jpeg
Views:	1
Size:	78.5 KB
ID:	308622

    As you can see from the image there are two distinct coverage peaks and I assume the higher coverage peak represents the contigs that are haploidified pieces of the genome whereas the lower coverage peak represents the contigs that are separately assembled haplotypes each getting half the reads.

    If that is the case I would like to know how you would go about finding out which contigs are duplicates and can be merged. I've tried blasting all vs. all after repeat masking but the local alignment algorithm only returns small pieces that match and I don't know how to extract the information I need for whole contigs.

    How do you deal with this issue?

    Cheers,
    -Jason
    Last edited by Nordic; 11-18-2014, 03:04 AM.

  • #2
    The assembler AllPaths-LG has a "haploidify" mode which will try to reduce polymorphic sites into a single haplotype; it's somewhat successful. Though that's not the easiest assembler to use.

    If you want to keep your current assembly, you can do an all-vs-all alignment with Dedupe, which does global alignments. It can automatically remove contigs that match others within a specified hamming distance or edit distance, leaving just one unique copy. It only does this for containments; in situations where one overlaps the other but neither is fully contained, it will keep both of them. However, it can produce a graph (in dot format) of all of the overlaps that fulfill your criteria that you can post-process to manually identify and remove semi-duplicates.

    Depending on how highly heterozygous your organism is, Dedupe could solve the problem very quickly easily, though if the het rate is really high (like 1/20) it might not be effective at all.

    Command line:

    dedupe2.sh in=contigs.fa out=unique.fa am=t ac=t fo c mo=500 e=15 nam=4 k=25 dot=graph.dot

    In this case, "e=15" allows a maximum edit distance of 15, "mo=500" requires a minimum overlap of 500bp. You can adjust those. Alternately, "s=15" instead of "e=15" would allow 15 substitutions but no indels.

    Dedupe is packaged with BBMap, so you already have it.

    Comment


    • #3
      Hi Brian,

      First let me thank you for the suite of tools you have made available to the community. I have used them extensively for months now and find that many of the programs provide a fast and easily tune-able solution for the common problems encountered in genome assembly. If I wish to cite you for these tools, how do you prefer I do so?

      Ok, to the problem at hand. This assembly is in fact the result of an Allpaths assembly with the haploidify option enabled. The organism in question is an inbred sample of a wild butterfly population so heterozygosity is to be expected. I've already used the dedupe tool to absorb exact matches, of which there were about 20 in the ~60,000 contigs. When run to absorb matches and containments I get results like so:

      Your suggestion about using the clustering algorithm, however, takes me into unfamiliar territory. When I run dedupe2 on the contig fasta file with

      dedupe2.sh \
      in=ap.fasta \
      out=ap_uniq.fasta \
      am=t \
      ac=t \
      fo=t \
      c=t \
      mo=500 \
      e=15 \
      nam=4 \
      k=25 \
      dot=ap.dot

      I get back a fasta file which appears to be the subset of clustered contigs as ap_uniq.fasta containing about 2000 contigs and their relationship in the ap.dot file. Now I don't know much about how your clustering algorithm works, or how clustering algorithms work in general, but I was hoping you could shed some light on things for for.

      Briefly, how does your clustering algorithm work?
      Does the program look for reverse-complement overlaps?
      How are the suffix/prefixes of contigs used and how does increasing the number stored affect the program? For example, when I increase the number of stored suffixes the number of containments increases as well.
      Whats the difference between dedupe and dedupe2 (or for that matter any of the other tools and tool2)?

      Two examples of graphs I get back are

      25036.1
      25036.1 -> 46658.1 [label="F,539,0,0,2595,3133,0,538"]
      25037.1
      25037.1 -> 25036.1 [label="F,1999,0,0,0,1998,1135,3133"]
      25037.1 -> 46659.1 [label="F,1419,5,5,3249,4667,0,1418"]
      46658.1
      46659.1

      and

      14460.1
      14460.1 -> 14461.1 [label="F,1790,14,14,6822,8611,0,1791"]
      14460.1 -> 14461.1 [label="F,1792,14,14,6821,8611,0,1791"]
      14461.1
      14461.1 -> 14462.1 [label="F,569,0,0,4618,5186,0,568"]
      14462.1

      The first is a tree and the second is a cycle. Does the string of numbers in the label signify coordinates, scores, or something else?

      Finally, once I have this information about the relationship of the contigs what would you do in this case to post process? Would the next step then be to visually inspect the contigs indicated in the graph for overlaps and manually merge?

      Sorry for the barrage of questions, but you have some great tools here and I would like to be able to use them to their fullest capacity.

      Best regards,
      Jason
      Last edited by Nordic; 11-20-2014, 07:37 AM.

      Comment


      • #4
        Originally posted by Nordic View Post
        Hi Brian,

        First let me thank you for the suite of tools you have made available to the community. I have used them extensively for months now and find that many of the programs provide a fast and easily tune-able solution for the common problems encountered in genome assembly. If I wish to cite you for these tools, how do you prefer I do so?
        You're welcome! And, since (still) none of my tools are published, please cite them with a link to http://sourceforge.net/projects/bbmap/

        Ok, to the problem at hand. This assembly is in fact the result of an Allpaths assembly with the haploidify option enabled. The organism in question is an inbred sample of a wild butterfly population so heterozygosity is to be expected. I've already used the dedupe tool to absorb complete containments, of which there were about 20 in the ~60,000 contigs. You suggestion about using the clustering algorithm, however, takes me into unfamiliar territory.

        When I run dedupe2 on the contig fasta file with

        dedupe2.sh \
        in=ap.fasta \
        out=ap_uniq.fasta \
        am=t \
        ac=t \
        fo=t \
        c=t \
        mo=500 \
        e=15 \
        nam=4 \
        k=25 \
        dot=ap.dot

        I get back a fasta file which appears to be the subset of clustered contigs as ap_uniq.fasta containing about 2000 contigs and their relationship in the ap.dot file. Now I don't know much about how your clustering algorithm works, or how clustering algorithms work in general, but I was hoping you could shed some light on things for for.
        You can also use "pattern=%.fa" to write one file per cluster, or "rnc" to rename the contigs to indicate which cluster they belong to.
        Briefly, how does your clustering algorithm work?
        Does the program look for reverse-complement overlaps?
        How are the suffix/prefixes of contigs used and how does increasing the number stored affect the program? For example, when I increase the number of stored suffixes the number of containments increases as well.
        Whats the difference between dedupe and dedupe2 (or for that matter any of the other tools and tool2)?
        Dedupe allows up to 2 prefixes and suffixes; Dedupe2 allows an unlimited number, at the expense of more memory. The more memory does not matter for single organism assemblies, it's just an issue when clustering raw reads or metagenomes.

        BBDuk2 is like BBDuk, but with a different syntax and more power (it can do multiple kmer-processing steps at once). But I didn't remove BBDuk because it's easier to use and it was already integrated into some pipelines with the current syntax.

        As for how the program works - I have some neat slides explaining it that I will post later. But briefly, the kmers at the ends of each contig are indexed. All kmers in all contigs are examined to look for exact matches to indexed kmers. For every match, the two contigs are compared to see if there is a valid overlap (or containment) fulfilling the criteria.

        So, increasing the number of kmer prefixes and suffixes stored, and reducing the kmer length, both increase sensitivity marginally if you are allowing edits/substitutions. There is no downside other than increased runtime and memory usage. However, if you are not allowing any mismatches or edits, then 1 prefix and suffix is sufficient to find all overlaps (and if you only want containments, 1 prefix and no suffix is enough).

        And yes, the program does look for reverse-complements.


        Two examples of graphs I get back are

        25036.1
        25036.1 -> 46658.1 [label="F,539,0,0,2595,3133,0,538"]
        25037.1
        25037.1 -> 25036.1 [label="F,1999,0,0,0,1998,1135,3133"]
        25037.1 -> 46659.1 [label="F,1419,5,5,3249,4667,0,1418"]
        46658.1
        46659.1

        and

        14460.1
        14460.1 -> 14461.1 [label="F,1790,14,14,6822,8611,0,1791"]
        14460.1 -> 14461.1 [label="F,1792,14,14,6821,8611,0,1791"]
        14461.1
        14461.1 -> 14462.1 [label="F,569,0,0,4618,5186,0,568"]
        14462.1

        The first is a tree and the second is a cycle. Does the string of numbers in the label signify coordinates, scores, or something else?

        Finally, once I have this information about the relationship of the contigs what would you do in this case to post process? Would the next step then be to visually inspect the contigs indicated in the graph for overlaps and manually merge?

        Sorry for the barrage of questions, but you have some great tools here and I would like to be able to use them to their fullest capacity.

        Best regards,
        Jason
        Dedupe's clustering is much closer to an assembler than other clustering algorithms. It finds all overlaps that fit your criteria (max edit distance and minimum overlap length), and then defines any set of contigs that are connected through transitive edges as a cluster. Even if two contigs don't overlap, they will be in the same cluster if there is a path connecting them through overlaps of intermediate contigs. So, theoretically, all the reads from a deeply-sequenced organism should form one cluster per chromosome if you allow a min overlap of less than read length. In practice, duplicate regions will typically make it all collapse into one single huge cluster, or low coverage will fragment it into multiple clusters.

        There's a flag "ngn" (numbergraphnodes) that is on by default; to get the graph nodes displayed by contig names, you will need to set that to false. By node 1 is the first input contig, etc. The ".1" is there to handle clustering of paired reads; if the node was read 2, it would be ".2". So, in the dot file:

        Each node is a contig.

        25036.1
        means "this cluster contains contig number 25036".

        25036.1 -> 46658.1 [label="F,539,0,0,2595,3133,0,538"]
        means "There is an edge (overlap) between contig 25036 and 46658, of length 539bp, with 0 edits, in a forward orientation. The overlap goes from position 2595 to 3133 on contig 25036, and from 0 to 538 on contig 46658."

        Key:

        contigA -> contigB [label="orientation,length,substitutions,edits,startA,stopA,startB,stopB"]

        Currently, when you allow an edit distance, the number of substitutions is set to the number of edits, so you can ignore that number. The orientation is one of:
        F (forward): Both contigs are in the same orientation.
        FRC (forward-reverse complement): Contig A overlaps with the reverse-complement of B (so the start coordinates of B will less than its stop coordinate)
        R (reverse): Both contigs are in the same orientation, but their coordinates are reversed because the overlap was detected using a kmer on the right end.
        RRC (reverse-reverse complement): A overlaps with the reverse-complement of B, and their coordinates are both reversed.

        Basically, F and R are identical except the coordinates are reversed, as are FRC and RRC; the second ones just exists to indicate how the overlap was found, but are not informative about the overlap itself.

        In this case:
        14460.1 -> 14461.1 [label="F,1790,14,14,6822,8611,0,1791"]
        14460.1 -> 14461.1 [label="F,1792,14,14,6821,8611,0,1791"]

        ...the same overlap was found twice, but got labelled differently because there is a little bit of ambiguity concerning overlap length and edit distance when indels are at the very end of the contig. You can remove these by adding the "pc" (process clusters) and "fmj" (fix multi-joins) flags; they are usually not helpful. After removal it's no longer a cycle.

        The easiest clusters to merge would be pairs of A -> B with no other overlaps. As long as there are no regions where more than 2 things overlap, chains of A -> B -> C are also easy. But it rapidly gets complicated when there are real cycles or trees, which is why I don't have a mode in dedupe for merging overlaps. I could add a mode for merging the trivial ones, though, I suppose.

        There is another tool, Minimus, that will attempt to automatically merge overlapping segments. I kind of worry about doing that; sometimes it introduces misassemblies. But you can try it.

        If you want to merge manually according to the graph, visualizing it with graphviz might be helpful - it can render dot files as png or pdf images. But I can't really give you good advice on how to proceed - I've never done it If it was straightforward I already would have automated it.

        Comment


        • #5
          Originally posted by Brian Bushnell View Post
          You're welcome! And, since (still) none of my tools are published, please cite them with a link to http://sourceforge.net/projects/bbmap/
          You got it.

          ...which is why I don't have a mode in dedupe for merging overlaps. I could add a mode for merging the trivial ones, though, I suppose.
          Yea, I saw that commented out in the script. Here's hoping!

          Thank you for the very clear explanation. Fortunately most of the overlaps seem to be 1 contig -> 1 contig, so while it may take a while I think this could be a fruitful approach. Keep up the good work!

          Best,
          Jason

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          9 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          49 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X