  • OLC vs de Bruijn Performance on heterogeneous 454-reads

    Dear forum-members,

    I've had a puzzling reply by a reviewer on the claim that:

    "Differences in the accuracy of assembly between heterogeneous 454- transcriptome reads between de bruijn and OLC based methods can be expected due to the different way these two algorithms work."

    To this the reviewer responded by saying that "both paradigms have the same mathematical characteristics and there's not inherent advantage to either".

    I'm a biologist and not a computer scientist, so pardon me if I'm being ignorant here. Could someone refer me to a good tutorial on this, or explain in layman's terms why they think the reviewer's response is either right or wrong?

    Thanks!

  • #2
    It's a great question and I am not aware that there is a definitive answer.

    I think it's fair to say that OLC & de Bruijn graphs are different algorithms and might be expected to produce different assemblies. Certainly that's easily tested.

    Does that mean one produces more or less accurate assemblies? That would surely depend on the characteristics of the dataset and your definition of accuracy.

    I think it would be fair to modify your statement to:

    "Differences in the assembly between heterogeneous 454- transcriptome reads between de bruijn and OLC based methods can be expected due to the different way these two algorithms work."

    However, I would say that de Bruijn is not a typical approach for 454 data, which tends to have long reads, shallow coverage, and indel errors, none of which suit it well.
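
    To illustrate why indels hurt a k-mer based approach, here is a minimal sketch (the reference, the read, and the helper names are all made up for this example). A single deleted base in a read can create up to k-1 k-mers that do not exist in the true sequence, and at shallow coverage there are few correct copies to outvote them.

```python
def kmers(seq, k):
    """Return every k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def novel_kmers(reference, read, k):
    """k-mers of the read that do not occur anywhere in the reference."""
    ref_set = set(kmers(reference, k))
    return [km for km in kmers(read, k) if km not in ref_set]


# Toy reference and a read with one base deleted from the "GG" run,
# mimicking a 454-style homopolymer undercall (both sequences are invented).
reference = "ACGTTGCATGGACCTAGT"
read = reference[:9] + reference[10:]

for k in (3, 5, 7):
    bad = novel_kmers(reference, read, k)
    # Larger k means more k-mers span the error, so more spurious graph nodes.
    print(k, len(bad), bad)
```

    An overlap-based assembler can instead align the whole read with a gap and still use the rest of it, which is one reason it tends to cope better with this error model.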


    • #3
      Originally posted by thomasvangurp
      To this the reviewer responded by saying that "both paradigms have the same mathematical characteristics and there's not inherent advantage to either".


      DBG and OLC are inherently very different approaches, with OLC being the more flexible but heavier of the two: OLC works better with Sanger/454 reads, but gets swamped by larger next-gen datasets.

      You can only really consider them equivalent if you refine the result of each into a 'string graph', but I don't know of any assembler which actually does things this way; most of them work with the DBG or OLC graphs and extract contigs directly from them. Both the heuristics inherent in DBG/OLC construction and the heuristics used to interpret the graphs mean that the results from such assemblers will be massively different.
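
      As a rough illustration of what "refining into a string graph" involves on the OLC side, the sketch below builds a toy overlap graph and then removes transitive edges. This is a naive version under simplifying assumptions (error-free reads, one strand, no contained reads, no repeats); the function names and the reads are invented, and real string graph construction is considerably more involved.

```python
from itertools import permutations


def overlap_len(a, b, min_olap):
    """Longest suffix of a equal to a prefix of b; 0 if shorter than min_olap (min_olap >= 1)."""
    for n in range(min(len(a), len(b)) - 1, min_olap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_olap):
    """Directed edges (a, b) -> overlap length for every ordered pair of distinct reads."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap_len(a, b, min_olap)
        if n:
            edges[(a, b)] = n
    return edges


def transitive_reduction(edges):
    """Drop edge a->c whenever some read b gives both a->b and b->c (naive check)."""
    nodes = {x for pair in edges for x in pair}
    return {
        (a, c): n
        for (a, c), n in edges.items()
        if not any((a, b) in edges and (b, c) in edges for b in nodes)
    }


# Four made-up 8 bp reads tiling a 14 bp sequence at 2 bp steps.
genome = "ACGTTGCATGGACC"
reads = [genome[i:i + 8] for i in range(0, 7, 2)]

edges = overlap_graph(reads, min_olap=3)
reduced = transitive_reduction(edges)
print(len(edges), "overlap edges ->", len(reduced), "after reduction")
```

      After reduction only the chain of adjacent reads remains, which is the kind of simplified graph both approaches could in principle be brought to.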


      • #4
        Hi there

        I think there are various issues here

        1. First, something that is not as pedantic as it first sounds. De Bruijn and overlap graphs are not algorithms. They are data structures. To give a broad analogy, they are different ways of filing and summarising your data, but say little about what you do with the data once it is stored. The reviewer's statement that they have the same mathematical characteristics is reasonable, although there is a lot of devil in the detail (see below). In principle one might apply the same algorithm to both data structures.

        2. In the special case of infinite coverage, if you choose the right parameters (de Bruijn kmer = overlap = read-length-1), then the overlap and de Bruijn graphs are the same. Because of this, people tend to think of them as equivalent. However, with finite coverage, it is unknown whether the two formulations are equivalent. If you need a reference for that, I think the end of Richard Durbin and Jared Simpson's FM index paper will do. For a given depth of coverage, a given genome (which implies something about repeat structure), and a given read length, it's not clear that you necessarily make the same choices of kmer/overlap parameter for the two approaches, and therefore it's not clear you get equivalent results. HOWEVER....

        3. The real issue is one of experimental design, cost, and data properties. Overlap graphs do not scale so well (in general) with volumes of data, and so tend to be used with longer reads and lower coverage. That said, look up the SGA paper (again Simpson and Durbin), recently out in Genome Research. 454 data is expensive but the reads are longer, so you sequence to lower depth (which means it is harder to deal with errors). De Bruijn graphs should scale better with coverage, but then your choice of kmer requires a trade-off between repeat resolution and coverage. In short - de Bruijn assemblers and overlap assemblers tend to be used on different TYPES of data with implicitly different experimental design (read length and coverage). This implies a difference in assembly properties.


        4. Generally, all assembler papers have an introduction where they describe a general data structure and some algorithms, and then deep in the details they have a bunch of heuristics. These will also have a big effect on the differences between results of specific different assemblers.

        5. Transcriptome assembly is hard, and I would expect the major differences in assembly properties not to be due to the data structure, but in how much work has gone into the actual assembler itself.
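
        Points 1 and 2 can be made concrete with a small sketch. It assumes an idealised "infinite coverage" case in which every error-free substring of read length L is present, and uses invented helper names and a made-up sequence. With node length = overlap = L-1, each read is a node in the overlap graph and a single edge in the de Bruijn graph, and two reads overlap exactly when their de Bruijn edges are consecutive. That is one way to read "the same" here; it is not a statement about any particular assembler.

```python
from collections import defaultdict


def all_reads(genome, read_len):
    """Idealised 'infinite coverage': every distinct error-free substring of length read_len."""
    return sorted({genome[i:i + read_len] for i in range(len(genome) - read_len + 1)})


def de_bruijn(reads, node_len):
    """Map each node (a string of node_len bases) to the set of edges leaving it.

    Each edge is labelled by the (node_len + 1)-mer that joins two nodes.
    """
    out = defaultdict(set)
    for r in reads:
        for i in range(len(r) - node_len):
            edge = r[i:i + node_len + 1]
            out[edge[:-1]].add(edge)
    return out


def overlap_edges(reads, olap):
    """Pairs (a, b) of distinct reads where the last olap bases of a start b."""
    return {(a, b) for a in reads for b in reads if a != b and a[-olap:] == b[:olap]}


def consecutive_dbg_edges(dbg):
    """Pairs (e1, e2) of distinct de Bruijn edges where e1 ends at the node e2 leaves."""
    pairs = set()
    for edges in dbg.values():
        for e1 in edges:
            for e2 in dbg.get(e1[1:], ()):
                if e1 != e2:
                    pairs.add((e1, e2))
    return pairs


genome = "ACGTTGCATGGACCTAGT"  # made-up sequence
read_len = 6
reads = all_reads(genome, read_len)

dbg = de_bruijn(reads, read_len - 1)      # same stored reads, filed as edges
olc = overlap_edges(reads, read_len - 1)  # same stored reads, filed as nodes
print(olc == consecutive_dbg_edges(dbg))  # True
```

        Once coverage is finite, or the node length and minimum overlap are chosen independently, this one-read-per-edge correspondence no longer holds, which is where the two formulations can start to diverge.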



        So returning to your original question, I'm not sure I understand the sentence as you typed it. "..between heterogeneous 454-transcriptome reads between de Bruijn..etc". Do you mean given a bunch of 454 transcriptome reads, you'd expect de Bruijn and overlap assemblers to perform differently? I think it's not a very helpful thing to think about. Different assembly tools will perform differently depending on how much work has gone in. Depending on your depth of coverage, whether the specific assembler can cope with the 454 error model, and whether it is implemented/tested/designed well, you'll get better or worse results.

        I'd recommend Mihai Pop's excellent 2009 article (Genome assembly reborn) as a good introduction to various issues

        best

        Zam


        • #5
          Sorry - it was obvious from the subject of your post what you were asking - I should talk less and think/read more


          • #6
            That's a good summary by Zam and he/she is quite right to point out that de Bruijn graphs are data structures not algorithms and that is key to understanding them.

            I just came across this article via Simon Cockell, Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph:
