
**nickloman** · 12-19-2011, 03:33 AM

It's a great question and I am not aware that there is a definitive answer.

I think it's fair to say that OLC & de Bruijn graphs are different algorithms and might be expected to produce different assemblies. Certainly that's easily tested.

Does that mean one produces more or less accurate assemblies? That would surely depend on the characteristics of the dataset and your definition of accuracy.

I think it would be fair to modify your statement to:

"Differences in the assembly between heterogeneous 454-transcriptome reads between de Bruijn and OLC based methods can be expected due to the different way these two algorithms work."

However, I would say that de Bruijn is not a typical algorithm for handling 454 read data, which tends to consist of long reads at shallow coverage with indel errors, none of which suits that algorithm well.
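To make the indel point concrete, here is a minimal Python sketch. It is not from the thread: the sequence, the read, and the helper name `kmers` are made up for illustration. A de Bruijn graph is built from exact k-mers, so a single homopolymer insertion (the typical 454 error) yields k-mers that do not occur in the true sequence, and at shallow coverage there are few other reads to outvote them.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical "true" transcript fragment (illustrative only).
reference = "ACGTTGCATCGGATTACAGCTAGGCTTAACG"

# Hypothetical 454-style read: one extra T inserted into the "TT" homopolymer.
read = reference[:4] + "T" + reference[4:]

k = 5
# k-mers present in the read but absent from the true sequence.
novel = kmers(read, k) - kmers(reference, k)

print(sorted(novel))
```

Every k-mer in `novel` would become a spurious node or edge in a de Bruijn graph built from this read. An overlap-based approach can instead align the read to its neighbours while tolerating the gap, which is one reason OLC has been the more common choice for 454 data.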

**tonybolger** · 12-20-2011, 03:14 AM

Originally posted by thomasvangurp:

To this the reviewer responded by saying that "both paradigms have the same mathematical characteristics and there's not inherent advantage to either".

DBG and OLC are inherently very different approaches, with OLC being the more flexible but heavier: OLC works better with Sanger/454 data, but gets swamped by larger next-gen datasets.

You can only really consider them equivalent if you refine the result of each into a 'string graph', but I don't know of any assembler which actually does things this way. Most of them work with the DBG or OLC graphs and extract contigs directly from them. Both the heuristics inherent in DBG / OLC construction and the heuristics used to interpret the graphs mean that the results from such assemblers will be massively different.

**Zam** · 12-22-2011, 09:05 AM

Hi there

I think there are various issues here

1. First, something that is not as pedantic as it first sounds. De Bruijn and overlap graphs are not algorithms. They are data structures. To give a broad analogy, they are different ways of filing and summarising your data, but they do not say much about what you do with the data once it is stored. The reviewer's statement that they have the same mathematical characteristics is reasonable, although there is a lot of devil in the detail (see below). In principle one might apply the same algorithm to both data structures.

2. In the special case of infinite coverage, if you choose the right parameters (de Bruijn kmer = overlap = read-length - 1) then the overlap and de Bruijn graphs are the same. Because of this, people tend to think of them as equivalent. However, with finite coverage, it is unknown whether the two formulations are equivalent. If you need a reference for that, I think the end of Richard Durbin and Jared Simpson's FM index paper will do. For a given depth of coverage, a given genome (which implies something about repeat structure), and a given read length, it's not clear that you necessarily make the same choices of kmer/overlap parameter for the two approaches, and therefore it's not clear you get equivalent results. HOWEVER....

3. The real issue is one of experimental design, cost, and data properties. Overlap graphs do not scale so well (in general) with volumes of data, and so tend to be used with longer reads and lower coverage. That said, look up the SGA paper (again Simpson and Durbin), recently out in Genome Research. 454 data is expensive but the reads are longer, so you sequence to lower depth (which means it is harder to deal with errors). De Bruijn graphs should scale better with coverage, but then your choice of kmer requires a trade-off between repeat resolution and coverage. In short, de Bruijn assemblers and overlap assemblers tend to be used on different TYPES of data with implicitly different experimental design (read length and coverage). This implies a difference in assembly properties.

4. Generally, all assembler papers have an introduction where they describe a general data structure and some algorithms, and then deep in the details they have a bunch of heuristics. These will also have a big effect on the differences between results of specific different assemblers.

5. Transcriptome assembly is hard, and I would expect the major differences in assembly properties not to be due to the data structure, but in how much work has gone into the actual assembler itself.

So returning to your original question, I'm not sure I understand the sentence as you typed it: "..between heterogeneous 454-transcriptome reads between de Bruijn..etc". Do you mean that, given a bunch of 454 transcriptome reads, you'd expect de Bruijn and overlap assemblers to perform differently? I think it's not a very helpful thing to think about. Different assembly tools will perform differently depending on how much work has gone in. Depending on your depth of coverage, whether the specific assembler can cope with the 454 error model, and whether it is implemented/tested/designed well, you'll get better or worse results.

I'd recommend Mihai Pop's excellent 2009 article (De novo assembly reborn) as a good introduction to the various issues.

best

Zam

**Zam** · 12-22-2011, 09:15 AM

Sorry - it was obvious from the subject of your post what you were asking - I should talk less and think/read more

**nickloman** · 01-04-2012, 02:21 AM

That's a good summary by Zam, and he/she is quite right to point out that de Bruijn graphs are data structures, not algorithms. That is key to understanding them.

I just came across this article via Simon Cockell, Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph:


http://www.citeulike.org/user/sjcockell/article/10150999?utm_source=dlvr.it&utm_medium=twitter

OLC vs de Bruijn Performance on heterogeneous 454-reads