Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post bioinf Bioinformatics 12 02-06-2013 07:03 AM matador3000 De novo discovery 0 12-17-2011 07:31 AM DMCH Bioinformatics 1 11-30-2011 04:21 AM ewalt98 Bioinformatics 2 04-01-2011 01:21 PM Zimbobo Bioinformatics 0 04-15-2010 01:36 PM

 12-26-2010, 02:42 PM #1 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 Velvet assembler Hi guys. I'm trying to understand the paper "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs" http://genome.cshlp.org/content/18/5/821.long As a student of the Uni I'll be preparing slides for presenting the afore-mentioned paper in the class. I'd very appreciate if you could answer few questions if you have any knowledge in velvet assembler and de Bruijn graphs as I just simply can't understand the main "Construction" part. I also couldn't understand it from the .ppt presentation file on the web-site of the tool. 1) They wrote "the reads are first hashed according to a predefined k-mer length." What did they mean by those words? Did they mean that for each k-mer of that read the hash table contains a record where you store the id of the first read having this k-mer and the position of that k-mer in the read? So whenever the k-mer is encountered for the second time you just do nothing and leave the hash table as it is, since it already contains that k-mer? 2) What did they mean by the words "the ordered set of original k-mers of that read is cut each time an overlap with another read begins or ends"? Did they mean the total overlap of the k-mer with k first or last characters of some other read? What is meant by "cutting" the set? Could you please show an example? 3) When we construct the nodes of the graph, how is determining of the arcs between them being done? Thanks in advance and happy holidays!
12-27-2010, 04:08 AM   #2
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Hi bioinf,
Quote:
 1) They wrote "the reads are first hashed according to a predefined k-mer length." What did they mean by those words? Did they mean that for each k-mer of that read the hash table contains a record where you store the id of the first read having this k-mer and the position of that k-mer in the read? So whenever the k-mer is encountered for the second time you just do nothing and leave the hash table as it is, since it already contains that k-mer?

This simply means that they cut the sequence into k-mers, small k-size parts of the original sequence. For example if k=3, a sequence ACGTGCT will be cut as;

ACGTGCT
ACG
.CGT
..GTG
...TGC
....GCT

The same is done for the reverse complement of the read. If the k-mer is already present, the number of occurences of this k-mer is added by 1.

Quote:
 2) What did they mean by the words "the ordered set of original k-mers of that read is cut each time an overlap with another read begins or ends"? Did they mean the total overlap of the k-mer with k first or last characters of some other read? What is meant by "cutting" the set? Could you please show an example?
I'm not really sure, but what i think is that if there is a k-1 overlap with another read(in this example thus 3-1 = 2), the reads are cut. in the most ideal case this would be;

ACGTGCT
CTAGAAT

overlap;
ACGTGCT
.....CTAGAAT

However, the following could also occur;

ACGTGCT
GTGGGGG

overlap;
ACGTGCT
..GTGGGGG

Here, the node is cut at "GT". See the example at the poster;
http://www.ebi.ac.uk/~zerbino/velvet/velvet_poster.pdf

Quote:
 3) When we construct the nodes of the graph, how is determining of the arcs between them being done?
If there is a k-1 overlap (as in 2)), an arc is drawn between the two reads. So between read1 and read2 an arc is drawn. Same for read1 and read3

Also have a look at Daniel Zerbino's dissertation at;
http://www.ebi.ac.uk/training/ftp/Ph...el_Zerbino.pdf

Hope it helped,
Boetsie

12-27-2010, 05:32 AM   #3
bioinf
Member

Location: Germany

Join Date: Nov 2010
Posts: 25

Quote:
 Here, the node is cut at "GT". See the example at the poster;
This is one of the major things I don't get. Say if we have two reads, then why is it allowed to check whether the inner part of one read can overlap with any(inner/outer) part of the second?
Well, say we take k = 3 for such reads:

The paper has the following words:
Quote:
 A second database is created with the opposite information. It records, for each read, which of its original k-mers are overlapped by subsequent reads. The ordered set of original k-mers is cut each time an overlap with another read begins or ends. For each uninterrupted sequence of original k-mers a node is created.
That's really ambiguously described. Does it mean that I take a read, look at its k-mers from top to down and throw away the part of the list where the overlapping occurs and for the rest parts I make a graph nodes? Or is it meant that the cut-out part should also be a node?
Quote:
 If there is a k-1 overlap (as in 2)), an arc is drawn between the two reads.
How is it possible to find the supersequences from a graph if there is a node with two outgoing arcs? (like a tree with branches)

Last edited by bioinf; 12-27-2010 at 05:34 AM.

 12-27-2010, 10:30 AM #4 Zigster (Jeremy Leipzig)   Location: Philadelphia, PA Join Date: May 2009 Posts: 116 Reads are not the actors in a Velvet assembly, only kmers, so any mention of k-1 always refers to kmers (two kmers get an edge if they overlap by k-1). So forget about reads, at least until the scaffolding part, where paired-ends are leveraged. Once you clip the tips and compress the bubbles then traversing the graph is straightforward, but path ambiguities often do result in fragmented assemblies. __________________ -- Jeremy Leipzig Bioinformatics Programmer -- My blog Twitter
12-27-2010, 11:23 AM   #5
bioinf
Member

Location: Germany

Join Date: Nov 2010
Posts: 25

Thank you. Ok, its getting much clearer now. So do I understand you correctly?
- form all possible k-mers from the set of reads
- if k-1 end characters of some k-mer X intersect with k-1 beginning characters of k-mer Y, then put an arc from X-->Y (similar for the X<--Y case)
- simplify the graph by grouping the nodes where possible
- remove tips, bubbles
- walk the resulting graph from the beginning to the end to acquire the supersequence

Quote:
 but path ambiguities often do result in fragmented assemblies
Could you please explain what you've meant by the possibility of path-ambiguities?

Do I understand it correctly that in the end I should not get a tree-like resulting graph, since otherwise I have to travel several times to visit all nodes and I'll get several sequences instead of one final sequence?

Also why does the paper mention that we have to have a hash table which stores for each possible k-mer a record with the information about - in which read it was seen for the first time and at which position?

Last edited by bioinf; 12-27-2010 at 11:30 AM.

12-27-2010, 11:39 AM   #6
Zigster
(Jeremy Leipzig)

Join Date: May 2009
Posts: 116

the repeat problem is the primary reason for fragmented assemblies

the ambiguities here cannot be resolved so the assembly will likely be four contigs with repeat sequence attached

read tracking is needed to leverage paired end information
Attached Images
 Screen shot 2010-12-27 at 3.29.09 PM.png (18.9 KB, 531 views)
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog

12-27-2010, 11:52 AM   #7
bioinf
Member

Location: Germany

Join Date: Nov 2010
Posts: 25

Thx very much!

Why did you mention 4 possible assembles? I see only two. The one on the top of the picture and the same, but if you switch light green with the dark green. But I see, so the idea of the ambiguity is that depending on how we traverse we might get different contigs.
Quote:
 read tracking is needed to leverage paired end information

12-27-2010, 12:06 PM   #8
Zigster
(Jeremy Leipzig)

Join Date: May 2009
Posts: 116

The diagram on the top is the actual order (known only to the organism being sequenced)

The graph below it would represent what an assembler could piece from short reads.
Four contigs would be produced by an assembler:
Tan-Blue
Blue-DarkGreen-Blue
Blue-LightGreen-Blue
Blue-BurntOrange

With paired end that span repeats we can deduce the correct contig order:

Attached Images
 Screen shot 2010-12-27 at 3.47.44 PM.png (12.8 KB, 505 views)
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog

 12-27-2010, 01:28 PM #9 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 Thank you very much for the brief answers! I'm not closing the topic yet in case I have questions about the further stages like error-correction stuff.
 12-27-2010, 01:32 PM #10 boetsie Senior Member   Location: NL, Leiden Join Date: Feb 2010 Posts: 245 Have a look at this site for some basic information about de novo assembly; http://www.cbcb.umd.edu/research/assembly_primer.shtml
 12-28-2010, 03:19 AM #11 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 Thx bötsie. One more question about the de Bruijn graphs. Is the ability to reduce the memory consumption by just having k-mers instead of reads as nodes - the only advantage against overlap graphs?
12-28-2010, 05:24 AM   #12
boetsie
Senior Member

Location: NL, Leiden

Join Date: Feb 2010
Posts: 245

Hi,

Well, it's not the only advantage, but indeed the biggest. Other advantages;

- De Bruijn graph assemblers are much faster.
- easier to correct for errors (which come alot with short read sequences) within the sequences (see the poster).
- repeats are immediately recognizable while in an overlap graph they are more difficult to identify.

Boetsie

Quote:
 Originally Posted by bioinf Thx bötsie. One more question about the de Bruijn graphs. Is the ability to reduce the memory consumption by just having k-mers instead of reads as nodes - the only advantage against overlap graphs?

 01-05-2011, 08:39 AM #13 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 Hi again guys. I'm almost done with the presentation. I don't get something about the complementary strand assembly. Say I have a k-mer AATGC in the complementary strand read and the 100% same looking k-mer AATGC in the primary read. How does the algorithm see the difference? Whenever you analyze the read how do you know whether it is a primary or complementary strand? Last edited by bioinf; 01-05-2011 at 10:09 AM.
 01-05-2011, 09:30 AM #14 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 Perhaps someone could explain how the nodes for complementary strand are constructed and why? It is clear that there will be some reads from the complementary strand in the set, but what if there are less complementary reads than the primary ones? We still have to attach that complementary node to the primary one although that complementary node might not exist in the set of reads. How is that handled? Last edited by bioinf; 01-05-2011 at 09:33 AM.
 01-05-2011, 10:16 AM #15 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 A... I see. So they complement each other. There might be a primary read missing whereas its complementary read exists, so the nodes will get connected via arc and thus node overlapping information will not be lost. Last edited by bioinf; 01-05-2011 at 10:35 AM.
 01-05-2011, 01:12 PM #16 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 Can someone please explain why we need to have the HashMap and store there the id of the first read where that k-mer was encountered? Is it not just sufficient to walk the graph and write down the k-mers to build up the original sequence? What is this HashMap else used for?
 01-06-2011, 07:45 AM #17 Zigster (Jeremy Leipzig)   Location: Philadelphia, PA Join Date: May 2009 Posts: 116 I don't know where you got the word "HashMap" from - I think that is Java. Any association between reads and their kmers is for the purposes of paired-end resolution and read usage statistics. Are you going to put this presentation up somewhere? __________________ -- Jeremy Leipzig Bioinformatics Programmer -- My blog Twitter
01-06-2011, 09:20 AM   #18
bioinf
Member

Location: Germany

Join Date: Nov 2010
Posts: 25

I'm no biologist I'm a programmer. Hash map is not related to any specific language(Java, C++ etc), it is a data structure for a O(1) constant time access to an element (at least in the best case). The article describes that we keep the info about the first occurence of the k-mer in the hashmap. What I don't get is why we would need this information for a traceback? I can assemble the sequence by just following the arcs and writing down the k-mers. Why would I need an information about the reads which are represented by those k-mers after the graph is already constructed. Is it meant that the hashmap is needed for the construction itself and only? (question to all who might know)
Quote:
 Are you going to put this presentation up somewhere?
Of course. This is my seminar presentation at the Uni.
Quote:
 Any association between reads and their kmers is for the purposes of paired-end resolution and read usage statistics.
It can't be used for the usage statistics, since the hashmap contains the information about only the first read where certain k-mer is found. There might be several reads with the same k-mer, but at our disposal is the information of the location of only one such read.

Intuitively I think that it is done to link up all the reads which have such k-mer. Read set is analyzed one-by-one and each k-mer is added to the hash map in form of the id of the first read where it was found. Any subsequent requests in another reads for the storage of the same k-mer are denied. Afterwards when all information is stored we walk all reads again. Each time k-mer of some read is retrieved it is being looked up in the hashmap and there we find the id of the read where it was found for the first time so we can link these reads. The same is done further. We get such one-to-many correspondance. That's what I assume from the paper since it is stated unclear in it but I can't present my assumptions on the slides.

Last edited by bioinf; 01-06-2011 at 09:57 AM.

 01-08-2011, 09:03 AM #19 bioinf Member   Location: Germany Join Date: Nov 2010 Posts: 25 If going back to the biological details. Could you please explain how repeats in the DNA lead to the gaps between contigs? Yes they are overlapped although they shouldn't be, but how does it lead to "gaps"? Since velvet cuts all tips longer than 2k, then whenever a repeat with a big portion of sequence after it is overlapped to the k-mer which was found earlier such "tip" will be discarded. Last edited by bioinf; 01-08-2011 at 10:31 AM.
 01-28-2011, 05:39 AM #20 parit Junior Member   Location: Basel Switzerland Join Date: Jan 2011 Posts: 2 @bioinf: I am not sure I fully get your question but here are my two cents. If there is a repeat then either there will be a node reported with a coverage higher than the expected coverage or there will be a loop. In the later case, assembler, while making contigs, dont know the frequency of the repeat and hence cannot connect the contigs to the right and left of the repeat and therefore report them as 2 different contigs with a gap in between... As far as the tips are concerned, I couldnt connect "tips" with "repeats" as I thought tips occur when there is a sequencing error at the end of the read. It has nothing to do with repeat. Please do correct me if I am wrong as I am also trying to understand the logic of velvet. Can you also post your presentation or email me? - Parit