Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Velvet assembler

    Hi guys. I'm trying to understand the paper "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs" http://genome.cshlp.org/content/18/5/821.long

    As a student of the Uni I'll be preparing slides for presenting the afore-mentioned paper in the class. I'd very appreciate if you could answer few questions if you have any knowledge in velvet assembler and de Bruijn graphs as I just simply can't understand the main "Construction" part. I also couldn't understand it from the .ppt presentation file on the web-site of the tool.

    1) They wrote "the reads are first hashed according to a predefined k-mer length." What did they mean by those words? Did they mean that for each k-mer of that read the hash table contains a record where you store the id of the first read having this k-mer and the position of that k-mer in the read? So whenever the k-mer is encountered for the second time you just do nothing and leave the hash table as it is, since it already contains that k-mer?

    2) What did they mean by the words "the ordered set of original k-mers of that read is cut each time an overlap with another read begins or ends"? Did they mean the total overlap of the k-mer with k first or last characters of some other read? What is meant by "cutting" the set? Could you please show an example?

    3) When we construct the nodes of the graph, how is determining of the arcs between them being done?

    Thanks in advance and happy holidays!

  • #2
    Hi bioinf,
    1) They wrote "the reads are first hashed according to a predefined k-mer length." What did they mean by those words? Did they mean that for each k-mer of that read the hash table contains a record where you store the id of the first read having this k-mer and the position of that k-mer in the read? So whenever the k-mer is encountered for the second time you just do nothing and leave the hash table as it is, since it already contains that k-mer?

    This simply means that they cut the sequence into k-mers, small k-size parts of the original sequence. For example if k=3, a sequence ACGTGCT will be cut as;


    ACGTGCT
    ACG
    .CGT
    ..GTG
    ...TGC
    ....GCT


    The same is done for the reverse complement of the read. If the k-mer is already present, the number of occurences of this k-mer is added by 1.

    2) What did they mean by the words "the ordered set of original k-mers of that read is cut each time an overlap with another read begins or ends"? Did they mean the total overlap of the k-mer with k first or last characters of some other read? What is meant by "cutting" the set? Could you please show an example?
    I'm not really sure, but what i think is that if there is a k-1 overlap with another read(in this example thus 3-1 = 2), the reads are cut. in the most ideal case this would be;

    read1;
    ACGTGCT
    read2:
    CTAGAAT

    overlap;
    ACGTGCT
    .....CTAGAAT


    However, the following could also occur;

    read1;
    ACGTGCT
    read2:
    GTGGGGG

    overlap;
    ACGTGCT
    ..GTGGGGG


    Here, the node is cut at "GT". See the example at the poster;


    3) When we construct the nodes of the graph, how is determining of the arcs between them being done?
    If there is a k-1 overlap (as in 2)), an arc is drawn between the two reads. So between read1 and read2 an arc is drawn. Same for read1 and read3

    Also have a look at Daniel Zerbino's dissertation at;
    We train scientists at all levels to get the most out of publicly available biological data.


    Hope it helped,
    Boetsie

    Comment


    • #3
      Hi, thank you for the answer and the link.
      Here, the node is cut at "GT". See the example at the poster;
      This is one of the major things I don't get. Say if we have two reads, then why is it allowed to check whether the inner part of one read can overlap with any(inner/outer) part of the second?
      Well, say we take k = 3 for such reads:

      Read_1: DDCCTGTAC
      Read_2: NMCCTKLASSA

      The paper has the following words:
      A second database is created with the opposite information. It records, for each read, which of its original k-mers are overlapped by subsequent reads. The ordered set of original k-mers is cut each time an overlap with another read begins or ends. For each uninterrupted sequence of original k-mers a node is created.
      That's really ambiguously described. Does it mean that I take a read, look at its k-mers from top to down and throw away the part of the list where the overlapping occurs and for the rest parts I make a graph nodes? Or is it meant that the cut-out part should also be a node?
      If there is a k-1 overlap (as in 2)), an arc is drawn between the two reads.
      How is it possible to find the supersequences from a graph if there is a node with two outgoing arcs? (like a tree with branches)
      Last edited by bioinf; 12-27-2010, 06:34 AM.

      Comment


      • #4
        Reads are not the actors in a Velvet assembly, only kmers, so any mention of k-1 always refers to kmers (two kmers get an edge if they overlap by k-1).

        So forget about reads, at least until the scaffolding part, where paired-ends are leveraged.

        Once you clip the tips and compress the bubbles then traversing the graph is straightforward, but path ambiguities often do result in fragmented assemblies.
        --
        Jeremy Leipzig
        Bioinformatics Programmer
        --
        My blog
        Twitter

        Comment


        • #5
          Thank you. Ok, its getting much clearer now. So do I understand you correctly?
          - form all possible k-mers from the set of reads
          - if k-1 end characters of some k-mer X intersect with k-1 beginning characters of k-mer Y, then put an arc from X-->Y (similar for the X<--Y case)
          - simplify the graph by grouping the nodes where possible
          - remove tips, bubbles
          - walk the resulting graph from the beginning to the end to acquire the supersequence

          but path ambiguities often do result in fragmented assemblies
          Could you please explain what you've meant by the possibility of path-ambiguities?

          Do I understand it correctly that in the end I should not get a tree-like resulting graph, since otherwise I have to travel several times to visit all nodes and I'll get several sequences instead of one final sequence?

          Also why does the paper mention that we have to have a hash table which stores for each possible k-mer a record with the information about - in which read it was seen for the first time and at which position?
          Last edited by bioinf; 12-27-2010, 12:30 PM.

          Comment


          • #6
            the repeat problem is the primary reason for fragmented assemblies



            the ambiguities here cannot be resolved so the assembly will likely be four contigs with repeat sequence attached

            read tracking is needed to leverage paired end information
            Attached Files
            --
            Jeremy Leipzig
            Bioinformatics Programmer
            --
            My blog
            Twitter

            Comment


            • #7
              Thx very much!

              Why did you mention 4 possible assembles? I see only two. The one on the top of the picture and the same, but if you switch light green with the dark green. But I see, so the idea of the ambiguity is that depending on how we traverse we might get different contigs.
              read tracking is needed to leverage paired end information
              Where could I read about how this is done?

              Comment


              • #8
                The diagram on the top is the actual order (known only to the organism being sequenced)

                The graph below it would represent what an assembler could piece from short reads.
                Four contigs would be produced by an assembler:
                Tan-Blue
                Blue-DarkGreen-Blue
                Blue-LightGreen-Blue
                Blue-BurntOrange

                With paired end that span repeats we can deduce the correct contig order:


                google "paired-end" "mate pairs" "scaffolding" "contig order" to learn more
                Attached Files
                --
                Jeremy Leipzig
                Bioinformatics Programmer
                --
                My blog
                Twitter

                Comment


                • #9
                  Thank you very much for the brief answers! I'm not closing the topic yet in case I have questions about the further stages like error-correction stuff.

                  Comment


                  • #10
                    Have a look at this site for some basic information about de novo assembly;

                    Comment


                    • #11
                      Thx bötsie. One more question about the de Bruijn graphs.
                      Is the ability to reduce the memory consumption by just having k-mers instead of reads as nodes - the only advantage against overlap graphs?

                      Comment


                      • #12
                        Hi,

                        Well, it's not the only advantage, but indeed the biggest. Other advantages;

                        - De Bruijn graph assemblers are much faster.
                        - easier to correct for errors (which come alot with short read sequences) within the sequences (see the poster).
                        - repeats are immediately recognizable while in an overlap graph they are more difficult to identify.

                        Boetsie

                        Originally posted by bioinf View Post
                        Thx bötsie. One more question about the de Bruijn graphs.
                        Is the ability to reduce the memory consumption by just having k-mers instead of reads as nodes - the only advantage against overlap graphs?

                        Comment


                        • #13
                          Hi again guys. I'm almost done with the presentation. I don't get something about the complementary strand assembly. Say I have a k-mer AATGC in the complementary strand read and the 100% same looking k-mer AATGC in the primary read. How does the algorithm see the difference? Whenever you analyze the read how do you know whether it is a primary or complementary strand?
                          Last edited by bioinf; 01-05-2011, 11:09 AM.

                          Comment


                          • #14
                            Perhaps someone could explain how the nodes for complementary strand are constructed and why? It is clear that there will be some reads from the complementary strand in the set, but what if there are less complementary reads than the primary ones? We still have to attach that complementary node to the primary one although that complementary node might not exist in the set of reads. How is that handled?
                            Last edited by bioinf; 01-05-2011, 10:33 AM.

                            Comment


                            • #15
                              A... I see. So they complement each other. There might be a primary read missing whereas its complementary read exists, so the nodes will get connected via arc and thus node overlapping information will not be lost.
                              Last edited by bioinf; 01-05-2011, 11:35 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X