  • Why use k-mers?

    In Velvet and Trans-ABySS (de novo assembly), we use the k-mer approach.

    Why is it better to "break" each contig into a range of k-mers instead of regular overlapping?
    Why is it more sensitive? SNPs?

    thanks in advance..

  • #2
    You mean the de Bruijn approach? Supposedly it resolves repeats better than the overlap method (OLC).



    • #3
      Yes, the de Bruijn approach.
      The structure of the graph (the hash table) handles the repeats.
      In a transcriptome there are not so many repeats..
      But why run it over a range of different k-mers? {20..49}



      • #4
        Compared to the overlap method it needs less RAM when you have a lot of "short" reads as input. An assembler based on de Bruijn graphs is made for next-gen sequencing output.

        edit: the higher the k-mer, the less RAM you need, because normally the de Bruijn graph will be smaller. With k-mer 49 the overlap between reads must be 49-1 = 48 bp! If you want more details about the algorithms, you could get Daniel Zerbino's PhD thesis; it is easy to read and helps a lot with understanding.
        Last edited by Thorondor; 03-07-2011, 12:34 AM.
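
        A minimal Python sketch of what "breaking a sequence into k-mers" means, using the 17 bp sequence from post #6 below (the function names are made up for illustration). Each adjacent pair of k-mers overlaps by exactly k-1 bases, which is why at k-mer 49 two reads must share 48 bp to be joined:

```python
def kmers(read, k):
    """Yield every k-mer (length-k substring) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def de_bruijn_edges(read, k):
    """Pairs of consecutive k-mers; each pair overlaps by exactly k-1 bases."""
    ks = list(kmers(read, k))
    return list(zip(ks, ks[1:]))

read = "AGTCAGTTTGGCCCTTG"          # the 17 bp example from post #6
print(list(kmers(read, 5)))         # 17 - 5 + 1 = 13 five-mers
for left, right in de_bruijn_edges(read, 5):
    assert left[1:] == right[:-1]   # the k-1 bp overlap described above
```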



        • #5
          In transcriptome assembly, a range of different k-mers is used to accommodate transcripts of different sizes.



          • #6
            I don't understand.
            If I have this contig:
            AGTCAGTTTGGCCCTTG
            assume this is the output of Solexa.

            Is it all from the same transcript?

            How does the k-mer accommodate the different sizes of this transcript?



            • #7
              In a transcriptome, the reads come from many DNA transcripts, which is why the assembler tries different k-mer sizes to assemble them correctly. In de novo genome assembly, on the other hand, the reads come from the whole genomic DNA (one big sequence).

              As for how exactly different k-mers accommodate transcripts, you might want to read the Oases paper, "Oases: De novo transcriptome assembler for very short reads".



              • #8
                If I have a contig of length 50 bp,
                how does breaking it into pieces of 19 bp with an 18 bp overlap help me know its transcript size?
                And so on for {19..49}.

                Correct me if I'm wrong..
                I am not sure I understood you correctly.
                Did you mean that maybe the size of the current sub-transcript is 19, and if I leave it at 50 bp, I will miss the 19 bp?
                Is that why I have to use k=19?



                • #9
                  Originally posted by Thorondor
                  Compared to the overlap method it needs less RAM when you have a lot of "short" reads as input. An assembler based on de Bruijn graphs is made for next-gen sequencing output.

                  edit: the higher the k-mer, the less RAM you need, because normally the de Bruijn graph will be smaller. With k-mer 49 the overlap between reads must be 49-1 = 48 bp! If you want more details about the algorithms, you could get Daniel Zerbino's PhD thesis; it is easy to read and helps a lot with understanding.
                  Assuming I have only 1 contig of length 50:
                  if I use k-mer 49, my de Bruijn graph will have size 1,
                  but if I use k-mer 19 it will be much bigger...

                  What did you mean when you said it becomes smaller?
                  (It becomes smaller only in the number of overlaps.. but in size it becomes bigger.)

                  thanks..



                  • #10
                    Well, not really. You have 1 READ of length 50! But you should think in HIGH numbers, and there it's different. ;-)

                    For k-mer 3 there are 4^3 possible nodes in the de Bruijn graph: AAA, AAG, AGG, GGG, GAG, GAC, GCC.....

                    For higher k-mers like 49 you have 4^49 possible nodes; normally you never reach that maximum for such high k-mers. So fewer nodes compared to k-mer 19, fewer overlaps, fewer junctions => smaller de Bruijn graph => easier to compute. The problem is that you will miss transcripts with low coverage, because their reads won't overlap by 48 bp.
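
                    A toy Python sketch of this trade-off, using an invented 20 bp repeat and k-mers just below vs. just above the repeat length (the same logic applies to 19 vs. 49 with longer repeats): the smaller k collapses the two repeat copies into a junction, while the larger k spans the repeat and yields a junction-free, smaller graph.

```python
from collections import defaultdict

# Toy "genome": a 20 bp repeat flanked by unique sequence, sampled with
# perfectly overlapping 50 bp reads (all sequences invented for the example).
repeat = "ACGTACGTTGCAGGTCCATG"
genome = "TTTTTGATCA" + repeat + "CCCCCATTGC" + repeat + "GGGGGTAGCC"
reads = [genome[i:i + 50] for i in range(len(genome) - 49)]

def graph_stats(reads, k):
    """Build a de Bruijn graph; return (node count, junction count)."""
    succ = defaultdict(set)                  # k-mer -> set of successor k-mers
    for r in reads:
        for i in range(len(r) - k):
            succ[r[i:i + k]].add(r[i + 1:i + k + 1])
        succ.setdefault(r[-k:], set())       # last k-mer of the read
    junctions = sum(1 for s in succ.values() if len(s) > 1)
    return len(succ), junctions

for k in (19, 25):
    nodes, junctions = graph_stats(reads, k)
    print(f"k={k}: {nodes} nodes, {junctions} junctions")
# k=19 (shorter than the repeat): the two repeat copies collapse into shared
# nodes and create a junction; k=25 (spans the repeat): fewer nodes, no junctions.
```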



                    • #11
                      Now I am really confused.....
                      You said: the higher the k-mer, the less RAM you need, and the graph becomes smaller.
                      And now you said: fewer nodes compared to k-mer 19, fewer overlaps, fewer junctions => smaller de Bruijn graph => easier to compute.

                      So when does the graph become smaller?
                      When I have fewer overlaps, fewer junctions, a smaller de Bruijn graph?



                      • #12
                        With k-mer 49 compared to k-mer 19. You really should read a bit more about the algorithm. :-/ 19 is a really LOW k-mer; you normally choose higher k-mers, though of course this depends on your read length.



                        • #13
                          Now I understand!
                          Cheers mate!

