
  • Choice of hash length k in velvet

    Hi all, help needed, please.
    I am new to the bioinformatics field and am currently using Velvet for de novo assembly of my transcriptome data from an Illumina GA.
    My data are 35 bp single-end reads, ~11M reads in total.

    The Velvet manual states that empirical tests with different values for k are not that costly to run, so I have run assemblies with hash lengths from 15 to 31, but I have no idea how to determine which hash length is best from these tests.

    Here's some brief results.
    K    # contigs   N50
    15   175304      37
    17   109252      72
    19   65707       92
    21   41651       107
    23   25854       127
    25   15923       155
    27   9797        185
    29   5641        231
    31   2865        269


    Thanks for any help and ideas.
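
For readers comparing runs like the ones tabulated above: N50 is the contig length such that contigs of that length or longer hold at least half of the assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and do not come from Velvet.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the contig at which the running
    total first reaches half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([10, 8, 6, 4, 2]))
```

A higher N50 with far fewer contigs, as in the k=31 row, can also mean that much of the data was simply not assembled, so it is worth comparing total assembled length alongside N50.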

  • #2
    BLAST against a related species' transcriptome?

    You could also run velvet with -amos_file yes, and convert to ace, and view the assemblies, to get a feel for what they look like and see if you're comfortable with the assemblies.

    Finally, velvet seems to work best with kmer coverages from 20-30X ... your data set may not get up to that (depending on the organism), but if it's much higher than that, you might consider sub-sampling to bring the kmer coverage down ... oddly enough, this may help your assembly.



    • #3
      Yup, I did the BLAST and viewed the assemblies before, but I couldn't extract any significant information from the results. Or maybe I am just not good enough at analyzing the results.

      Sorry, may I ask how to calculate the k-mer coverage from transcriptome data?



      • #4
        yah - it's all gray area (no clear lines) with blast results from assemblies

        So, if you have a transcriptome size estimate, you want enough reads to have ~20-30X kmer coverage, as described here:


        Too low is obviously bad, but I've also found that extremely high kmer coverage can kill an assembly ... but that's probably over 100-500X ...

        There are other discussions of how to calculate kmer cov on seqanswers .. but let me know if it's not clear ...



        • #5
          Hi,

          I’m struggling with a similar problem. I’ve got very high coverage (>6000, taken from the contig names in the Velvet contigs.fa file). Does this mean that my assembly is not optimal? What is meant by “subsampling”? Dividing my reads into different subsets, doing separate assemblies, and then trying to assemble the contigs into longer contigs?

          Thanks



          • #6
            Hi Bea. Yes- if you randomly pick a smaller number of your reads, corresponding to lower coverage, then assemble ... and I would probably generate several (5? 10?) random subsamples, and assemble each, for statistical purposes (though, then you have to compare them somehow).

            A clear case in which I've seen this is with phiX. I tried to assemble the control lane of phiX reads from one of our Illumina runs, and got a terrible assembly (N50 < 100?). Then, after subsampling down to ~ 20-30X kmer coverage, velvet assembled phiX174 perfectly, in one contig.

            I'm not sure if other assemblers have this problem (Mira's author seems to think that could be the case), or whether it's a general issue or specific to an assembler's algorithm.
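
The random subsampling described in this post can be sketched roughly as follows. This is only an illustration, assuming plain four-line FASTQ records (no wrapped sequence lines); `subsample_fastq`, `fraction_for_target` and the file names in the usage comment are hypothetical names, not part of Velvet.

```python
import random


def fraction_for_target(current_kmer_cov, target_kmer_cov):
    """Fraction of reads to keep to go from the current to the target k-mer coverage."""
    return min(1.0, target_kmer_cov / current_kmer_cov)


def subsample_fastq(in_path, out_path, fraction, seed=0):
    """Keep each FASTQ record independently with probability `fraction`.

    Assumes every record is exactly four lines. Returns the number of
    records written. Changing `seed` gives a different random subsample.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                dst.writelines(record)
                kept += 1
    return kept


# Example usage (hypothetical file names):
#   frac = fraction_for_target(current_kmer_cov=6000, target_kmer_cov=25)
#   for seed in range(5):
#       subsample_fastq("reads.fq", f"sub_{seed}.fq", frac, seed=seed)
```

Each seed gives an independent subsample, which matches the suggestion of assembling several subsamples and comparing them.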



            • #7
              Also, Bea, note that when you see a coverage value in a velvet contig name, that's k-mer coverage ... and the length is in k-mers as well.
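
To make that concrete, here is a small sketch that pulls the length and coverage out of a contig name and converts the k-mer length to base pairs (a contig of N k-mers spans N + k - 1 bases). It assumes names of the form `NODE_<id>_length_<n>_cov_<x>`; the header in the example is made up, and `parse_velvet_header` is a hypothetical helper.

```python
import re

# Assumed header layout: >NODE_<id>_length_<length in k-mers>_cov_<k-mer coverage>
_NODE_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def parse_velvet_header(header, k):
    """Parse a contig name and report its length in both k-mers and bp.

    `k` is the hash length used for the assembly; it is needed because the
    length in the name counts k-mers, not bases.
    """
    match = _NODE_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised contig name: {header!r}")
    length_kmers = int(match.group(2))
    return {
        "node": int(match.group(1)),
        "length_kmers": length_kmers,
        "length_bp": length_kmers + k - 1,
        "kmer_coverage": float(match.group(3)),
    }


if __name__ == "__main__":
    # Made-up header, for illustration only.
    print(parse_velvet_header(">NODE_7_length_100_cov_25.5", k=21))
```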



              • #8
                I had come across those threads before, but I don't know how to calculate the coverage and thus can't get the k-mer coverage from the formula.

                As discussed here,
                http://seqanswers.com/forums/showthr...overage+velvet

                It used the reference sequence for calculating coverage. But isn't Velvet a de novo assembler that doesn't use a reference?

                So, more help is needed on calculating the coverage. Thanks.

                Another problem is that most discussions are about genomic data; is there any difference between transcriptomic and genomic data when calculating coverage?
                Last edited by magick; 08-13-2009, 07:42 PM. Reason: url not showing



                • #9
                  @magick: If you have some estimate of the size of the genome you're trying to assemble, that might be the best you can do. Of course, a run through velvetg without any parameters specified will result in some statistics (in the stats.txt file) that can help you estimate the coverage, as described in the manual.
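
One way to get a coverage estimate out of stats.txt without a reference is a length-weighted median of the per-node coverage. The sketch below assumes a tab-delimited file with a header row containing columns named `lgth` and `short1_cov`; those names are my assumption about the file layout, so check them against the header of your own stats.txt. `weighted_median_coverage` is a hypothetical helper.

```python
import csv
import io
import math


def weighted_median_coverage(stats_text, length_col="lgth", cov_col="short1_cov"):
    """Length-weighted median of node coverage from stats.txt-style text.

    Rows with missing, non-numeric, non-finite or zero-length values are
    skipped. Column names are assumptions; pass your own if they differ.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(stats_text), delimiter="\t"):
        try:
            length = float(row[length_col])
            cov = float(row[cov_col])
        except (KeyError, TypeError, ValueError):
            continue
        if length > 0 and math.isfinite(length) and math.isfinite(cov):
            rows.append((cov, length))
    if not rows:
        raise ValueError("no usable rows; check the column names")
    rows.sort()
    half_total = sum(length for _, length in rows) / 2
    running = 0.0
    for cov, length in rows:
        running += length
        if running >= half_total:
            return cov


# Example usage (path is hypothetical):
#   with open("out_k21/stats.txt") as fh:
#       print(weighted_median_coverage(fh.read()))
```

For transcriptome data the result is only a rough guide, since coverage varies from transcript to transcript.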



                  • #10
                    the coverage cutoff will also have a huge effect on total coverage (i.e. assembly length) and contig count. Make sure you explore that setting from 2x-10x (measured in kmers)
                    --
                    Jeremy Leipzig
                    Bioinformatics Programmer
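
A sweep over the coverage cutoff like the one suggested here could be set up along these lines. The sketch only builds the `velvetg` command lines using the `-cov_cutoff` option and does not run anything; the directory name `out_k31` and the helper `velvetg_commands` are hypothetical.

```python
import shlex


def velvetg_commands(assembly_dir, cutoffs=range(2, 11)):
    """Build one velvetg argument list per coverage cutoff (in k-mer coverage).

    Each velvetg run rewrites the output files in `assembly_dir`, so the
    results of one run need to be copied elsewhere before starting the next.
    """
    return [
        ["velvetg", assembly_dir, "-cov_cutoff", str(cutoff)]
        for cutoff in cutoffs
    ]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    for cmd in velvetg_commands("out_k31"):
        print(shlex.join(cmd))
```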



                    • #11
                      If my Velvet contigs have a very broad coverage distribution, from 2 to 6000, is subsetting the data necessary to improve the assembly?

                      Thanks,

                      Justin



                      • #12
                        Hey,

                        i hope that helps.

                        The authors of Velvet recommend choosing k using: E(X) = C * ((l - k + 1) / l),
                        where E(X) is the expected number of times a k-mer from a genome of length G
                        is observed in a set of n reads of length l, and
                        C = n * l / G is the base coverage. Choose k odd and larger than 10.
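
As a worked illustration of the formula above, the sketch below computes E(X) for each odd k using the read count and read length from the first post (11M reads of 35 bp). The transcriptome size of 30 Mb is a made-up placeholder, not a value from this thread; substitute your own estimate. `kmer_coverage` is a hypothetical helper name.

```python
def kmer_coverage(n_reads, read_len, target_size, k):
    """Expected k-mer coverage E(X) = C * (l - k + 1) / l, with C = n * l / G."""
    base_coverage = n_reads * read_len / target_size
    return base_coverage * (read_len - k + 1) / read_len


if __name__ == "__main__":
    N_READS = 11_000_000      # from the first post
    READ_LEN = 35             # from the first post
    TARGET_SIZE = 30_000_000  # hypothetical transcriptome size in bases

    for k in range(15, 32, 2):  # odd k from 15 to 31
        cov = kmer_coverage(N_READS, READ_LEN, TARGET_SIZE, k)
        print(f"k={k}: expected k-mer coverage {cov:.1f}")
```

With reads this short, k-mer coverage drops quickly as k grows, because each 35 bp read contains only l - k + 1 k-mers. For a transcriptome this is an average at best, since coverage depends on how highly each transcript is expressed.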


                        best,


                        phil

