  • magick
    Junior Member
    • Jul 2009
    • 5

    Choice of hash length k in velvet

    Hi all, I'd appreciate some help.
    I am new to bioinformatics and am currently using velvet for de novo assembly of my transcriptome data from an Illumina GA.
    My data are 35 bp single-end reads, ~11M reads in total.

    The velvet manual states that empirical tests with different values of k are not that costly to run, so I have run hash lengths from 15 to 31, but I have no idea how to determine which hash length is best from these tests.

    Here are some brief results.
    K     # contigs   N50
    15    175304      37
    17    109252      72
    19    65707       92
    21    41651       107
    23    25854       127
    25    15923       155
    27    9797        185
    29    5641        231
    31    2865        269


    Thanks for any help and ideas.
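
For readers comparing the N50 column in the table above, here is a minimal sketch of how N50 is commonly computed from a list of contig lengths. The function name and the example lengths are made up for illustration; they are not taken from the poster's assemblies.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return lengths[-1]  # not reached for positive lengths


# Hypothetical contig lengths, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A higher N50 alongside far fewer contigs, as in the table, can also mean that less of the transcriptome was assembled, so N50 alone is probably not enough to pick k.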
  • jnfass
    Member
    • Aug 2008
    • 88

    #2
    Blast to related species' transcriptome?

    You could also run velvet with -amos_file yes, and convert to ace, and view the assemblies, to get a feel for what they look like and see if you're comfortable with the assemblies.

    Finally, velvet seems to work best with kmer coverages from 20-30X ... your data set may not get up to that (depending on the organism), but if it's much higher than that, you might consider sub-sampling to bring the kmer coverage down ... oddly enough, this may help your assembly.


    • magick
      Junior Member
      • Jul 2009
      • 5

      #3
      Yup, I ran the blast and viewed the assemblies before, but I couldn't extract any significant information from the results. Or maybe I am not good enough at analyzing them.

      Sorry, how do I calculate the k-mer coverage from transcriptome data?


      • jnfass
        Member
        • Aug 2008
        • 88

        #4
        yah - it's all gray area (no clear lines) with blast results from assemblies

        So, if you have a transcriptome size estimate, you want enough reads to have ~20-30X kmer coverage, as described here:


        Too low is obviously bad, but I've also found that extremely high kmer coverage can kill an assembly ... but that's probably over 100-500X ...

        There are other discussions of how to calculate kmer cov on seqanswers .. but let me know if it's not clear ...
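
A minimal sketch of the k-mer coverage calculation, using the relation from the Velvet manual, Ck = C * (L - k + 1) / L, with C = (number of reads * L) / target size. The read count and read length are from the first post; the 30 Mbp transcriptome size is a made-up placeholder, not a value from this thread, and the function name is hypothetical. For a transcriptome this only gives an average, since coverage varies with expression level.

```python
def kmer_coverage(n_reads: int, read_len: int, target_size: int, k: int) -> float:
    """Expected k-mer coverage Ck = C * (L - k + 1) / L,
    where C = n_reads * read_len / target_size is nucleotide coverage."""
    if k > read_len:
        raise ValueError("k must not exceed the read length")
    nucleotide_cov = n_reads * read_len / target_size
    return nucleotide_cov * (read_len - k + 1) / read_len


# ASSUMPTION: 30 Mbp is a placeholder transcriptome size, for illustration only.
ASSUMED_TRANSCRIPTOME_BP = 30_000_000

for k in range(15, 33, 2):
    ck = kmer_coverage(11_000_000, 35, ASSUMED_TRANSCRIPTOME_BP, k)
    print(f"k={k:2d}  Ck={ck:5.1f}")
```

Note that the read length cancels out of the first factor, so Ck is simply n_reads * (L - k + 1) / target_size: with short reads, each step up in k costs a noticeable share of the k-mer coverage.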


        • bea
          Junior Member
          • May 2009
          • 3

          #5
          Hi,

          I’m struggling with a similar problem. I’ve got very high coverage (>6000, taken from the contig names in the velvet contigs.fa file). Does this mean that my assembly is not optimal? What is meant by “subsampling”? Does it mean dividing my reads into different subsets, doing separate assemblies, and then trying to assemble the contigs into longer contigs?

          Thanks


          • jnfass
            Member
            • Aug 2008
            • 88

            #6
            Hi Bea. Not quite: subsampling means randomly picking a smaller number of your reads, corresponding to lower coverage, and assembling just those. I would probably generate several (5? 10?) random subsamples and assemble each, for statistical purposes (though then you have to compare them somehow).

            A clear case in which I've seen this is with phiX. I tried to assemble the control lane of phiX reads from one of our Illumina runs, and got a terrible assembly (N50 < 100?). Then, after subsampling down to ~ 20-30X kmer coverage, velvet assembled phiX174 perfectly, in one contig.

            I'm not sure if other assemblers have this problem (Mira's author seems to think that could be the case), or whether it's a general issue or specific to an assembler's algorithm.
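
A minimal sketch of the random subsampling described above, assuming the reads are in plain 4-line FASTQ records. All function names are hypothetical, and the in-memory reads at the bottom are fabricated purely so the example runs; in practice the lines would come from an open FASTQ file.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield 4-line FASTQ records (assumes sequences are not line-wrapped)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def keep_fraction(current_ck: float, target_ck: float) -> float:
    """Fraction of reads to keep so k-mer coverage drops to roughly target_ck."""
    return min(1.0, target_ck / current_ck)


def subsample(lines: Iterable[str], fraction: float, seed: int) -> Iterator[List[str]]:
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    for record in fastq_records(lines):
        if rng.random() < fraction:
            yield record


# Fabricated reads, for illustration only.
demo = [x for i in range(1000) for x in (f"@read{i}", "ACGT", "+", "IIII")]
fraction = keep_fraction(current_ck=6000, target_ck=25)
for seed in range(5):  # several independent subsamples, one per seed
    kept = list(subsample(demo, fraction, seed))
    print(f"seed={seed}: kept {len(kept)} of 1000 reads")
```

Using a different seed per subsample gives independent read sets, each of which can be assembled separately and then compared.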


            • jnfass
              Member
              • Aug 2008
              • 88

              #7
              Also, Bea, note that when you see a coverage value in a velvet contig name, that's k-mer coverage ... and the length is in k-mers as well.
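
A small sketch of reading those values back out of a contig name and converting them. It assumes headers of the form `>NODE_<id>_length_<n>_cov_<x>`, which is how I understand Velvet names contigs; check your own contigs.fa. The function names, the example header, and the k and read length used below are hypothetical.

```python
import re
from typing import Tuple

# ASSUMPTION: Velvet-style header, e.g. ">NODE_12_length_269_cov_6000.5"
NODE_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def parse_velvet_header(header: str, k: int) -> Tuple[int, int, float]:
    """Return (node id, contig length in bp, k-mer coverage).

    A contig made of n consecutive k-mers spans n + k - 1 bases."""
    m = NODE_RE.match(header.strip())
    if m is None:
        raise ValueError(f"not a Velvet-style header: {header!r}")
    node_id = int(m.group(1))
    length_bp = int(m.group(2)) + k - 1
    kmer_cov = float(m.group(3))
    return node_id, length_bp, kmer_cov


def nucleotide_coverage(kmer_cov: float, read_len: int, k: int) -> float:
    """Invert Ck = C * (L - k + 1) / L to recover nucleotide coverage C."""
    return kmer_cov * read_len / (read_len - k + 1)


# Hypothetical header, k, and read length, for illustration only.
node, length_bp, ck = parse_velvet_header(">NODE_12_length_269_cov_6000.5", k=31)
print(node, length_bp, ck, nucleotide_coverage(ck, read_len=35, k=31))
```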


              • magick
                Junior Member
                • Jul 2009
                • 5

                #8
                I had come across those threads before, but I don't know how to calculate the coverage and thus can't get the k-mer coverage from the formula.

                As discussed here,
                Hello, all Does anyone know what the coverage means in velvet ? According to the manual of velvet, the relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is Ck = C*(L-k+1)/L. Is C calculated as follows ? Read length: 36bp Number of reads: 50,000,000 X 2 (paired-end) Size of reference


                It uses the reference sequence to calculate coverage. But isn't velvet a de novo assembler that doesn't use a reference?

                So, more help is needed on calculating the coverage. Thanks.

                Another problem is that most discussions are about genomic data. Is there any difference between transcriptomic and genomic data in calculating coverage?
                Last edited by magick; 08-13-2009, 07:42 PM. Reason: url not showing


                • jnfass
                  Member
                  • Aug 2008
                  • 88

                  #9
                  @magick: If you have some estimate of the size of the genome you're trying to assemble, that might be the best you can do. Of course, a run through velvetg without any parameters specified will result in some statistics (in the stats.txt file) that can help you estimate the coverage, as described in the manual.
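
One way to get a coverage estimate out of stats.txt is a length-weighted median of the per-node coverage. This is only a sketch: it assumes stats.txt is tab-delimited with a header row containing `lgth` and `short1_cov` columns (as I recall the format; check your own file), the function name is hypothetical, and the three rows in the example are made up.

```python
import csv
import io
from typing import TextIO


def weighted_median_coverage(stats_file: TextIO,
                             cov_col: str = "short1_cov",
                             len_col: str = "lgth") -> float:
    """Length-weighted median of node coverage from a stats.txt-like table."""
    rows = []
    for row in csv.DictReader(stats_file, delimiter="\t"):
        rows.append((float(row[cov_col]), int(row[len_col])))
    rows.sort()  # ascending by coverage
    total_length = sum(length for _, length in rows)
    running = 0
    for cov, length in rows:
        running += length
        if running * 2 >= total_length:
            return cov
    raise ValueError("no nodes found")


# Fabricated three-node table, for illustration only.
demo = io.StringIO(
    "ID\tlgth\tshort1_cov\n"
    "1\t100\t5.0\n"
    "2\t300\t20.0\n"
    "3\t50\t400.0\n"
)
print(weighted_median_coverage(demo))
```

Weighting by length keeps the many short, low-coverage error nodes from dominating the estimate. For transcriptome data the distribution is broad, so a single number like this should be read with caution.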


                  • Zigster
                    Jeremy Leipzig
                    • May 2009
                    • 116

                    #10
                    the coverage cutoff will also have a huge effect on total coverage (i.e. assembly length) and contig count. Make sure you explore that setting from 2x-10x (measured in kmers)
                    --
                    Jeremy Leipzig
                    Bioinformatics Programmer
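
A sketch of how such a coverage-cutoff sweep could be scripted. It only builds the command lines and does not run anything. The `-short`, `-fastq`, and `-cov_cutoff` options are written as I remember them from the Velvet manual, so check the usage output of your velveth/velvetg build; the file name, directory prefix, and function name are hypothetical.

```python
from typing import List, Sequence


def sweep_commands(reads_fastq: str, k: int, cutoffs: Sequence[float],
                   prefix: str = "sweep") -> List[List[str]]:
    """Build velveth/velvetg argument lists, one output directory per cutoff,
    so that results from different cutoffs do not overwrite each other."""
    commands: List[List[str]] = []
    for cutoff in cutoffs:
        outdir = f"{prefix}_k{k}_cov{cutoff}"
        commands.append(["velveth", outdir, str(k), "-short", "-fastq", reads_fastq])
        commands.append(["velvetg", outdir, "-cov_cutoff", str(cutoff)])
    return commands


# Hypothetical input file; cutoffs span the 2x-10x range suggested above.
for cmd in sweep_commands("reads.fastq", 25, [2, 4, 6, 8, 10]):
    print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.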


                    • zhangju
                      Member
                      • May 2011
                      • 18

                      #11
                      If my velvet contigs have a very broad coverage distribution, from 2 to 6000, is subsetting the data necessary to improve the assembly?

                      Thanks,

                      Justin


                      • sphil
                        Senior Member
                        • Apr 2010
                        • 192

                        #12
                        Hey,

                        i hope that helps.

                        The authors of Velvet recommend choosing k using the relation E(X) = C * (l - k + 1) / l,
                        where E(X) is the expected number of times a k-mer from a genome of length G
                        is observed in a set of n reads of length l, and
                        C = n * l / G is the coverage. Choose k odd and larger than 10.


                        best,


                        phil
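
A minimal sketch that turns the relation above around: given a read length, a nucleotide coverage C, and a target E(X), it returns the largest odd k that still reaches the target, using k <= l + 1 - E(X) * l / C. The function name and the numbers in the example are hypothetical; only the formula and the "odd and larger than 10" rule come from the post.

```python
import math
from typing import Optional


def largest_k_for_target(read_len: int, nucleotide_cov: float,
                         target_ck: float, min_k: int = 11) -> Optional[int]:
    """Largest odd k with C * (l - k + 1) / l >= target_ck,
    or None if no odd k >= min_k reaches the target."""
    exact = read_len + 1 - target_ck * read_len / nucleotide_cov
    k = math.floor(exact)
    if k % 2 == 0:
        k -= 1  # step down to the nearest odd value
    return k if k >= min_k else None


# Hypothetical values: 35 bp reads, 100x nucleotide coverage, target E(X) of 25.
print(largest_k_for_target(35, 100.0, 25.0))
# With only 12x nucleotide coverage the target cannot be reached at any k.
print(largest_k_for_target(35, 12.0, 25.0))
```

Your Velvet build may also impose a maximum k at compile time, which this sketch does not check.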
