  • Genomics101
    Member
    • May 2012
    • 60

    De novo assembly using Velvet: any idea why such small Kmers with long reads?

    Greetings!

    I am doing de novo genome assembly of a 23Mb genome using Velvet 1.2.10 and quality-trimmed MiSeq (Illumina) reads that average about 180bp in length. I have assembled different individuals of this species before with 100bp reads, and the best N50s and very good max. contig sizes always came at a kmer size of around 61. With this assembly, however, the optimal kmer sizes (for N50) are really low, in the low/mid 30s. Velvet estimated the kmer coverage at an average of 23X.

    Results here (kmer size is on the x-axis, the left y-axis is for the max. contig size ("max," red line) and the right y-axis is for the N50s (blue line)):

    I am concerned about how good an assembly would be for such a large genome with such a small kmer, and I also just wonder why, with longer reads, I need smaller kmers.

    Thanks!
    Last edited by Genomics101; 02-21-2014, 03:51 PM. Reason: spelling, added kmer coverage detail
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Your coverage is too low. At 23x coverage with 100bp reads and k=60, you'll only get a kmer depth of about 40% of that, or around 10, which will give a fragmented assembly that misses low-depth areas and makes it hard to distinguish valid kmers from error kmers.

    Edit:

    Oops, I see now that the old assemblies were 100bp and the new ones are 180bp. Still, the point remains that as you increase K you decrease kmer depth, and you appear to have too little data for that to help. What are the insert size distribution and quality distribution? I've seen a lot of MiSeq libraries get made with insert sizes shorter than read length, so you might also consider adapter-trimming based on kmers before quality trimming.
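
    The 40% figure above follows from the usual relation between base coverage C and kmer depth C_k for reads of length L, C_k = C * (L - k + 1) / L (this is the relation given in the Velvet manual). A minimal sketch, assuming uniform read length; the function name is only illustrative:

```python
def kmer_depth(base_coverage: float, read_length: int, k: int) -> float:
    """Expected kmer depth C_k = C * (L - k + 1) / L for uniform-length reads."""
    if k > read_length:
        return 0.0  # a read shorter than k contributes no kmers
    return base_coverage * (read_length - k + 1) / read_length


# Numbers from this post: 23x base coverage, 100 bp reads, k = 60.
print(round(kmer_depth(23, 100, 60), 2))

# Same base coverage with 180 bp reads: depth still falls as k grows.
for k in (31, 61, 91, 121):
    print(k, round(kmer_depth(23, 180, k), 1))
```

    Quality-trimmed reads vary in length, so treat this as a rough estimate rather than what Velvet will report.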
    Last edited by Brian Bushnell; 02-21-2014, 02:57 PM.


    • AdrianP
      Senior Member
      • Apr 2011
      • 130

      #3
      Originally posted by Genomics101:
      Velvet estimated the coverage averaging 23X.
      Are you talking about kmer coverage or nucleotide coverage?

      In other words, your biggest contigs, what is their cov value? (those show kmer coverage)


      • Genomics101
        Member
        • May 2012
        • 60

        #4
        Kmer coverage


        • AdrianP
          Senior Member
          • Apr 2011
          • 130

          #5
          Originally posted by Genomics101:
          Kmer coverage
          Okay, then what the person in post #2 said doesn't apply, because they assumed nucleotide coverage. You need to go to higher kmers. A kmer coverage higher than 20 is a waste; you should aim for between 10 and 20.

          Use VelvetOptimiser and try kmers up to 160-180. You might see a second peak in N50; this is common.


          • Genomics101
            Member
            • May 2012
            • 60

            #6
            @Brian Bushnell The insert size mean value is 407 with an SD of 130; the 23X is kmer coverage, and the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all reads. There are no adaptors, as they are removed by the Illumina 1.9 pipeline.
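
            For reference, a rough sketch of which kmer sizes the 10 to 20X kmer-coverage target from post #5 would correspond to at these numbers. It inverts C_k = C * (L - k + 1) / L and assumes every read is exactly 180bp, which is not true of quality-trimmed data, so the result is only a ballpark. The function name is hypothetical, and k is kept odd because Velvet expects odd hash lengths:

```python
import math


def k_range_for_depth(base_coverage, read_length, min_depth, max_depth):
    """Smallest and largest odd k whose expected kmer depth
    C * (L - k + 1) / L falls within [min_depth, max_depth]."""
    def k_for(depth):
        # Solve C_k = C * (L - k + 1) / L for k.
        return read_length + 1 - depth * read_length / base_coverage

    lowest = math.ceil(k_for(max_depth))    # higher depth means smaller k
    highest = math.floor(k_for(min_depth))
    if lowest % 2 == 0:
        lowest += 1                         # round up to the next odd k
    if highest % 2 == 0:
        highest -= 1                        # round down to the previous odd k
    return lowest, highest


# Assumed inputs: 43x base coverage, 180 bp reads, target depth 10-20x.
print(k_range_for_depth(43, 180, 10, 20))
```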


            • Genomics101
              Member
              • May 2012
              • 60

              #7
              @AdrianP Thanks for your reply, but in the initial analysis I actually tried kmers all the way up to 191 (with a larger gap between them):


              • AdrianP
                Senior Member
                • Apr 2011
                • 130

                #8
                Originally posted by Genomics101:
                @Brian Bushnell The insert size mean value is 407 with an SD of 130, and the 23X is kmer coverage, the base coverage is about 43X. The per sequence quality is between 36 and 38 for almost all of them. There are no adaptors as they are removed by the Illumina 1.9 pipeline.
                Okay, with a base coverage of 43X you do not want Velvet. De Bruijn graph (DBG) assemblers need high coverage for repeat resolution.

                My advice is to use SeqPrep to merge your reads. You should end up with 3 files: forward, reverse, and merged. Feed those to the MIRA assembler, which is an overlap-layout-consensus (OLC) assembler, and I expect you will get better results.


                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  Originally posted by Genomics101:
                  @Brian Bushnell The insert size mean value is 407 with an SD of 130, and the 23X is kmer coverage, the base coverage is about 43X. The per sequence quality is between 36 and 38 for almost all of them. There are no adaptors as they are removed by the Illumina 1.9 pipeline.
                  Hmm, well, in that case I am surprised that the N50 is best at such a low K, unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?


                  • AdrianP
                    Senior Member
                    • Apr 2011
                    • 130

                    #10
                    Originally posted by Brian Bushnell:
                    Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?
                    To obtain that, I can recommend:


                    CMD:
                    ./kmergenie --diploid -k <higher_kmer> -e 1 -l <lower_kmer> -t <cpu_threads> -o <output_name> <read_location>

                    Start with a higher kmer of 101 and a lower kmer of 41, and see what the graphs look like.


                    • Genomics101
                      Member
                      • May 2012
                      • 60

                      #11
                      Originally posted by Brian Bushnell:
                      Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?


                      • Brian Bushnell
                        Super Moderator
                        • Jan 2014
                        • 2709

                        #12
                        Ah - what I actually mean is, a graph for a fixed kmer length (of, say, 31) where the X axis is depth and Y axis is number of kmers found at that depth, both log-scale. Ideally, you should have a sharp peak at some depth (maybe 40) and it should drop dramatically on the left and right.

                        My attachment shows the 31-mer frequency histogram for E. coli synthetic reads. You can see a main peak at about 200 and a few repeat peaks after that. If the data were real and had uneven coverage, there would be a broad peak rather than a sharp one.

                        FYI, I generated this with the 'khist.sh' script in the BBMap package and plotted it in Excel.
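
                        To make the definition concrete, here is a toy sketch of what such a histogram contains: count every kmer in the reads, then count how many distinct kmers occur at each depth. It is not how khist.sh works internally; it holds everything in memory, ignores reverse complements, and the function name is made up for illustration, so it is only suitable for small test sets.

```python
from collections import Counter


def kmer_depth_histogram(reads, k):
    """Return a Counter mapping depth -> number of distinct kmers
    observed exactly that many times across all reads."""
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    return Counter(kmer_counts.values())


# Tiny made-up reads, just to show the shape of the output.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT"]
hist = kmer_depth_histogram(reads, 4)
for depth in sorted(hist):
    print(depth, hist[depth])  # plot depth (X) against count (Y) on log axes
```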
                        Last edited by Brian Bushnell; 02-21-2014, 04:33 PM.


                        • Genomics101
                          Member
                          • May 2012
                          • 60

                          #13
                          I also have the option of using the longer and more uniform untrimmed reads, but the quality is pretty questionable:



                          I tried doing an assembly with these and got better N50s at very high kmers (~99-135), but the kmer depth has a weird bell-curve relationship with kmer size rather than a direct one. The lowest kmer I tried (45) had a coverage of only 1.2, and the kmer coverage peaked at 41.5X at kmer = 93 (which also gave a relatively good N50 of ~20kb). Also, I am very wary of using data with so many errors.


                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            #14
                            Read length uniformity shouldn't matter to Velvet. Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10. Excessive trimming can also cause biases.
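
                            As an illustration of what a conservative quality trim does, here is a naive sketch that only strips bases from the 3' end while their Phred score is below a threshold. It assumes Phred+33 encoded quality strings and is not the algorithm of any particular trimming tool; the function name is hypothetical.

```python
def trim_3prime(seq, qual, min_q=10, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record;
    offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up record: qualities are 40,40,40,40,40,20,2,0 in Phred+33.
seq, qual = "ACGTACGT", "IIIII5#!"
print(trim_3prime(seq, qual, min_q=10))  # lenient threshold keeps more bases
print(trim_3prime(seq, qual, min_q=30))  # strict threshold shortens the read
```

                            Comparing the two calls shows the trade-off discussed here: the stricter the cutoff, the shorter the reads that are left for building kmers.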


                            • Genomics101
                              Member
                              • May 2012
                              • 60

                              #15
                              Originally posted by Brian Bushnell:
                              Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10.
                              Thanks. I didn't trim by length but by quality (Q30 cutoff), and most of the reads just came out at around that length. But less strict trimming may be the answer.

                              Since I have your very helpful attention here, and since I am doing the assembly with the untrimmed reads, do you have a suggestion for a good way to assess how the sequencing errors are affecting the accuracy of the contigs? Should I just BLAST a few regions I have done with Sanger sequencing?
