Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Low mapping percentage of reads on assembled contigs

    Hello experts,

    I posted similar question last year but I haven't figured out the problem. I went through old posts too to see if I could find the answer but couldn't. Any suggestion or idea on my problem would be greatly appreciated.

    So I have Illumina HiSeq 100 bp PE reads (library insert size is 200 bp). After QC of reads, I assembled reads using velvet. The average contig size is about 250 bp. Anyway, I tried to map the reads on the assembled contigs using bowtie and only less than 10% of reads were mapped on the assemblies. I just observed that "contig.fa" file generated from the velvet assembly contains very short length of contigs which are between 40 - 60 bp. I realized that I used read length cutoff of 10 bp during QC meaning I kept all reads longer than 10 bp. Maybe these short reads were used for the assembly and generated weird contigs resulting in low mapping percentage? This maybe also a reason for short average contig length?

  • #2
    bowtie does not appear to map off the ends of contigs; with short contigs, a large percentage of the reads will not map simply because they are not fully contained. BBMap will map off the ends, though.

    The short contigs probably did not come from short reads; Velvet does not assemble from reads, but from kmers. Reads of length 10 are shorter than the kmer length you used (I have no idea what it was, but I'm sure it was longer than 10) so they would have been ignored. It's generally wise to do a 'kmer sweep' - multiple assemblies at different kmer lengths - to determine which is best for your data.

    How much data do you have, what kind of genome is it, and what is the expected genome size? What's the origin - is it isolate, metagenomic, amplified single cell, other...? Also, how did you perform QC? You may have been overly aggressive in trimming... I find Q6 to Q10 works well, but it depends on the trimmer. Most are not optimal and the cutoff means different things to different trimmers. You can also try normalization and error correction, which help in some cases, depending on the type of data. There are lots of options but we need more information.

    Comment


    • #3
      You can get velvet to omit very small contigs from the final contigs.fa file by setting the -min_contig_length parameter.

      Comment


      • #4
        Thanks all for the suggestion.

        Brian, to follow up some of your questions,
        I have environmental virus metagenomes. Average virus genome size is thought to be about 2-3 kb. I have about 10 million reads per sample. I used kmer range of 19 to 35 to find the best one.

        Any other ideas on what might cause low mapping percentage? Thanks.

        Comment


        • #5
          You need to try longer kmers to see if you can get velvet to give you longer contigs.

          Comment


          • #6
            Thanks mastal!

            When I tried kmer of 19-35 bp, I observed that kmer of 25 looked best so I did not try with longer kmer length anymore. I will try with longer kmer now.

            Comment


            • #7
              Originally posted by morning latte View Post
              Thanks all for the suggestion.

              Brian, to follow up some of your questions,
              I have environmental virus metagenomes. Average virus genome size is thought to be about 2-3 kb. I have about 10 million reads per sample. I used kmer range of 19 to 35 to find the best one.

              Any other ideas on what might cause low mapping percentage? Thanks.
              Bad assemblies (incomplete or highly fragmented) cause low mapping percentages. For bacterial metagenomes (I have no experience with viral), which often have an exponential distribution of organism abundance, we have found subsampling or normalization to be helpful - first you remove the low-frequency reads and assemble the dominant strains, then assemble the reads that don't map to the first assembly. Velvet really likes an even roughly 40x coverage, and metagenomes are not like that.

              If your metagenome is too diverse, you may simply not have enough coverage for assembly. I suggest you post the kmer frequency histogram of your data. You can get it like this (with the bbmap package):

              khist.sh in=reads.fq hist=histogram.txt -Xmx29g

              ..where "29g" should be about 85% of your physical memory. That package also includes normalization and error-correction tools.

              Comment


              • #8
                Thank you Brian Bushnell.

                I might be too new to fully understand what you asked about kmer frequency histogram. I used quality trimmed file to generate a histogram below. Please let me know if I misunderstood anything. Thank you.
                Attached Files

                Comment


                • #9
                  I wouldn't use Velvet for metagenomic samples. I'd recommend IDBA-UD or Ray instead. Perhaps Spades too even though that was not develoiped with metagenomics in mind. You might want to give them a shot.

                  Comment


                  • #10
                    I am having similar problems with Bowtie, however I used IDBA.

                    I have an IDBA-UD assembly with a total length of 600 Mbp. This was generated from 284 M paired reads of ~140 bp each.

                    I assume that this means I should have ~ 79,500 Mbp assembling to produce 600 Mbp, so I should have an average coverage of ~130x.

                    I noticed that my calculated coverage over contigs was low, and then confirmed visually using Samtools tview command that parts of my assembly have no coverage at all.

                    I think this is important: each line of a SAM file is a read (ignoring the header). Whether they align or not, I would think that I should have the same number of lines in my SAM file as I do reads, right? My input contained 568 M reads, yet I only have 142 M lines in my SAM file. Of these 142 M reads, around half (75 M) don't map to the bacterial metagenome.

                    Can anyone tell me why my SAM file isn't bigger? And why are so many reads not aligning to the metagenome anywhere?

                    Comment


                    • #11
                      Originally posted by morning latte View Post
                      Thank you Brian Bushnell.

                      I might be too new to fully understand what you asked about kmer frequency histogram. I used quality trimmed file to generate a histogram below. Please let me know if I misunderstood anything. Thank you.
                      That's exactly what I was looking for, but it's more useful if you plot it on a log-log scale and take the X axis out to at least 10000.

                      A "good" graph will look more like the attachment. It shows a strong peak at around 200, meaning on average it has 200x coverage of the genome. The stuff below 16x is (in that case) all error kmers. The assemble-able stuff should be at least 10x. So if you post the a larger, log-log version of the histogram, I can better diagnose what the problem is. If that IS your complete histogram, then the problem appears to be too little coverage, but it's hard to be sure without using a log-log scale.
                      Attached Files

                      Comment


                      • #12
                        Originally posted by AndrewRGross View Post
                        I am having similar problems with Bowtie

                        ...

                        I think this is important: each line of a SAM file is a read (ignoring the header). Whether they align or not, I would think that I should have the same number of lines in my SAM file as I do reads, right? My input contained 568 M reads, yet I only have 142 M lines in my SAM file. Of these 142 M reads, around half (75 M) don't map to the bacterial metagenome.

                        Can anyone tell me why my SAM file isn't bigger? And why are so many reads not aligning to the metagenome anywhere?
                        Try a different aligner? Bowtie is old and not designed for such long reads, and certainly not designed for metagenomics (in which there are many short contigs). I can't explain why the sam file doesn't contain all the reads, though, unless it is truncated because (for example) bowtie crashed.

                        Comment


                        • #13
                          Thank you Brian!

                          I changed the chart according to your suggestion and attached it. Please take a look at it. Thanks again.
                          Attached Files

                          Comment


                          • #14
                            Well, I don't quite understand why that graph has 2 distinct series for kmer count, seemingly alternating at odd and even depths. But if that's an accurate kmer histogram, then you just don't have enough coverage to assemble. You have very few kmers at depth 30+ which is really what Velvet needs for a good assembly, while most of your data is unassembleable. So I would suggest requesting more data; you will never get a good assembly out of what you have.

                            If you can't get any more data, then you might get better results with no quality trimming, or only trimming to Q5 (since you'll have more data), error-correcting, and throwing away reads with median kmer counts below ~5, which will remove a lot of noise:

                            bbnorm.sh in=reads.fq out=normalized.fq trimq=5 qtrim=rl target=50 mindepth=5 ecc=t -Xmx29g

                            That's available here.

                            Comment


                            • #15
                              Thank you Brian!

                              I just realized that I mistakenly used normalized reads for generating kmer frequency histogram. I am sorry about that. I generated another histogram with raw reads which were not even quality controlled. Can you take a look at it?
                              Attached Files

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X