SEQanswers

03-18-2014, 05:45 PM   #1
morning latte

Low mapping percentage of reads on assembled contigs

Hello experts,

I posted a similar question last year, but I haven't figured out the problem. I also went through old posts to see if I could find the answer, but couldn't. Any suggestions or ideas about my problem would be greatly appreciated.

I have Illumina HiSeq 100 bp PE reads (library insert size 200 bp). After read QC, I assembled the reads using Velvet; the average contig size is about 250 bp. I then tried to map the reads back to the assembled contigs using Bowtie, and less than 10% of the reads mapped to the assemblies. I also noticed that the "contigs.fa" file generated by the Velvet assembly contains very short contigs, between 40 and 60 bp. I realized that I used a read-length cutoff of 10 bp during QC, meaning I kept all reads longer than 10 bp. Maybe these short reads were used in the assembly and generated weird contigs, resulting in the low mapping percentage? Could this also explain the short average contig length?

03-18-2014, 08:16 PM   #2
Brian Bushnell

Bowtie does not appear to map off the ends of contigs; with short contigs, a large percentage of the reads will not map simply because they are not fully contained. BBMap will map off the ends, though.

The short contigs probably did not come from short reads; Velvet does not assemble from reads, but from kmers. Reads of length 10 are shorter than the kmer length you used (I have no idea what it was, but I'm sure it was longer than 10), so they would have been ignored. It's generally wise to do a 'kmer sweep' - multiple assemblies at different kmer lengths - to determine which is best for your data; see the sketch below.
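
A minimal sketch of such a sweep, assuming Velvet is on the PATH and the reads are in a single interleaved FASTQ file (the file name and kmer values here are placeholders):

# Hypothetical kmer sweep: one Velvet assembly per (odd) kmer length.
for k in 21 25 29 33; do
    velveth asm_k${k} ${k} -fastq -shortPaired reads_interleaved.fq
    velvetg asm_k${k} -exp_cov auto
done
# Compare the resulting asm_k*/contigs.fa files (e.g. by N50) and keep the best.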

How much data do you have, what kind of genome is it, and what is the expected genome size? What's the origin - isolate, metagenomic, amplified single cell, something else? Also, how did you perform QC? You may have been overly aggressive in trimming; I find Q6 to Q10 works well, but it depends on the trimmer - most are not optimal, and the cutoff means different things to different trimmers. You can also try normalization and error correction, which help in some cases, depending on the type of data. There are lots of options, but we need more information.
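
For what it's worth, a gentler trim with BBDuk from the same package might look like this (a sketch; the file names are placeholders and the Q8/50 bp cutoffs are only examples, not a recommendation for your data):

bbduk.sh in=reads.fq out=trimmed.fq qtrim=rl trimq=8 minlen=50

Here qtrim=rl trims both read ends to Q8, and minlen=50 discards anything that ends up shorter than 50 bp, avoiding the 10 bp cutoff problem described above.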

03-19-2014, 03:23 AM   #3
mastal

You can get Velvet to omit very short contigs from the final contigs.fa file by setting velvetg's -min_contig_lgth parameter.
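
For example (a sketch; the assembly directory name and the 200 bp threshold are placeholders):

velvetg assembly_dir -min_contig_lgth 200

Only contigs of at least 200 bp will then be written to contigs.fa.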

03-19-2014, 04:38 AM   #4
morning latte

Thanks, all, for the suggestions.

Brian, to follow up on some of your questions: I have environmental virus metagenomes. The average virus genome size is thought to be about 2-3 kb, and I have about 10 million reads per sample. I used a kmer range of 19 to 35 to find the best one.

Any other ideas on what might cause the low mapping percentage? Thanks.

03-19-2014, 05:00 AM   #5
mastal

You need to try longer kmers to see if you can get Velvet to give you longer contigs.

03-19-2014, 05:02 AM   #6
morning latte

Thanks, mastal!

When I tried kmers of 19-35 bp, a kmer of 25 looked best, so I did not try anything longer. I will try longer kmers now.

03-19-2014, 10:25 AM   #7
Brian Bushnell

Quote:
Originally Posted by morning latte
Bad assemblies (incomplete or highly fragmented) cause low mapping percentages. For bacterial metagenomes (I have no experience with viral ones), which often have an exponential distribution of organism abundance, we have found subsampling or normalization to be helpful: first you remove the low-frequency reads and assemble the dominant strains, then assemble the reads that don't map to the first assembly (see the sketch below). Velvet really likes even coverage of roughly 40x, and metagenomes are not like that.
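
A rough sketch of that two-pass idea, using BBMap to pull out the reads that missed the first assembly (file names are placeholders; this is an outline, not a prescribed pipeline):

# Pass 1: assemble the dominant strains (e.g. after normalization), then
# collect the reads that do not map to that first assembly.
bbmap.sh ref=assembly1.fa in=reads.fq outu=unmapped.fq -Xmx29g
# Pass 2: assemble unmapped.fq separately and combine the contig sets.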

If your metagenome is too diverse, you may simply not have enough coverage for assembly. I suggest you post the kmer frequency histogram of your data. You can get it like this (with the bbmap package):

khist.sh in=reads.fq hist=histogram.txt -Xmx29g

...where "29g" should be about 85% of your physical memory. That package also includes normalization and error-correction tools.

03-19-2014, 12:39 PM   #8
morning latte

Thank you, Brian Bushnell.

I may be too new at this to fully understand what you asked about the kmer frequency histogram. I used the quality-trimmed file to generate the histogram below; please let me know if I misunderstood anything. Thank you.
Attached: hist.png

03-19-2014, 01:05 PM   #9
kopi-o

I wouldn't use Velvet for metagenomic samples. I'd recommend IDBA-UD or Ray instead, or perhaps SPAdes, even though that was not developed with metagenomics in mind. You might want to give them a shot.

03-19-2014, 02:18 PM   #10
AndrewRGross

I am having similar problems with Bowtie; however, I used IDBA.

I have an IDBA-UD assembly with a total length of 600 Mbp, generated from 284 M paired reads of ~140 bp each.

I assume this means I had ~79,500 Mbp of reads assembling to produce 600 Mbp, so I should have an average coverage of ~130x.

I noticed that my calculated coverage over the contigs was low, and then confirmed visually, using samtools' tview command, that parts of my assembly have no coverage at all.

I think this is important: each line of a SAM file is a read (ignoring the header). Whether they align or not, I would think I should have the same number of lines in my SAM file as I have reads, right? My input contained 568 M reads, yet I only have 142 M lines in my SAM file. Of those 142 M reads, around half (75 M) don't map to the bacterial metagenome.

Can anyone tell me why my SAM file isn't bigger? And why are so many reads not aligning to the metagenome anywhere?
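
For reference, a quick way to sanity-check those counts (a sketch; aln.sam/aln.bam are placeholder names, and note that some aligners can be configured to omit unaligned reads from the output, which also shrinks the file):

# Count alignment records, excluding header lines.
grep -vc '^@' aln.sam
# Or get mapped/unmapped totals directly:
samtools flagstat aln.bam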

03-19-2014, 04:19 PM   #11
Brian Bushnell

Quote:
Originally Posted by morning latte
That's exactly what I was looking for, but it's more useful if you plot it on a log-log scale and take the X axis out to at least 10000.

A "good" graph will look more like the attachment. It shows a strong peak at around 200, meaning on average it has 200x coverage of the genome. The stuff below 16x is (in that case) all error kmers. The assemble-able stuff should be at least 10x. So if you post the a larger, log-log version of the histogram, I can better diagnose what the problem is. If that IS your complete histogram, then the problem appears to be too little coverage, but it's hard to be sure without using a log-log scale.
Attached: ecoli.jpg
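
One quick way to get such a log-log plot from the khist.sh output (a sketch, assuming the histogram is a plain two-column depth/count text file; gnuplot is just one option):

gnuplot -e "set logscale xy; set terminal png; set output 'hist_loglog.png'; plot 'histogram.txt' using 1:2 with lines"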

03-19-2014, 08:27 PM   #12
Brian Bushnell

Quote:
Originally Posted by AndrewRGross
Try a different aligner? Bowtie is old and not designed for such long reads, and certainly not designed for metagenomics (in which there are many short contigs). I can't explain why the sam file doesn't contain all the reads, though, unless it is truncated because (for example) bowtie crashed.
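
If you try BBMap, a minimal invocation might look like this (a sketch; the file names and memory setting are placeholders):

bbmap.sh ref=contigs.fa in=reads.fq out=mapped.sam -Xmx29g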

03-20-2014, 04:34 AM   #13
morning latte

Thank you, Brian!

I changed the chart according to your suggestion and attached it. Please take a look. Thanks again.
Attached: hist2.png

03-20-2014, 09:06 AM   #14
Brian Bushnell

Well, I don't quite understand why that graph has two distinct series for the kmer count, seemingly alternating between odd and even depths. But if that's an accurate kmer histogram, then you simply don't have enough coverage to assemble. You have very few kmers at depth 30+, which is really what Velvet needs for a good assembly, while most of your data is unassembleable. So I would suggest requesting more data; you will never get a good assembly out of what you have.

If you can't get any more data, then you might get better results with no quality trimming, or trimming only to Q5 (since you'll have more data), plus error correction and throwing away reads with median kmer counts below ~5, which will remove a lot of noise:

bbnorm.sh in=reads.fq out=normalized.fq trimq=5 qtrim=rl target=50 mindepth=5 ecc=t -Xmx29g

That's available in the BBMap package.

03-20-2014, 10:51 AM   #15
morning latte

Thank you, Brian!

I just realized that I mistakenly used normalized reads to generate the kmer frequency histogram; I am sorry about that. I generated another histogram from the raw reads, which have not even been quality-controlled. Can you take a look at it?
Attached: hist3.png

03-20-2014, 06:45 PM   #16
Brian Bushnell

The main peak is still below 10x, which will make the assembly very fragmented, but this data looks a lot better than the normalized data. I don't know which normalizer you used, or with what settings, but the output came out very strange; I would expect the raw reads to give a better assembly than the normalized ones.

03-21-2014, 12:27 PM   #17
morning latte

Thank you, Brian. I appreciate all your advice.

One last question: I am still not clear on how to interpret the kmer frequency histogram. Could you explain it a bit more? Thank you.

03-21-2014, 03:04 PM   #18
Brian Bushnell

The kmer frequency histogram tells you the depth of coverage of the target genome. This is more complicated in a metagenome, but if you had a single bacterial genome with no repeats and very even coverage, you would get a single peak in your graph. With repeat content, there will be additional peaks at higher depths. You get the best assembly when this primary peak is narrow and above some minimum depth; Velvet does well when the primary peak is around 30x to 40x coverage (that's the X axis).

If most of your kmers occur fewer than 10 times or so, you can't assemble them very well.
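
As a concrete illustration, assuming the two-column depth/count format that khist.sh writes, something like this estimates the fraction of distinct kmers below 10x (a quick sketch, not part of any package):

# Columns assumed: 1 = depth, 2 = number of distinct kmers at that depth.
awk '!/^#/ { tot += $2; if ($1 < 10) low += $2 } END { print low / tot }' histogram.txt

If that fraction is large, most of the genome is too shallowly covered to assemble well.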

03-23-2014, 08:28 AM   #19
AndrewRGross

First off, I discovered that I'd counted my initial reads wrong, so my .sam file has the right number of reads after all.

Instead of Bowtie, I tried mapping with BBMap. On my first attempt, I reduced the percentage of unmapped reads from 50% to 29%. I'm planning to run BBMap again with a more sensitive set of input parameters.

Annoyingly, this recent sam file produced a bam file which samtools' index command seems unable to index; it keeps stopping and printing "Killed".

03-23-2014, 08:49 AM   #20
Brian Bushnell

Quote:
Originally Posted by AndrewRGross
The bam format does not really allow files with more than 2 GB of header data. I don't know if that's the problem in your case, but if you do have millions of contigs, it's a possibility. You can often get around it by filtering out the contigs shorter than a certain length, which are often kind of useless anyway (see the sketch below). I think the latest version of samtools has been fixed to allow larger headers, but there's no assurance that downstream tools will be able to process the bam.
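
One way to do that filtering, using the reformat.sh script from the BBMap package (a sketch; the 500 bp threshold is just an example):

reformat.sh in=contigs.fa out=filtered.fa minlength=500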

If you want to run BBMap with greater sensitivity, you can use the "slow" flag and reduce the "minratio". The default is "minratio=0.56", which allows mappings with scores down to 56% of the maximum. Also, if you want to map really low-quality reads, you can set "qtrim=rl trimq=10 untrim"; this will quality-trim reads to Q10 before mapping, then undo the trimming after mapping (trimmed bases will be soft-clipped). The "local" flag will also slightly improve mapping rates by doing local rather than global alignments. A combined example is sketched below.
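
Putting those flags together, a more sensitive run might look like this (a sketch; the file names, minratio value, and memory setting are placeholders to adapt):

bbmap.sh ref=contigs.fa in=reads.fq out=mapped.sam slow minratio=0.45 local qtrim=rl trimq=10 untrim -Xmx29g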