Similar Threads:

Thread | Thread Starter | Forum | Replies | Last Post
--- | --- | --- | --- | ---
Bowtie2: partial mapping of assembled contigs? | MartinH | Bioinformatics | 8 | 11-04-2013 11:33 AM
RNA-Seq: Low mapping percentage of pair-end reads (length 75bp) | wilson90 | Bioinformatics | 6 | 03-21-2013 09:31 AM
Low mapping percentage with TopHat2 | DerSeb | RNA Sequencing | 3 | 06-05-2012 06:35 AM
the low percentage of mapped reads | wangli | RNA Sequencing | 2 | 04-04-2012 06:56 AM
low percentage of reads mapped | rahilsethi | SOLiD | 3 | 09-13-2010 07:01 AM
#1
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Hello experts,

I posted a similar question last year but never figured out the problem, and searching old posts didn't turn up an answer either. Any suggestion or idea would be greatly appreciated.

I have Illumina HiSeq 100 bp PE reads (library insert size 200 bp). After read QC, I assembled the reads with Velvet; the average contig size is about 250 bp. When I mapped the reads back onto the assembled contigs with Bowtie, fewer than 10% of the reads mapped. I also noticed that the contigs.fa file from the Velvet assembly contains many very short contigs of 40-60 bp. During QC I used a read-length cutoff of 10 bp, meaning I kept every read longer than 10 bp. Could these very short reads have been used in the assembly and generated odd contigs, resulting in the low mapping percentage? Might that also explain the short average contig length?
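For reference, a minimal sketch of the mapping step described above, assuming Bowtie 1; all file and index names (contigs.fa, reads_1.fq, reads_2.fq, contigs_index) are illustrative, not the poster's actual files:

```shell
# Build an index from the Velvet contigs, then map the read pairs back
# (sketch; assumes Bowtie 1 on PATH, with -S requesting SAM output)
bowtie-build contigs.fa contigs_index
bowtie -S -p 4 contigs_index -1 reads_1.fq -2 reads_2.fq mapped.sam
```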
#2
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
Bowtie does not appear to map reads off the ends of contigs, so with short contigs a large percentage of reads will fail to map simply because they are not fully contained within any contig. BBMap will map off the ends, though.

The short contigs probably did not come from short reads. Velvet assembles from kmers, not reads; reads of length 10 are shorter than the kmer length you used (I have no idea what it was, but I'm sure it was longer than 10), so they would have been ignored. It's generally wise to do a "kmer sweep" (multiple assemblies at different kmer lengths) to determine which is best for your data.

How much data do you have, what kind of genome is it, and what is the expected genome size? What's the origin: isolate, metagenomic, amplified single cell, something else?

Also, how did you perform QC? You may have been overly aggressive in trimming. I find Q6 to Q10 works well, but it depends on the trimmer; most are not optimal, and the cutoff means different things to different trimmers. You can also try normalization and error correction, which help in some cases, depending on the type of data. There are lots of options, but we need more information.
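The kmer sweep suggested above can be sketched as a simple loop. This is an illustrative sketch, not the poster's commands: it assumes velveth/velvetg on PATH, interleaved paired-end reads in reads.fq, and an arbitrary odd-k range:

```shell
# Kmer sweep: assemble at several odd kmer lengths, then summarize each
# assembly so the best k can be picked (all names/values illustrative)
for k in 21 25 29 33 37 41; do
    velveth "asm_k$k" "$k" -fastq -shortPaired reads.fq
    velvetg "asm_k$k" -exp_cov auto -cov_cutoff auto
    # contig count and total length for this k, from the contigs.fa it wrote
    awk -v k="$k" '/^>/ {n++; next} {len += length($0)}
        END {printf "k=%s contigs=%d total=%d bp\n", k, n, len}' \
        "asm_k$k/contigs.fa"
done
```

Contig count and total length are crude proxies; N50 or mapping rate back to the assembly are better criteria for choosing k.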
#3
Senior Member · Location: uk · Join Date: Mar 2009 · Posts: 667
You can get Velvet to omit very short contigs from the final contigs.fa file by setting velvetg's -min_contig_lgth parameter (note Velvet's abbreviated spelling of the flag).
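A sketch of both routes: re-running velvetg with the length cutoff, or filtering an existing contigs.fa after the fact with awk. The 100 bp threshold (matching the read length in this thread) and all file names are illustrative:

```shell
# 1) At assembly time, via velvetg itself:
#      velvetg velvet_out/ -min_contig_lgth 100
#
# 2) After the fact: keep only FASTA records whose sequence is >= min bp
awk -v min=100 '
    /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
           hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END { if (seq != "" && length(seq) >= min) print hdr "\n" seq }' \
    velvet_out/contigs.fa > contigs_min100.fa
```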
#4
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Thanks, all, for the suggestions.

Brian, to follow up on your questions: these are environmental virus metagenomes. The average virus genome size is thought to be about 2-3 kb, and I have about 10 million reads per sample. I tried a kmer range of 19 to 35 to find the best value. Any other ideas on what might cause the low mapping percentage? Thanks.
#5
Senior Member · Location: uk · Join Date: Mar 2009 · Posts: 667
You need to try longer kmers to see whether you can get Velvet to give you longer contigs.
#6
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Thanks, mastal!

When I tried kmers of 19-35 bp, k=25 looked best, so I didn't try any longer kmer lengths. I'll try longer kmers now.
#7
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
If your metagenome is too diverse, you may simply not have enough coverage for assembly. I suggest you post the kmer frequency histogram of your data. You can generate it like this (with the BBMap package):

```shell
khist.sh in=reads.fq hist=histogram.txt -Xmx29g
```

...where "29g" should be about 85% of your physical memory. That package also includes normalization and error-correction tools.
#8
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Thank you, Brian Bushnell.

I may be too new to fully understand what you asked about the kmer frequency histogram. I used the quality-trimmed file to generate the histogram below; please let me know if I misunderstood anything. Thank you.
#9
Senior Member · Location: Stockholm, Sweden · Join Date: Feb 2008 · Posts: 319
I wouldn't use Velvet for metagenomic samples. I'd recommend IDBA-UD or Ray instead, and perhaps SPAdes too, even though it was not developed with metagenomics in mind. You might want to give them a shot.
#10
Member · Location: Los Angeles · Join Date: Sep 2013 · Posts: 16
I'm having similar problems with Bowtie, though I used IDBA.

I have an IDBA-UD assembly with a total length of 600 Mbp, generated from 284 M read pairs of ~140 bp each. That should be ~79,500 Mbp of sequence assembling into 600 Mbp, for an average coverage of ~130x. Yet my calculated coverage over the contigs was low, and using samtools' tview command I confirmed visually that parts of the assembly have no coverage at all.

I think this is important: each line of a SAM file (ignoring the header) is a read. Whether reads align or not, I'd expect the same number of lines in the SAM file as input reads, right? My input contained 568 M reads, yet my SAM file has only 142 M lines, and of those 142 M reads, about half (75 M) don't map to the bacterial metagenome. Can anyone tell me why my SAM file isn't bigger, and why so many reads aren't aligning anywhere?
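As a sanity check, the expected-coverage arithmetic above, plus one way to count alignment records (aln.bam is an assumed file name): a SAM file can hold more records than input reads, because secondary and supplementary alignments each get their own line, or fewer, if the mapper was told not to write unmapped reads.

```shell
# Expected average coverage: total sequenced bases / assembly length
awk 'BEGIN { reads = 568e6; readlen = 140; asm = 600e6
             printf "expected coverage: ~%.0fx\n", reads * readlen / asm }'

# Compare record types vs. distinct reads in the BAM:
samtools flagstat aln.bam            # full accounting of record categories
samtools view -c -F 0x900 aln.bam    # primary records only: one per read
```

The -F 0x900 filter excludes secondary (0x100) and supplementary (0x800) alignments, so that count should match the number of reads the mapper actually saw.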
#11
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
A "good" graph will look more like the attachment: it shows a strong peak at around 200, meaning the genome has on average 200x coverage. The material below 16x is (in that case) all error kmers; the assemble-able material should be at least 10x. So if you post a larger, log-log version of the histogram, I can better diagnose the problem. If that IS your complete histogram, then the problem appears to be too little coverage, but it's hard to be sure without a log-log scale.
#13
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Thank you, Brian!

I changed the chart according to your suggestion and attached it. Please take a look. Thanks again.
#14
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
Well, I don't quite understand why that graph has two distinct series for kmer count, seemingly alternating at odd and even depths. But if it is an accurate kmer histogram, then you simply don't have enough coverage to assemble. You have very few kmers at depth 30+, which is really what Velvet needs for a good assembly, and most of your data is unassembleable. I would suggest requesting more data; you will never get a good assembly out of what you have.

If you can't get any more data, you might get better results with no quality trimming, or trimming only to Q5 (since you'll keep more data), plus error correction and discarding reads with median kmer counts below ~5, which removes a lot of noise:

```shell
bbnorm.sh in=reads.fq out=normalized.fq trimq=5 qtrim=rl target=50 mindepth=5 ecc=t -Xmx29g
```

That's available in the BBMap package.
#15
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Thank you, Brian!

I just realized that I mistakenly used normalized reads to generate the kmer frequency histogram; I'm sorry about that. I've generated another histogram from the raw reads, before any quality control. Can you take a look at it?
#16
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
The main peak is still below 10x, which will make the assembly very fragmented, but this data looks a lot better than the normalized data. I don't know which normalizer you used, or with what settings, but its output came out very strange; I would expect the raw reads to give a better assembly than the normalized ones.
#17
Member · Location: MI · Join Date: Jun 2013 · Posts: 91
Thank you, Brian. I appreciate all your advice.

One last question: I'm still not clear on how to interpret the kmer frequency histogram. Could you explain that a bit more? Thank you.
#18
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
The kmer frequency histogram tells you the depth of coverage of the target genome. This is more complicated for a metagenome, but for a single bacterial genome with no repeats and very even coverage, you would get a single peak in the graph; with repeat content, there will be additional peaks at higher depths. You get the best assembly when the primary peak is narrow and above some minimum depth. Velvet does well when the primary peak is around 30x to 40x coverage (that's the X axis).

If most of your kmers occur fewer than about 10 times, you can't assemble them very well.
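As a rough illustration of reading such a histogram, assuming khist.sh-style output with two whitespace-separated columns (depth, kmer count); the depth-5 floor used to skip the error-kmer region is an arbitrary illustrative choice, not a rule from the post:

```shell
# Report the depth of the primary (largest) peak above the error region
awk '$1 >= 5 && $2 > max { max = $2; peak = $1 }
     END { print "primary peak at depth " peak "x"
           if (peak < 10) print "below ~10x: likely too sparse to assemble well" }' \
    histogram.txt
```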
#19
Member · Location: Los Angeles · Join Date: Sep 2013 · Posts: 16
First off, I discovered that I had counted my initial reads wrong, so my SAM file has the right number of reads after all.

Instead of Bowtie, I tried mapping with BBMap; on the first attempt, the percentage of unmapped reads dropped from 50% to 29%. I'm planning to run BBMap again with a more sensitive set of input parameters. Annoyingly, samtools' index command seems unable to index the BAM file produced from this latest SAM file; it keeps stopping and printing "Killed".
#20
Super Moderator · Location: Walnut Creek, CA · Join Date: Jan 2014 · Posts: 2,707
If you want to run BBMap with greater sensitivity, you can use the "slow" flag and reduce "minratio". The default is "minratio=0.56", which allows mappings with scores down to 56% of the maximum. Also, if you want to map really low-quality reads, you can set "qtrim=rl trimq=10 untrim"; this quality-trims reads to Q10 before mapping, then undoes the trimming after mapping (trimmed bases will be soft-clipped). The "local" flag will also slightly improve mapping rates by doing local rather than global alignments.
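Putting those flags together into one invocation might look like this. This is a sketch: the file names, the lowered minratio value of 0.45, and the memory setting are illustrative assumptions, not recommendations from the post:

```shell
# Higher-sensitivity BBMap run combining the flags described above
bbmap.sh in=reads.fq ref=contigs.fa out=mapped.sam \
    slow minratio=0.45 local qtrim=rl trimq=10 untrim -Xmx29g
```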