SEQanswers > Bioinformatics > mapping 454 reads to a reference genome

query 02-19-2009 02:00 PM

mapping 454 reads to a reference genome
What is the best tool available for mapping 454 reads to a reference genome? What method does the GS Reference Mapper (the analysis tool that comes with the 454 platform) use, and does it do a decent job of mapping and identifying variants?

dan 06-19-2009 01:20 AM

I'd like to know the answer to the exact same question. Anyone got any benchmarks for mapping 454 data? References?

Am I right in thinking that only the GS Reference Mapper uses the correct error model for the 454 data? Does anyone know how they actually do that? i.e. How is the alignment done in flowspace?

Actually, from the manual: "read to reference alignments are made in nucleotide space, the consensus basecalling and quality value determination for contigs are performed in flowspace."

So the questions are:

How is the consensus sequence created?
(How) could the mapping be improved by using flowspace, and does any software do that?

454investigator 07-07-2009 06:28 PM

Mapping 454 reads to a reference genome
I am also trying to find an answer to this question. There are many assemblers out there, but only one mapper?

Tuxido 07-07-2009 11:21 PM

We've been using gsMapper to map human DNA to the hg18 reference genome, mainly because of its ease of use and the fact that most of the available mappers are for short reads. All I can say is that we got good results from the mapping. We sometimes ran a BLAST search on specific reads as a check, and these all checked out.
We compared it briefly with the mapper from CLC bio and saw no major differences, but we didn't look into it that much.

In any run we can usually map about 99% of the bases.

Roald 07-08-2009 01:29 AM

454 mapping benchmark
Some benchmarks of how the CLC bio assembler maps 454 reads can be found in the white paper at


Roald Forsberg

Disclaimer: I work at CLC bio

chrisbala 08-17-2010 10:57 AM

454 mapping Lastz
I've used Galaxy a little bit for read mapping. (search around here for more explanation about what Galaxy is, if you don't know).

But within Galaxy they implement an algorithm called Lastz for mapping 454 reads.

I don't know much about Lastz except that it was not particularly useful for my data, which is mRNA (splicing results in alignment gaps..).

aleferna 08-17-2010 12:33 PM

I've tested a few algorithms, and the best one so far for long 454 reads has been bwa bwasw (the long-read aligner). I would suggest using -z 10 or -z 100 if you have a big enough computer. There is another one that is highly recommended (by the author of BWA) called Novoalign, but I haven't tried it. BWA is easy to run, it's fast, and it does a pretty good job in terms of accuracy and specificity. It will NOT map reads shorter than 30bp; for those you have to use the bwa short-read aligner. Since my 454 run had reads of many sizes, and in particular short reads, I had to use both.

lh3 08-17-2010 01:53 PM

On the downside, bwasw does not use paired-end information and does not report more than one hit. The 1000 Genomes Project is using ssaha2, as speed is not a big concern given the relatively few 454 reads in the project. I have not used Novoalign for 454; its 454 mapping is a new function added not long ago, but I trust the Novoalign developer to deliver a good product.


Thanks for your patches. I have not got time to commit them. I will definitely do this when I want to modify BWA again.

aleferna 08-17-2010 09:45 PM

The first time I ran BWA with the long-read aligner I didn't realize that there were separate short/long options, and since I have both kinds of reads in my library I was very disappointed with BWA. I started testing algorithm after algorithm and finally reviewed BWA again. This time I wrote a small script that simply joins two SAM files, one from the short-read aligner and one from the long-read aligner. It chooses the alignment from the short aligner if the read cannot be found in the long aligner's output; this was the winning combination.
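A minimal sketch of what such a join might look like (this is not the poster's actual script; the file layout and the "keep the long aligner's record unless it is unmapped" rule are assumptions based on the description above):

```python
# Merge SAM output from the long-read aligner (bwasw) and the
# short-read aligner, preferring the long aligner's mapped records
# and falling back to the short aligner's where it has none.

def read_sam(path):
    """Return (header_lines, {read_name: first_record_line})."""
    header, records = [], {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("@"):
                header.append(line)
            else:
                name = line.split("\t", 1)[0]
                records.setdefault(name, line)
    return header, records

def is_mapped(sam_line):
    """SAM FLAG (column 2) bit 0x4 set means the read is unmapped."""
    return not int(sam_line.split("\t")[1]) & 0x4

def join_sam(long_path, short_path, out_path):
    header, long_recs = read_sam(long_path)
    _, short_recs = read_sam(short_path)
    with open(out_path, "w") as out:
        out.writelines(header)
        for name in long_recs.keys() | short_recs.keys():
            rec = long_recs.get(name)
            if rec is None or not is_mapped(rec):
                rec = short_recs.get(name, rec)  # fall back to short aligner
            if rec is not None:
                out.write(rec)
```

A real version would also have to reconcile the two headers and handle multi-line (paired or supplementary) records per read.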

I've mentioned this chart in another thread, but here you can see that BWA is the only one that can cover the full range of read sizes in 454 datasets (or in 100bp Solexa data after you remove the paired-end adapters!)

Moreover, I know using -z 100 seems a bit of overkill, but with 454 data and a decent computer BWA will take just a few minutes, and I measured -z 1, 10, 25, 50, 100, 250, and even 500. -z 100 seems to be the peak; beyond that I cannot squeeze any more specificity out of the algorithm, but you do see a change from -z 10 to -z 100.

maubp 08-18-2010 02:10 AM

Has anyone compared MIRA3 in mapping mode? That can map 454 reads onto a reference.

lh3 08-18-2010 05:23 AM


Do you simulate indels? When I tried blat for the bwa-sw paper, it was far less accurate on 200bp simulated reads with a 5% error rate. The table in the paper has -fastMap switched on, but the default is not much better. I do not want to claim anything bad about blat if the fault is mine!
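For context, "simulating indels" here means injecting insertions and deletions, not just substitutions, when generating reads from the reference, so a mapper's hit can be checked against the read's known origin. A minimal sketch of such a simulator (the interface and default rates are made up, not taken from the paper):

```python
import random

def simulate_read(ref, length, sub_rate=0.05, indel_rate=0.01,
                  rng=random):
    """Sample a read from a random position in `ref` and inject
    substitutions, insertions, and deletions at the given per-base
    rates.  Returns (read_sequence, true_start) so a mapper's
    reported position can be compared with the known origin."""
    start = rng.randrange(len(ref) - length + 1)
    read, i = [], start
    while len(read) < length and i < len(ref):
        r = rng.random()
        if r < indel_rate / 2:                # deletion: skip a ref base
            i += 1
        elif r < indel_rate:                  # insertion: random extra base
            read.append(rng.choice("ACGT"))
        elif r < indel_rate + sub_rate:       # substitution
            read.append(rng.choice([b for b in "ACGT" if b != ref[i]]))
            i += 1
        else:                                 # match
            read.append(ref[i])
            i += 1
    return "".join(read), start
```

A 454-style simulator would additionally concentrate indels in homopolymer runs, which is the error pattern discussed later in this thread.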

EDIT: This is the table:


EDIT2: I think I might have figured out the difference. BLAT outputs multiple hits. Which hit are you choosing? Do you count BLAT as mapping the read correctly if any of the hits is correct, or only if the top-scoring hit is correct? I use the latter. The script for converting PSL to SAM and calculating the BLAT alignment score (the SAM AS tag):

aleferna 08-18-2010 06:04 AM


No, I didn't simulate indels in my study; that might be the difference. But also, the percentages of deletions and insertions vs. substitutions that I used simulate the characteristics of my own dataset, which has a lot of deletions and insertions actually caused by the 454 homopolymer problem. I think, though, that blat should be as sensitive to a 5bp gap as to a 1bp gap: if you have even a 1bp gap, none of the tiles that overlap it will seed an alignment. Therefore I don't think indels should make that much difference to blat. The other difference is that I use a different mapping threshold for each read size.

I'm just a humble master's student, but if you ask me the -fastMap option is a big, big game changer. If you run with the default options blat is much better at high error rates; at low error rates the results remain pretty much the same. Nevertheless, blat can be as sensitive as you want. You can even use stepSize=1 and tileSize=8 to get amazing accuracy, but of course it took about 30,000 CPU hours to get through one of my libraries (I had to use over 600 nodes for that one). Actually, blat at high sensitivity is what I use to validate BWA.

So I think blat IS more sensitive than BWA if you have a 75-TFlop cluster. I read your BWA paper, and I think you cannot really compare BWA and blat because they operate on very different time scales. What you can say is that BWA gives you 99% accuracy / 90% sensitivity N times faster.

lh3 08-18-2010 06:45 AM

How do you deal with multiple hits from the BLAT output? Would you say blat gives the correct answer as long as one of the hits is correct?

aleferna 08-18-2010 06:56 AM

No, I implemented a MapQ score: BestScore - 2ndBestScore (I don't divide by the total score as you do; this seems to improve accuracy). I then optimized the minimum MapQ score (you use a fixed one of 10, I think) per read length and error rate by calculating the lowest score that assures 99% specificity, and then I report the sensitivity at that threshold.
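The threshold optimization described here can be sketched roughly as follows (a hypothetical reconstruction, not the poster's evaluation code; the input layout is assumed):

```python
def min_mapq_for_target(hits, target=0.99):
    """hits: list of (mapq, is_correct) pairs for simulated reads in
    one read-length/error-rate bin.  Return (threshold, sensitivity):
    the lowest MapQ cutoff at which the fraction of reported hits
    that are correct reaches `target`, plus the fraction of all
    reads still reported at that cutoff.  None if no cutoff works."""
    n = len(hits)
    for threshold in sorted({q for q, _ in hits}):
        kept = [ok for q, ok in hits if q >= threshold]
        if kept and sum(kept) / len(kept) >= target:
            return threshold, len(kept) / n
    return None
```

Raising the cutoff trades sensitivity (reads reported) for specificity (reported reads that are correct), which is why the optimal threshold varies with read length and error rate.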

lh3 08-18-2010 07:06 AM

Which score are you using? Is it the first column in the PSL (the number of matching bases)?

aleferna 08-18-2010 07:12 AM

Nope, I calculate the score as matches - mismatches - gapbases
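Assuming standard PSL columns (matches, misMatches, and the query/target gap-base fields), that score and the S1 - S2 MapQ from the earlier post could be computed along these lines (a sketch, not either poster's code):

```python
def psl_score(fields):
    """matches - mismatches - gapbases for one tab-split PSL line.
    PSL columns: 0 = matches, 1 = misMatches,
    5 = qBaseInsert, 7 = tBaseInsert (gap bases in query/target)."""
    return (int(fields[0]) - int(fields[1])
            - int(fields[5]) - int(fields[7]))

def best_hit_mapq(psl_lines):
    """Score every PSL hit for one read; return (best_score, S1 - S2).
    With a single hit, the second-best score S2 is taken as 0."""
    scores = sorted((psl_score(l.split("\t")) for l in psl_lines),
                    reverse=True)
    second = scores[1] if len(scores) > 1 else 0
    return scores[0], scores[0] - second
```

The read would then be reported only if S1 - S2 clears the per-read-length MapQ threshold discussed above.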

aleferna 08-18-2010 07:13 AM

Also, if I can venture a suggestion: when I implemented the blat MapQ value using S1 - S2, I noticed a big effect on MinMapQ, the MapQ threshold needed to achieve 99% specificity. With longer reads you need a lower threshold, and as the error rate decreases you need a lower threshold as well. I've been wondering whether, instead of using S1 to normalize the value, you could normalize by some combination of the error rate and the read length. I think you can better approximate MapQ by combining these two components rather than trying to summarize them in S1.

aleferna 08-18-2010 07:21 AM

Here's the behavior of a simple Blat MapQ value

lh3 08-18-2010 07:55 AM

I gradually recall the decisions behind the blat parameters. My focus was more on >=500bp reads, and for those reads blat -fastMap is similar to the blat default in accuracy but tens of times faster. However, for the shorter reads you are focusing on, blat default is much more accurate than -fastMap (though still much slower). Your table would largely agree with mine for blat default.

Actually, for 454 I would highly recommend ssaha2. Ssaha2 was designed for mapping sequencing data and calling SNPs from day one and has been thoroughly validated. Blat, although one of the best tools for mapping ESTs, was not initially built for SNP finding and has not been heavily evaluated for it. From what I have heard, blat does not refine the final alignment, which may leave gaps positioned suboptimally and pose problems for indel finding. The default blat mode is also much slower and less accurate than ssaha2. In my view, it is a common mistake to overlook the superiority of ssaha2 for longer reads. The 1000 Genomes Project chooses every program for a reason.

aleferna 08-18-2010 08:46 AM


Sorry if I'm spamming you; I don't understand how the private messages work here. Send me a message to afer at and I can send you the script to join the two BWA files if you still want to use bwa.
