SEQanswers

Old 02-19-2009, 02:00 PM   #1
query
Junior Member
 
Location: boston

Join Date: Feb 2009
Posts: 3
mapping 454 reads to a reference genome

What is the best tool available for mapping 454 reads to a reference genome? What method does the GS Reference Mapper (the analysis tool that comes with the 454 instrument) use, and does it do a decent job of mapping and identifying variants?
Old 06-19-2009, 01:20 AM   #2
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 266

I'd like to know the answer to the exact same question. Anyone got any benchmarks for mapping 454 data? References?


Am I right in thinking that only the GS Reference Mapper uses the correct error model for the 454 data? Does anyone know how they actually do that? i.e. How is the alignment done in flowspace?

Actually, from the manual: "read to reference alignments are made in nucleotide space, the consensus basecalling and quality value determination for contigs are performed in flowspace."

So the questions are:

How is the consensus sequence created?
(How) could the mapping be improved by using flowspace, and does any software do that?
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
Old 07-07-2009, 06:28 PM   #3
454investigator
Junior Member
 
Location: FL

Join Date: Jul 2009
Posts: 1
Mapping 454 reads to a reference genome

I am also trying to find an answer to this question. There are many assemblers out there, but only one mapper?
Old 07-07-2009, 11:21 PM   #4
Tuxido
Member
 
Location: Nijmegen, Netherlands

Join Date: Jun 2009
Posts: 22

We've been using gsMapper to map human DNA reads to the hg18 reference genome, mainly because of its ease of use and because most of the mappers available are designed for short reads. All I can say is that we got good results from the mapping. We sometimes ran a BLAST search on specific reads as a check, and these all checked out.
We compared it briefly with the mapper from CLC bio and saw no major differences, but we didn't look into it very deeply.

In a typical run we can map about 99% of the bases.
Old 07-08-2009, 01:29 AM   #5
Roald
Director at CLC bio
 
Location: Denmark

Join Date: Aug 2008
Posts: 26
454 mapping benchmark

Some benchmarks of how the CLC bio assembler maps 454 reads can be found in the white paper at http://www.clcbio.com/files/whitepap...C_NGS_Cell.pdf

Cheers

Roald Forsberg

Disclaimer: I work at CLC bio
Old 08-17-2010, 10:57 AM   #6
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
454 mapping with LASTZ

I've used Galaxy a little bit for read mapping (search around here for more explanation of what Galaxy is, if you don't know it).

http://main.g2.bx.psu.edu/

Within Galaxy, 454 reads are mapped with an algorithm called LASTZ.

I don't know much about LASTZ except that it was not particularly useful for my data, which is mRNA (splicing produces large alignment gaps).
Old 08-17-2010, 12:33 PM   #7
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

I've tested a few algorithms, and the best one so far for long 454 reads has been BWA-SW (bwa bwasw, the long-read aligner). I would suggest using -z 10, or even -z 100 if you have a big enough computer. There is another aligner that is highly recommended (by the author of BWA) called Novoalign, but I haven't tried that one. BWA is easy to run, it's fast, and it does a pretty good job in terms of accuracy and specificity. It will NOT map reads shorter than about 30 bp; for those you have to use the BWA short-read aligner. Since my 454 run had reads of many sizes, and in particular short reads, I had to use both.
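
For what it's worth, a minimal sketch of that two-step workflow, wrapped in Python; the reference and read file names are placeholders, -z 100 is just the setting discussed above, and the exact flags may differ between BWA versions.

Code:
import subprocess

REF = "ref.fa"                   # placeholder reference FASTA
LONG_READS = "reads_long.fq"     # reads >= ~30 bp
SHORT_READS = "reads_short.fq"   # reads < ~30 bp

def run(cmd, stdout=None):
    # Run a command and fail loudly if it returns non-zero.
    subprocess.run(cmd, stdout=stdout, check=True)

run(["bwa", "index", REF])  # build the index once

# Long reads: BWA-SW with a wider z-best setting, as suggested above.
with open("long.sam", "w") as out:
    run(["bwa", "bwasw", "-z", "100", REF, LONG_READS], stdout=out)

# Short reads: the standard short-read aligner (aln + samse).
with open("short.sai", "wb") as out:
    run(["bwa", "aln", REF, SHORT_READS], stdout=out)
with open("short.sam", "w") as out:
    run(["bwa", "samse", REF, "short.sai", SHORT_READS], stdout=out)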
Old 08-17-2010, 01:53 PM   #8
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

On the downside, bwasw does not use paired-end information and does not report more than one hit. The 1000 Genomes Project is using ssaha2, as speed is not a big concern given the relatively small number of 454 reads in the project. I have not used Novoalign for 454; its 454 mapping is a new function added not long ago, but I trust the Novoalign developer to deliver a good product.

@aleferna

Thanks for your patches. I have not had time to commit them yet, but I will definitely do so the next time I modify BWA.

Last edited by lh3; 08-17-2010 at 02:06 PM.
Old 08-17-2010, 09:45 PM   #9
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

The first time I ran BWA with the long-read aligner I didn't realize there was a separate short/long option, and since my library contains both kinds of reads I was very disappointed with BWA. I started testing algorithm after algorithm and eventually came back to BWA. This time I wrote a small script that simply joins two SAM files, one from the short-read aligner and one from the long-read aligner: it takes the alignment from the short-read aligner whenever the long-read aligner could not find one. That was the winning combination.
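
A minimal sketch of that kind of merge (not my actual script), in Python; it assumes plain single-end SAM files called long.sam and short.sam with one record per read, and treats "not found" as either absent or flagged unmapped (0x4). Those details are assumptions.

Code:
import sys

UNMAPPED = 0x4  # SAM FLAG bit for "segment unmapped"

def load_sam(path):
    # Return (header_lines, {read_name: record_line}); assumes one record per read.
    header, records = [], {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("@"):
                header.append(line)
            else:
                records[line.split("\t", 1)[0]] = line
    return header, records

def is_mapped(record):
    return not (int(record.split("\t")[1]) & UNMAPPED)

long_header, long_recs = load_sam("long.sam")   # bwa bwasw output (placeholder name)
_, short_recs = load_sam("short.sam")           # bwa aln/samse output (placeholder name)

sys.stdout.writelines(long_header)
for name in sorted(set(long_recs) | set(short_recs)):
    if name in long_recs and is_mapped(long_recs[name]):
        sys.stdout.write(long_recs[name])        # keep the long-aligner hit
    elif name in short_recs and is_mapped(short_recs[name]):
        sys.stdout.write(short_recs[name])       # fall back to the short aligner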

I've mentioned this chart in another thread, but here you can see that BWA is the only tool that can cover the full range of read sizes in 454 datasets (or in 100 bp Solexa data after you remove the paired-end adapters!):

http://www.nada.kth.se/~afer/benchmark.jpeg

Moreover, I know using -z 100 seems like overkill, but with 454 data and a decent computer BWA takes only a few minutes, and I measured -z 1, 10, 25, 50, 100, 250 and even 500. -z 100 seems to be the peak: beyond that I cannot squeeze any more specificity out of the algorithm, but you do see an improvement from -z 10 to -z 100.

Last edited by aleferna; 08-17-2010 at 09:52 PM.
Old 08-18-2010, 02:10 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

Has anyone compared MIRA3 in mapping mode? That can map 454 reads onto a reference.
Old 08-18-2010, 05:23 AM   #11
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

@aleferna

Do you simulate indels? When I tried BLAT for the BWA-SW paper, it was far less accurate on 200 bp simulated reads with a 5% error rate. The table in the paper has -fastMap switched on, but the default is not much better. I do not want to claim anything bad about BLAT if that is my fault!

EDIT: This is the table:

http://bioinformatics.oxfordjournals...ll/26/5/589/T1

EDIT2: I think I might have figured out the difference. BLAT outputs multiple hits; which hit are you choosing? Do you count BLAT as mapping a read correctly if any of its hits is correct, or only if the top-scoring hit is correct? I use the latter. The script for converting PSL to SAM and calculating the BLAT alignment score (the SAM AS tag) is here:

http://samtools.svn.sourceforge.net/...23&view=markup

Last edited by lh3; 08-18-2010 at 05:49 AM.
Old 08-18-2010, 06:04 AM   #12
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

@lh3

No, I didn't simulate indels in my study; that might be the difference. Also, the percentages of deletions, insertions and substitutions I used simulate the characteristics of my own dataset, which has a lot of deletions and insertions actually caused by the 454 homopolymer problem. I think, though, that BLAT should be as sensitive to a 5 bp gap as to a 1 bp gap: if there is even a 1 bp gap, none of the tiles that overlap it will spawn an alignment, so I don't think indels should make that much difference for BLAT. The other difference is that I use a different mapping threshold for each read size.

I'm just a humble master's student, but if you ask me the -fastMap option is a big game changer: running with the default options is much better at higher error rates, while at low error rates the results stay pretty much the same. Nevertheless, BLAT can be as sensitive as you want. You can even use stepSize=1 and tileSize=8 to get amazing accuracy, but of course that took about 30,000 CPU hours to get through one of my libraries (I had to use over 600 nodes for that one). Actually, BLAT at high sensitivity is what I use to validate BWA.

So I think BLAT IS more sensitive than BWA, if you have a 75-TFLOP cluster. I read your BWA paper, and I think you cannot really compare BWA and BLAT directly because they operate on very different time scales. What you can say is that BWA gives you 99% accuracy / 90% sensitivity N times faster.
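
For completeness, a minimal sketch of such a high-sensitivity BLAT run, again wrapped in Python; the file names are placeholders and you should expect it to be extremely slow without a cluster.

Code:
import subprocess

# High-sensitivity BLAT run with the settings mentioned above.
subprocess.run(
    ["blat", "-tileSize=8", "-stepSize=1",
     "ref.fa",      # database (reference), placeholder
     "reads.fa",    # query (454 reads), placeholder
     "reads.psl"],  # PSL output
    check=True,
)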

Last edited by aleferna; 08-18-2010 at 06:19 AM.
Old 08-18-2010, 06:45 AM   #13
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

How do you deal with multiple hits from the BLAT output? Would you say blat gives the correct answer as long as one of the hits is correct?
Old 08-18-2010, 06:56 AM   #14
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

No, I implemented a MapQ score: BestScore - 2ndBestScore (I don't divide by the total score like you do; this seems to improve the accuracy). I then optimized the minimum MapQ threshold (you use a fixed one of 10, I think) per read length and error rate, by calculating the lowest score that assures 99% specificity, and then I report the sensitivity at that threshold.
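
A minimal sketch of that calibration step, assuming you have per-read (MapQ, correct?) pairs from a simulation and reading "specificity" as the fraction of reads kept above the cutoff that are correctly mapped; both of those are assumptions here.

Code:
def calibrate(results, target=0.99):
    # results: list of (mapq, is_correct) pairs for mapped simulated reads.
    # Returns (threshold, sensitivity): the lowest MapQ cutoff at which at
    # least `target` of the kept reads are correct, and the fraction of all
    # reads that are both kept and correct at that cutoff.
    total = len(results)
    for t in sorted({q for q, _ in results}):
        kept = [ok for q, ok in results if q >= t]
        correct = sum(kept)
        if kept and correct / len(kept) >= target:
            return t, correct / total
    return None, 0.0  # no cutoff reaches the target

# Toy example with made-up numbers:
# print(calibrate([(3, False), (5, True), (12, True), (20, True), (25, True)]))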
Old 08-18-2010, 07:06 AM   #15
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Which score are you using? Is it the first column in the PSL (the number of matching bases)?
Old 08-18-2010, 07:12 AM   #16
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

Nope, I calculate the score as matches - mismatches - gapbases.
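
A minimal sketch of that score, and the MapQ built from it, computed from BLAT PSL output; I take "gapbases" to mean qBaseInsert + tBaseInsert and assume a headerless PSL file (blat -noHead), so treat those details as assumptions.

Code:
from collections import defaultdict

def psl_scores(path):
    # Score each PSL line as matches - mismatches - gap bases, grouped by read name.
    scores = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            f = line.rstrip("\n").split("\t")
            matches, mismatches = int(f[0]), int(f[1])
            gap_bases = int(f[5]) + int(f[7])  # qBaseInsert + tBaseInsert
            scores[f[9]].append(matches - mismatches - gap_bases)
    return scores

def mapq(hit_scores):
    # BestScore - 2ndBestScore; the second-best score is taken as 0 when there
    # is only a single hit (an assumption of this sketch).
    s = sorted(hit_scores, reverse=True)
    return s[0] - (s[1] if len(s) > 1 else 0)

# for read, s in psl_scores("reads.psl").items():   # "reads.psl" is a placeholder
#     print(read, max(s), mapq(s))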
Old 08-18-2010, 07:13 AM   #17
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

Also, if I can venture a suggestion: when I implemented the BLAT MapQ value as S1 - S2, I noticed a big effect on MinMapQ, the MapQ threshold needed to achieve 99% specificity. With longer reads you need a lower threshold, and as the error rate decreases you need a lower threshold as well. I've been wondering whether, instead of using S1 to normalize the value, you could normalize by some combination of the error rate and the read length. I think you can approximate MapQ better by combining these two components than by trying to summarize them in S1.

Last edited by aleferna; 08-18-2010 at 07:20 AM.
Old 08-18-2010, 07:21 AM   #18
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

Here's the behavior of a simple Blat MapQ value
Attached image: Selection_006.jpg
Old 08-18-2010, 07:55 AM   #19
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

I gradually recall the reasoning behind my choice of BLAT parameters. My focus was more on reads >=500 bp, and for those reads blat -fastMap is similar in accuracy to the blat default but tens of times faster. However, for the shorter reads you are focusing on, blat default is much more accurate than blat -fastMap (though still much slower). Your table would largely agree with mine for blat default.

Actually, for 454 I would highly recommend ssaha2. Ssaha2 was designed for mapping sequencing data and calling SNPs from day one and has been thoroughly validated. BLAT, although one of the best tools for mapping ESTs, was not initially built for SNP finding and has not been heavily evaluated for it. From what I have heard, BLAT does not refine the final alignment, which can leave gaps positioned suboptimally and pose problems for indel finding. The default blat mode is also much slower and less accurate than ssaha2. In my view, it is a common mistake to overlook the superiority of ssaha2 for longer reads. The 1000 Genomes Project chose every program for a reason.

Last edited by lh3; 08-18-2010 at 07:58 AM.
Old 08-18-2010, 08:46 AM   #20
aleferna
Senior Member
 
Location: sweden

Join Date: Sep 2009
Posts: 121

@Adamo

Sorry if I'm spamming you; I don't understand how private messages work here. Send me a message at afer at kth.se and I can send you the script for joining the two BWA SAM files, if you still want to use BWA.

Tags
454, bwa-sw, ssaha2
