SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Programs for 454 data sequence alignment and mapping Abishai3911 454 Pyrosequencing 0 06-30-2011 03:12 PM
Are there any alignment programs that take SNPs into account during alignment? sdarko Bioinformatics 2 06-04-2011 06:09 AM
Online Bisulfite Alignment programs? hxm44 Bioinformatics 4 05-18-2011 05:11 PM
Alignment programs comparison bioxyz Bioinformatics 0 11-25-2009 04:33 PM
Alignment/assembly programs - Patent minefield GerryB Bioinformatics 3 06-08-2009 09:05 AM

Reply
 
Thread Tools
Old 03-07-2008, 05:28 AM   #1
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default Preliminary benchmark of different alignment programs

Here I will present a small comparison between different read alignment programs, including eland, soap, rmap and maq. The comparison is designed as follows. From human reference chrX, I simulated 1 million pairs of 32bp reads (i.e. 2 million reads) with maq. Maq first simulated a diploid sequence (two haploid sequences) by adding 135351 substitutions, 7650 1bp insertions and 7510 1bp deletions, and then generated reads from the diploid sequence. The base quality were generated based on real data. Sequencing errors were added based on qualities. As we know the exact position where the reads come from, we can exactly count the number of reads that are wrongly mapped by a certain alignment program.

To facilitate comparison, I wrote a Perl script 'paf_utils.pl' that converts various alignment format to the socalled PAF format (Pairwise Alignment Format), which is only for my own use at the moment.

Important Notes
===============

* This benchmark favours maq because I have not squeezed the best performance out of other software. Here are the reasons.

* eland: eland is able to make use of paired end (PE) information and generate accurate mapping qualities. However, these functionality are assisted by several scripts in the GAPipeline-0.3.0 and for the moment I do not know how to run these scripts independently without invoking the full Gerald module of the pipeline. I was running eland in the single end mode without mapping qualities.

* rmap: rmapq is able to use base qualities to improve the alignment. However, it requires _prb.txt format which is rarely used at the Sanger Institute and in Illumina as well. Maq does not support this format in simulation and therefore rmapq is not evaluated. I am sure rmapq would work much better than rmap if base qualities were available.

* soap: soap also comes with the PE alignment mode. Unfortunately, it segfaults when I try to use this mode. Perhaps my input file is buggy or I was doing stupid things, but anyway, I could not evaluate soap in PE alignment mode. If this mode worked, soap would get higher mapping accuracy.

* shrimp: I tuned the parameters for faster speed but with lower sensitivity. The overall accuracy would be affected. In addition, I was using version 1.04 instead of the latest 1.05. The latest version has been optimized a bit for mapping Illumina reads.

* In simulation we assume that base qualities are accurate, but this is not the case on real data. When base qualities are inaccurate, maq's mapping qualities will be less accurate, which will also reduce the accuracy of alignments. rmapq will suffer from the similar problem more or less.

* The CPU time is only a rough approximate. It is subjected to the conditions (e.g. memory and cache uses) of the machine where the program runs. On different conditions, the speed may vary up to 30%.

Results
=======

* eland (GAPipeline-0.3.0, mapping quality ROUGHLY estimated assuming uniform quality Q25)
- CMD: ./eland_32 r12.fa chrX eland.txt
- CPU time: 284.64 sec
- res memory: 383 MB
- # Q0;Q10;Q20 mapped reads: 1,686,129;1678776;1622577
- # Q0;Q10;Q20 wrongly mapped reads: 7,406;4234;819 (p_err: 0.439%;0.252%;0.0505%)
- Other features: PE alignment (not evaluated); mapping qualities (not evaluated); counts of the alternative places.

* rmap (0.2)
- CMD: ./rmap r12.fa -m 2 -o rmap.txt -c chrX.fa -w 32 -v
- CPU time: 2230.07 sec
- res memory: 894 MB
- # mapped reads: 1,686,129
- # wrongly mapped reads: 7,408 (p_err: 0.4393%)
- Other features: mapping using base qualities (not evaluated); 3 or more mismatches (not evaluated)

* maq-se (0.6.3-33, almost identical to 0.6.4)
- CMD: maq map maq_se.map chrX.bfa r12.bfq
- CPU time: 1039.89 sec
- res memory: 819 MB
- # Q0;Q10;Q20 mapped reads: 1946152; 1665959; 1614122
- # Q0;Q10;Q20 wrongly mapped reads: 200563; 1317; 320 (p_err: 10.33%; 0.0790%; 0.0198%)

* maq-pe (0.6.3-33, almost identical to 0.6.4)
- CMD: maq map maq_pe.map chrX.bfa r1.bfq r2.bfq
- CPU time: 1256.13 sec
- res memory: 674 MB
- # Q0;Q10;Q20 mapped reads: 1949177; 1756368; 1728480
- # Q0;Q10;Q20 wrongly mapped reads: 68542; 281; 153 (p_err: 3.516%; 0.0160%; 0.0088%)
- # reads mapped with indel: 2277
- # wrongly mapped indel reads: 0 (p_err = 0/2277 = 0%)

* soap-c0r0g0s (1.07, CPU time is measured with 1.05)
- CMD: ./soap -c 0 -r 0 -g 0 -a r12.fa -d chrX.fa -o soap_c0r0g0s.sop
- CPU time: 1228.120 sec (or 1417.29 user time/372.44 real time if '-p 4' is applied)
- res memory: 963 MB
- # mapped reads: 1,686,034
- # wrongly mapped reads: 7,840 (p_err: 0.4650%)

* soap-c0r0g3s (1.07, CPU time with 1.05)
- CMD: ./soap -c 0 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c0r0g3s.sop
- CPU time: 1398.95 sec
- res memory: 963 MB
- # mapped reads: 1,687,887
- # wrongly mapped reads: 7,854 (p_err: 0.4653%)
- # reads mapped with indel: 1853
- # wrongly mapped indel reads: 14 (p_err: 0.756%)

* soap-c42r0g3s (1.07, CPU time with 1.05)
- CMD: ./soap -c 42 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c42r0g3s.sop
- CPU time: 1517.41 sec
- res memory: 963 MB
- # mapped reads: 1,698,212
- # wrongly mapped reads: 8,131 (p_err: 0.4788%)
- # reads mapped with indel: 2207
- # wrongly mapped indel reads: 41 (p_err: 1.86%)

* shrimp (1.04)
- CMD: ./rmapper-ls -s 111111011111 -w 35 -n 3 -r 32 -o 20 -d -1 -h 2675 r12.fa chrX.fa
- CPU time: 11875.75 sec
- res memory: 3085MB
- # all;normodds>0.5 mapped reads: 1930350;1587003
- # all;normodds>0.5 wrongly mapped reads: 204355;2414 (p_err: 10.6%;0.152%)
- # all;normodds>0.5 reads mapped with indel: 2062;1817
- # all;normodds>0.5 wrongly mapped indel reads: 212;24 (p_err: 10.3%;1.38%)

Additional Notes
================

* eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.

* rmap: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq-se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.

* soap: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.

* shrimp: Actually I was not expecting that a program using seeding+Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP's normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to "nan" normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.

* maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq's huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq's random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

Update
======

* 2008-03-13
- Use maq version 0.6.3-33, which is faster than the public version 0.6.3.
- Compile all the programs with Intel C/C++ compiler and run all the programs on the same 8-core 64bit machine (Xeon-L5320 1.86GHz) with 8GB memory. Multi-threading is NOT used. Previously I run some programs on a faster machine (Xeon-5150 2.66GHz) and some on a slower machine (1.86GHz), which is unfair.
- I invoked two programs at a time. The combination is: (maq_se,maq_pe), (eland,eland-multi), (rmap,soap_c0r0g0s), (soap_c0r0g3s,soap_c42r0g3s). Note that processes on the same machine may compete with the memory bandwidth and cache uses, which may affect the speed.

* 2008-03-14
- Test multithreading of SOAP.

* 2008-03-22
- Use soap_1.07. But PE mode did not work for me unfortunately.
- Add comments for the additional wrong alignments of soap (in comparison to eland).
- Add "fake" eland mapping quality (done by my paf_utils.pl script).
- Evaluate indel alignment. Note that I am using soap's single-end mode. PE mode would give better accuracy.

* 2008-03-27
- Evaluate SHRiMP-1.04

Last edited by lh3; 03-27-2008 at 03:07 PM.
lh3 is offline   Reply With Quote
Old 03-07-2008, 11:32 AM   #2
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Hi,
thanks for a nice benchmark. So far, I have only just tried the SOAP program for Solexa reads but it seems to be very fast. How can Eland be so much faster, even when SOAP is doing ungapped alignments? As I understand it the algorithm is more or less the same. What system was this run on and are the programs using the same number of cores?

cheers,
Ola
Chipper is offline   Reply With Quote
Old 03-10-2008, 10:12 AM   #3
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

great work !!
bioinfosm is offline   Reply With Quote
Old 03-10-2008, 11:27 AM   #4
Malarky67
Junior Member
 
Location: London, UK

Join Date: Jan 2008
Posts: 5
Default

Hi
SOAP does seem slow. You haven't set it to work on more than one processor. Was this simulation done on a single processor machine?

e.g. from SOAP usage

-p <int> number of processors to use, default=1


Does Eland automatically have the same behaviour?
Malarky67 is offline   Reply With Quote
Old 03-10-2008, 03:12 PM   #5
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Quote:
Originally Posted by Malarky67 View Post
Hi
SOAP does seem slow. You haven't set it to work on more than one processor. Was this simulation done on a single processor machine?

e.g. from SOAP usage

-p <int> number of processors to use, default=1


Does Eland automatically have the same behaviour?
I tried SOAP some more and it seems to adjust seed size to available memory, which is likely to affect the run time. As I understand it Eland places reads not reference in memory.
Chipper is offline   Reply With Quote
Old 03-10-2008, 03:37 PM   #6
Malarky67
Junior Member
 
Location: London, UK

Join Date: Jan 2008
Posts: 5
Default

Quote:
Originally Posted by Chipper View Post
I tried SOAP some more and it seems to adjust seed size to available memory, which is likely to affect the run time. As I understand it Eland places reads not reference in memory.
Yes. That is what I have heard. Does anyone understand how these algorithms are parallelised across multiple processors or even nodes of a cluster? (especially in the case of nodes are reference tables built for each node?)
Malarky67 is offline   Reply With Quote
Old 03-13-2008, 04:47 PM   #7
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Eland supports multithreading in its source codes, but apparently it is not activated by default (if I am right). In fact, as Eland is fast and small, we can invoke several eland on a node at the same time. Parallelization on a large cluster can be easily managed by LSF or SGE.

On a node, SOAP can be parallelized with -p (the CPU time should be similar with or without -p) and on a cluster you can also use LSF/SGE. However, as soap may require huge memory, you will have to design a clever strategy, based on the hardware configurations, to run it efficiently without breaking your clusters. In addition, LSF/SGE usually prefers single-thread jobs. Multi-thread jobs may reduce the overall efficiency of a cluster unless all the nodes in the cluster has dozens of processors.

In my view, indexing the reference helps to get faster speed but tends to be memory demanding for human alignment. Indexing reads is more scalable but may sacrify some speed. Eland is still faster firstly because Tony Cox at Illumina is one of the best programmers in this field and secondly because soap has to trade speed for memory. Using long seed for human alignment is impractical.

Anyway, all the software in this benchmark are great. Sometimes which to use is just subjected to your appetite.
lh3 is offline   Reply With Quote
Old 03-14-2008, 08:24 AM   #8
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

I just tried SOAP with and without -p 4 on a quadcore, it was about 3 times faster with all four cores active (2.5 M reads, chr1).
Chipper is offline   Reply With Quote
Old 03-14-2008, 09:41 AM   #9
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Quote:
Originally Posted by Chipper View Post
I just tried SOAP with and without -p 4 on a quadcore, it was about 3 times faster with all four cores active (2.5 M reads, chr1).
You must be couting the wall-clock time or the time soap was printing out. Although wall-clock time should be considered, it is more important to evaluate on processor time. If you split your read file into 4 chunks and align them with 4 eland, I am sure eland will be still several times faster.

Here is the follow-up. When '-p 4 -g 0 -c 0 -r 0' is applied:

real 6m12.443s
user 23m37.292s (=1417.292 sec)
sys 0m3.799s

As we would expect, it takes more user time (vs. 1228.12 sec). Anyway, multithreading is still useful if you do not have many reads and want to get the result quickly. This is definitely the gain of multithreading. SOAP is also a great software. Thank you for pointing this out.

Last edited by lh3; 03-14-2008 at 10:10 AM.
lh3 is offline   Reply With Quote
Old 03-14-2008, 10:36 AM   #10
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

I was not aware that there is more than one clock to watch... Anyway, the main advantage for SOAP over Eland is that you can actually download and use it, plus the ability to use longer reads, or trim them if no match is found. Which happens quite frequently...
Chipper is offline   Reply With Quote
Old 03-21-2008, 08:50 AM   #11
brudno
Member
 
Location: Toronto

Join Date: Mar 2008
Posts: 11
Default

Hi --

Any interest on trying this with SHRiMP? I am guessing it'll be much slower than the other tools, though we are working to improve this. While the main use for shrimp is probably going to be SOLiD, we do support solexa as well.

Cheers, -Mike
brudno is offline   Reply With Quote
Old 04-13-2008, 12:37 AM   #12
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Smile Dataset

Nice work Ih3,

I think this is an invaluable exercise amidst all the emerging tools. Is the benchmark dataset available anywhere because I'd like to test similar metrics on tools that we have in our lab?

Thanks for your help.
zee is offline   Reply With Quote
Old 04-14-2008, 02:39 AM   #13
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

The simulated data are free to use, of course, but unfortunately I could not find a FTP site to upload the data. You may try maq to simulate the reads by yourself at the moment. The read positions are coded in read names and so you can write your own script to evaluate the accuracy.
lh3 is offline   Reply With Quote
Old 04-14-2008, 07:48 AM   #14
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Thanks, I just got the maq package and I'd like to simulate data according to the same rules you used.
Is it correct to assume that in cases where I use eland that I won't be able to make use of the fastq format and instead I need to convert to fasta using a script?
The simulation steps seem quite straightforward.
zee is offline   Reply With Quote
Old 05-02-2008, 08:08 AM   #15
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

How does maq calculate mapping quality? I looked in the documentation and couldn't find anything. Is it based on the ASCII quality code in the query sequence?
zee is offline   Reply With Quote
Old 05-04-2008, 09:40 AM   #16
ShaunMahony
Member
 
Location: University Park, PA

Join Date: Apr 2008
Posts: 27
Default

I'm also interested in how MAQ assigns quality scores. Can you confirm what you meant by "Q0;Q10;Q20" in the MAQ tests? Is this a threshold on the quality score that MAQ gives the alignment (as opposed to the quality score of the read in the Fastq file)? If a read maps to multiple locations, MAQ reports one location at random and assigns a quality score of 0. Therefore the Q0 accuracy should be much less than if you had excluded these alignments. I think that this behavior is a bit strange; it would be less confusing if MAQ didn't report any matches for the non-uniquely mapping reads and instead reported the number of places that the read maps (the whole read, not just the first 25 bases).
ShaunMahony is offline   Reply With Quote
Old 05-04-2008, 10:11 AM   #17
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,352
Default

I'm definitely out of my league in this discussion, but if anyone needs hosting for some of these sample datasets, let me know!
ECO is offline   Reply With Quote
Old 05-04-2008, 12:16 PM   #18
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Q0;Q10;Q20 is threshold on the alignment quality score assigned by MAQ.

MAQ is initially designed for resequencing and keeping these repetitive reads is quite useful for the subsequent SNP calling with MAQ. This also helps CNV calling. I could understand that a lot of people do not want to see all these repetitive reads, but putting a threshold on mapping quality is very easy anyway. In addition, different people may want to set different different threshold.

As for the calculation of mapping quality, it just follows a very simple Bayesian procedure. You can calculate p(z|x,u) of read z mapped to u on the reference x. With Bayesian formula you get p(u|x,z). The mapping quality is -10log10(1-p(u|x,z)).
lh3 is offline   Reply With Quote
Old 05-04-2008, 12:32 PM   #19
ShaunMahony
Member
 
Location: University Park, PA

Join Date: Apr 2008
Posts: 27
Default

Thanks for the info Ih3. I agree that it is very useful to report the locations of repetitive / non-uniquely mappable reads. However, can MAQ be set to report ALL of the repetitive locations rather than just a single random one? I know that this is off-topic; apologies.
ShaunMahony is offline   Reply With Quote
Old 05-04-2008, 12:42 PM   #20
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

The latest version, 0.6.6, can output ALL hits with 0- or 1-mismatch in the seed.
lh3 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO