Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Preliminary benchmark of different alignment programs

    Here I will present a small comparison between different read alignment programs, including eland, soap, rmap and maq. The comparison is designed as follows. From human reference chrX, I simulated 1 million pairs of 32bp reads (i.e. 2 million reads) with maq. Maq first simulated a diploid sequence (two haploid sequences) by adding 135351 substitutions, 7650 1bp insertions and 7510 1bp deletions, and then generated reads from the diploid sequence. The base quality were generated based on real data. Sequencing errors were added based on qualities. As we know the exact position where the reads come from, we can exactly count the number of reads that are wrongly mapped by a certain alignment program.

    To facilitate comparison, I wrote a Perl script 'paf_utils.pl' that converts various alignment format to the socalled PAF format (Pairwise Alignment Format), which is only for my own use at the moment.

    Important Notes
    ===============

    * This benchmark favours maq because I have not squeezed the best performance out of other software. Here are the reasons.

    * eland: eland is able to make use of paired end (PE) information and generate accurate mapping qualities. However, these functionality are assisted by several scripts in the GAPipeline-0.3.0 and for the moment I do not know how to run these scripts independently without invoking the full Gerald module of the pipeline. I was running eland in the single end mode without mapping qualities.

    * rmap: rmapq is able to use base qualities to improve the alignment. However, it requires _prb.txt format which is rarely used at the Sanger Institute and in Illumina as well. Maq does not support this format in simulation and therefore rmapq is not evaluated. I am sure rmapq would work much better than rmap if base qualities were available.

    * soap: soap also comes with the PE alignment mode. Unfortunately, it segfaults when I try to use this mode. Perhaps my input file is buggy or I was doing stupid things, but anyway, I could not evaluate soap in PE alignment mode. If this mode worked, soap would get higher mapping accuracy.

    * shrimp: I tuned the parameters for faster speed but with lower sensitivity. The overall accuracy would be affected. In addition, I was using version 1.04 instead of the latest 1.05. The latest version has been optimized a bit for mapping Illumina reads.

    * In simulation we assume that base qualities are accurate, but this is not the case on real data. When base qualities are inaccurate, maq's mapping qualities will be less accurate, which will also reduce the accuracy of alignments. rmapq will suffer from the similar problem more or less.

    * The CPU time is only a rough approximate. It is subjected to the conditions (e.g. memory and cache uses) of the machine where the program runs. On different conditions, the speed may vary up to 30%.

    Results
    =======

    * eland (GAPipeline-0.3.0, mapping quality ROUGHLY estimated assuming uniform quality Q25)
    - CMD: ./eland_32 r12.fa chrX eland.txt
    - CPU time: 284.64 sec
    - res memory: 383 MB
    - # Q0;Q10;Q20 mapped reads: 1,686,129;1678776;1622577
    - # Q0;Q10;Q20 wrongly mapped reads: 7,406;4234;819 (p_err: 0.439%;0.252%;0.0505%)
    - Other features: PE alignment (not evaluated); mapping qualities (not evaluated); counts of the alternative places.

    * rmap (0.2)
    - CMD: ./rmap r12.fa -m 2 -o rmap.txt -c chrX.fa -w 32 -v
    - CPU time: 2230.07 sec
    - res memory: 894 MB
    - # mapped reads: 1,686,129
    - # wrongly mapped reads: 7,408 (p_err: 0.4393%)
    - Other features: mapping using base qualities (not evaluated); 3 or more mismatches (not evaluated)

    * maq-se (0.6.3-33, almost identical to 0.6.4)
    - CMD: maq map maq_se.map chrX.bfa r12.bfq
    - CPU time: 1039.89 sec
    - res memory: 819 MB
    - # Q0;Q10;Q20 mapped reads: 1946152; 1665959; 1614122
    - # Q0;Q10;Q20 wrongly mapped reads: 200563; 1317; 320 (p_err: 10.33%; 0.0790%; 0.0198%)

    * maq-pe (0.6.3-33, almost identical to 0.6.4)
    - CMD: maq map maq_pe.map chrX.bfa r1.bfq r2.bfq
    - CPU time: 1256.13 sec
    - res memory: 674 MB
    - # Q0;Q10;Q20 mapped reads: 1949177; 1756368; 1728480
    - # Q0;Q10;Q20 wrongly mapped reads: 68542; 281; 153 (p_err: 3.516%; 0.0160%; 0.0088%)
    - # reads mapped with indel: 2277
    - # wrongly mapped indel reads: 0 (p_err = 0/2277 = 0%)

    * soap-c0r0g0s (1.07, CPU time is measured with 1.05)
    - CMD: ./soap -c 0 -r 0 -g 0 -a r12.fa -d chrX.fa -o soap_c0r0g0s.sop
    - CPU time: 1228.120 sec (or 1417.29 user time/372.44 real time if '-p 4' is applied)
    - res memory: 963 MB
    - # mapped reads: 1,686,034
    - # wrongly mapped reads: 7,840 (p_err: 0.4650%)

    * soap-c0r0g3s (1.07, CPU time with 1.05)
    - CMD: ./soap -c 0 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c0r0g3s.sop
    - CPU time: 1398.95 sec
    - res memory: 963 MB
    - # mapped reads: 1,687,887
    - # wrongly mapped reads: 7,854 (p_err: 0.4653%)
    - # reads mapped with indel: 1853
    - # wrongly mapped indel reads: 14 (p_err: 0.756%)

    * soap-c42r0g3s (1.07, CPU time with 1.05)
    - CMD: ./soap -c 42 -r 0 -g 3 -a r12.fa -d chrX.fa -o soap_c42r0g3s.sop
    - CPU time: 1517.41 sec
    - res memory: 963 MB
    - # mapped reads: 1,698,212
    - # wrongly mapped reads: 8,131 (p_err: 0.4788%)
    - # reads mapped with indel: 2207
    - # wrongly mapped indel reads: 41 (p_err: 1.86%)

    * shrimp (1.04)
    - CMD: ./rmapper-ls -s 111111011111 -w 35 -n 3 -r 32 -o 20 -d -1 -h 2675 r12.fa chrX.fa
    - CPU time: 11875.75 sec
    - res memory: 3085MB
    - # all;normodds>0.5 mapped reads: 1930350;1587003
    - # all;normodds>0.5 wrongly mapped reads: 204355;2414 (p_err: 10.6%;0.152%)
    - # all;normodds>0.5 reads mapped with indel: 2062;1817
    - # all;normodds>0.5 wrongly mapped indel reads: 212;24 (p_err: 10.3%;1.38%)

    Additional Notes
    ================

    * eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.

    * rmap: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq-se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.

    * soap: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.

    * shrimp: Actually I was not expecting that a program using seeding+Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP's normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to "nan" normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.

    * maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq's huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq's random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

    Update
    ======

    * 2008-03-13
    - Use maq version 0.6.3-33, which is faster than the public version 0.6.3.
    - Compile all the programs with Intel C/C++ compiler and run all the programs on the same 8-core 64bit machine (Xeon-L5320 1.86GHz) with 8GB memory. Multi-threading is NOT used. Previously I run some programs on a faster machine (Xeon-5150 2.66GHz) and some on a slower machine (1.86GHz), which is unfair.
    - I invoked two programs at a time. The combination is: (maq_se,maq_pe), (eland,eland-multi), (rmap,soap_c0r0g0s), (soap_c0r0g3s,soap_c42r0g3s). Note that processes on the same machine may compete with the memory bandwidth and cache uses, which may affect the speed.

    * 2008-03-14
    - Test multithreading of SOAP.

    * 2008-03-22
    - Use soap_1.07. But PE mode did not work for me unfortunately.
    - Add comments for the additional wrong alignments of soap (in comparison to eland).
    - Add "fake" eland mapping quality (done by my paf_utils.pl script).
    - Evaluate indel alignment. Note that I am using soap's single-end mode. PE mode would give better accuracy.

    * 2008-03-27
    - Evaluate SHRiMP-1.04
    Last edited by lh3; 03-27-2008, 02:07 PM.

  • #2
    Hi,
    thanks for a nice benchmark. So far, I have only just tried the SOAP program for Solexa reads but it seems to be very fast. How can Eland be so much faster, even when SOAP is doing ungapped alignments? As I understand it the algorithm is more or less the same. What system was this run on and are the programs using the same number of cores?

    cheers,
    Ola

    Comment


    • #3
      great work !!
      --
      bioinfosm

      Comment


      • #4
        Hi
        SOAP does seem slow. You haven't set it to work on more than one processor. Was this simulation done on a single processor machine?

        e.g. from SOAP usage

        -p <int> number of processors to use, default=1


        Does Eland automatically have the same behaviour?

        Comment


        • #5
          Originally posted by Malarky67 View Post
          Hi
          SOAP does seem slow. You haven't set it to work on more than one processor. Was this simulation done on a single processor machine?

          e.g. from SOAP usage

          -p <int> number of processors to use, default=1


          Does Eland automatically have the same behaviour?
          I tried SOAP some more and it seems to adjust seed size to available memory, which is likely to affect the run time. As I understand it Eland places reads not reference in memory.

          Comment


          • #6
            Originally posted by Chipper View Post
            I tried SOAP some more and it seems to adjust seed size to available memory, which is likely to affect the run time. As I understand it Eland places reads not reference in memory.
            Yes. That is what I have heard. Does anyone understand how these algorithms are parallelised across multiple processors or even nodes of a cluster? (especially in the case of nodes are reference tables built for each node?)

            Comment


            • #7
              Eland supports multithreading in its source codes, but apparently it is not activated by default (if I am right). In fact, as Eland is fast and small, we can invoke several eland on a node at the same time. Parallelization on a large cluster can be easily managed by LSF or SGE.

              On a node, SOAP can be parallelized with -p (the CPU time should be similar with or without -p) and on a cluster you can also use LSF/SGE. However, as soap may require huge memory, you will have to design a clever strategy, based on the hardware configurations, to run it efficiently without breaking your clusters. In addition, LSF/SGE usually prefers single-thread jobs. Multi-thread jobs may reduce the overall efficiency of a cluster unless all the nodes in the cluster has dozens of processors.

              In my view, indexing the reference helps to get faster speed but tends to be memory demanding for human alignment. Indexing reads is more scalable but may sacrify some speed. Eland is still faster firstly because Tony Cox at Illumina is one of the best programmers in this field and secondly because soap has to trade speed for memory. Using long seed for human alignment is impractical.

              Anyway, all the software in this benchmark are great. Sometimes which to use is just subjected to your appetite.

              Comment


              • #8
                I just tried SOAP with and without -p 4 on a quadcore, it was about 3 times faster with all four cores active (2.5 M reads, chr1).

                Comment


                • #9
                  Originally posted by Chipper View Post
                  I just tried SOAP with and without -p 4 on a quadcore, it was about 3 times faster with all four cores active (2.5 M reads, chr1).
                  You must be couting the wall-clock time or the time soap was printing out. Although wall-clock time should be considered, it is more important to evaluate on processor time. If you split your read file into 4 chunks and align them with 4 eland, I am sure eland will be still several times faster.

                  Here is the follow-up. When '-p 4 -g 0 -c 0 -r 0' is applied:

                  real 6m12.443s
                  user 23m37.292s (=1417.292 sec)
                  sys 0m3.799s

                  As we would expect, it takes more user time (vs. 1228.12 sec). Anyway, multithreading is still useful if you do not have many reads and want to get the result quickly. This is definitely the gain of multithreading. SOAP is also a great software. Thank you for pointing this out.
                  Last edited by lh3; 03-14-2008, 09:10 AM.

                  Comment


                  • #10
                    I was not aware that there is more than one clock to watch... Anyway, the main advantage for SOAP over Eland is that you can actually download and use it, plus the ability to use longer reads, or trim them if no match is found. Which happens quite frequently...

                    Comment


                    • #11
                      Hi --

                      Any interest on trying this with SHRiMP? I am guessing it'll be much slower than the other tools, though we are working to improve this. While the main use for shrimp is probably going to be SOLiD, we do support solexa as well.

                      Cheers, -Mike

                      Comment


                      • #12
                        Dataset

                        Nice work Ih3,

                        I think this is an invaluable exercise amidst all the emerging tools. Is the benchmark dataset available anywhere because I'd like to test similar metrics on tools that we have in our lab?

                        Thanks for your help.

                        Comment


                        • #13
                          The simulated data are free to use, of course, but unfortunately I could not find a FTP site to upload the data. You may try maq to simulate the reads by yourself at the moment. The read positions are coded in read names and so you can write your own script to evaluate the accuracy.

                          Comment


                          • #14
                            Thanks, I just got the maq package and I'd like to simulate data according to the same rules you used.
                            Is it correct to assume that in cases where I use eland that I won't be able to make use of the fastq format and instead I need to convert to fasta using a script?
                            The simulation steps seem quite straightforward.

                            Comment


                            • #15
                              How does maq calculate mapping quality? I looked in the documentation and couldn't find anything. Is it based on the ASCII quality code in the query sequence?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X