  • Trying to test the aligners

    Hello everybody!

    I'm starting to work in this field, and one of the first things I tried to do was a comparison of different short-read aligners (such as bfast, bowtie...).
    The main idea is to estimate how many reads each program can map, given a set of reads with a known number of errors/variants.

    To generate the reads, I used the utilities provided by the bfast package and, using the human reference genome, generated one million reads for each combination of: read length (50, 76 and 100bp), pairing (paired/unpaired), #SNPs (from 0 to 5) and #errors (from 0 to 5).
    Then I fed these inputs to the different aligners to see how many sequences each one managed to map.

    After the execution, I counted the number of matches (and the time the code took to get the work done).
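The test matrix described in this post can be enumerated mechanically. Here is a minimal Python sketch, assuming only the parameter values listed above. The dictionary keys and the `simulation_grid` name are made up for illustration, and the actual read generation with the bfast utilities is not shown.

```python
from itertools import product

READ_LENGTHS = (50, 76, 100)      # bp, as listed in the post
PAIRINGS = ("paired", "unpaired")
SNP_COUNTS = range(0, 6)          # 0 to 5 SNPs
ERROR_COUNTS = range(0, 6)        # 0 to 5 errors


def simulation_grid():
    """Return one dict per combination of simulation parameters."""
    return [
        {"read_length": length, "pairing": pairing, "snps": snps, "errors": errors}
        for length, pairing, snps, errors in product(
            READ_LENGTHS, PAIRINGS, SNP_COUNTS, ERROR_COUNTS
        )
    ]


grid = simulation_grid()
print(len(grid))  # 3 * 2 * 6 * 6 = 216 data sets
```

With one million reads per combination, that is 216 simulated data sets to run through every aligner, which is worth knowing before budgeting run time.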

    Do you think the "experiment" is meaningful? Am I missing something? Is the test flawed in some way?

    Please, let me know your opinions! Thanks a lot!

  • #2
    Most aligners offer settings that let you trade off between sensitivity and speed. To compare them, you'll need to control one variable as much as possible by tuning them so they either take approximately the same time or align approximately the same number of reads. You should try a few levels of each, since different aligners may do better in different areas.

    Specificity is also important, since aligning lots of reads quickly doesn't matter if they're not the optimal alignments. This complicates benchmarks even more, since sometimes you can change settings that affect specificity as well. A 3D plot for each aligner showing the relationship between all three variables would be interesting if you have lots of free time.
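For simulated reads, where the origin of every read is known, the quantities discussed here can be tallied per aligner run. The following is only a rough Python sketch under stated assumptions: "correct" is taken to mean "reported on the same chromosome within `tolerance` bases of the simulated origin", which is one possible definition among several, and the `summarize` function and its two input dictionaries are hypothetical, not part of any aligner's output format.

```python
def summarize(truth, reported, tolerance=0):
    """Summarize one aligner run on simulated reads.

    truth:    read name -> (chromosome, position) the read was simulated from
    reported: read name -> (chromosome, position) reported by the aligner;
              unaligned reads are simply absent
    """
    total = len(truth)
    aligned = 0
    correct = 0
    for name, (true_chrom, true_pos) in truth.items():
        hit = reported.get(name)
        if hit is None:
            continue  # the aligner did not place this read
        aligned += 1
        chrom, pos = hit
        if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
            correct += 1
    return {
        "aligned_fraction": aligned / total if total else 0.0,
        "correct_fraction_of_aligned": correct / aligned if aligned else 0.0,
        "correct_fraction_of_all": correct / total if total else 0.0,
    }


# Toy example: four simulated reads, three aligned, two at the right place.
truth = {"r1": ("chr1", 100), "r2": ("chr1", 500), "r3": ("chr2", 42), "r4": ("chr2", 900)}
reported = {"r1": ("chr1", 100), "r2": ("chr1", 502), "r3": ("chr5", 42)}
stats = summarize(truth, reported, tolerance=5)
print(stats)
```

Recording these numbers together with the wall-clock time of each run gives the three axes mentioned above, so runs with different settings can be compared at matched time or matched aligned fraction.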

    • #3
      I am taking the aligners' parameters into account. I'm trying to get the maximum number of correct results, regardless of time spent (up to some limit). But tweaking all the aligners is not very easy without knowing them in advance. Each one seems to have its tricks.

      About the specificity problem... I'm not really sure how to ensure it. How can I know that an alignment is optimal? With perfect reads, without mismatches, I can compare the read's true position with the position predicted by the software, but once we start to introduce variations, a read can end up matching a different region better than its original one. And deciding which alignment is optimal is not easy. If it were... all the problems of read alignment would be gone.
      For now, I'm relying on the assumption that if the aligner returns an alignment, it is a correct one.

      • #4
        I would read http://lh3lh3.users.sourceforge.net/false-bench.shtml,
        http://lh3lh3.users.sourceforge.net/bioinfo.shtml, and http://www.nilshomer.com/index.php?title=NGS_Alignment before continuing. The first two are written by Heng Li (MAQ/BWA) and the last is written by myself (BFAST/SRMA). Also, check out the original papers for each aligner, as I am sure they performed alignment comparisons (how did they do it?). For a discussion of possible mapping errors, the supplementary materials of the MAQ paper are the best.

        • #5
          Originally posted by Poshi:
          I'm trying to get the maximum number of correct results, regardless of time spent (up to some limit).
          I think "up to some limit" will be key here. I'm actually not familiar with the settings for BFAST or Bowtie, but the aligners I've used could theoretically have 100% sensitivity regardless of mismatches... it just might take years to run (and/or use way too much RAM). The developers chose default settings that would result in reasonable run times and RAM usage, but their definition of "reasonable" may differ from other people's, and from yours. You'll have to get to know the settings well enough to choose your own reasonable limits.

          Originally posted by Poshi:
          In the case of perfect reads, without mismatches, I can compare the position of the read with the position predicted by the software, but once we start to introduce variations we can end up having a read that matches better a different area than the original.
          I would say that the "correct" alignment is the optimal match, regardless of where the read actually came from, since that's what you would want to find if you didn't know the actual origin. Of course, the definition of "optimal" varies from one tool to another also (there is usually some kind of p-value). Ideally you would do your own assessment that accounts for everything you need (paired-end optimization, quality scores, etc.), possibly based on one of the aligners' methods.
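To see how often a reported alignment is at least as good as the simulated origin, the read can be scored at both locations. This is a deliberately simplified Python sketch: it assumes ungapped alignments on the forward strand, scores by plain mismatch count (real aligners also weigh base qualities, indels and pairing), and uses the hypothetical helper names `mismatches` and `classify` with 0-based positions.

```python
def mismatches(reference, read, position):
    """Count mismatching bases between read and reference at a 0-based position."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    return sum(1 for a, b in zip(read, window) if a != b)


def classify(reference, read, true_pos, reported_pos):
    """Label a reported alignment relative to the simulated origin."""
    if reported_pos == true_pos:
        return "origin"
    at_origin = mismatches(reference, read, true_pos)
    at_reported = mismatches(reference, read, reported_pos)
    if at_reported < at_origin:
        return "better_than_origin"
    if at_reported == at_origin:
        return "tied_with_origin"
    return "worse_than_origin"


# Toy reference with two near-identical regions (positions 0 and 10).
reference = "ACGTACGTTTACGAACGT"
# Read simulated from position 10 ("ACGAACGT") with one SNP (A -> T),
# which happens to make it identical to the region at position 0.
read = "ACGTACGT"

print(classify(reference, read, true_pos=10, reported_pos=0))   # better_than_origin
print(classify(reference, read, true_pos=10, reported_pos=10))  # origin
```

Under this scoring, an aligner that places the read at position 0 has found the best match even though it missed the true origin, so counting only "origin" hits as correct would penalize it unfairly.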

          As you can tell from nilshomer's links, this gets very complicated, but it's necessary to account for these things. You may want to at least restrict your tests to conditions closest to what your real data will look like.

          • #6
            Acknowledging my own self-promotion, I would also read this paper: http://dx.doi.org/10.1093/bib/bbq015

            • #7
              Thanks a lot for all the advice and the links provided. I have some things to read :-)
              I read the papers on the different aligners, but I was not convinced by the results. Most of them claimed that their aligner was the best but... they cannot all be "the best" at the same time! This is what pushed me to do my own testing.

              I will follow the advice given in the links provided, although I think that not all of it is appropriate or feasible in my case (like the suggestion to run a long input instead of a short one: because I'm only assessing the quality of the results and not their speed or scalability, this does not matter).

              When I talk about "time up to some limit", I'm thinking of a reasonable time for the test, say a couple of days. And when I talk about memory, I mean the total amount of memory available on our machines. When I run a timing benchmark, the limits will be different.

              What seems quite clear is that I should check the quality of the alignments by comparing each read against the reference region it was aligned to. But any alignment the aligner returns is presumed to be correct, so... I will have to define what I consider "optimal".

              Thanks a lot for your comments. I'll try to improve my tests and maybe I will switch to a real data set to have a different scenario.
