  • mapping 454 reads to a reference genome

    What is the best tool available for mapping 454 reads to a reference genome? What method does the GS Reference Mapper (the analysis tool that comes with 454) use, and does it do a decent job of mapping and identifying variants?

  • #2
    I'd like to know the answer to the exact same question. Anyone got any benchmarks for mapping 454 data? References?


    Am I right in thinking that only the GS Reference Mapper uses the correct error model for the 454 data? Does anyone know how they actually do that? i.e. How is the alignment done in flowspace?

    Actually, from the manual: "read to reference alignments are made in nucleotide space, the consensus basecalling and quality value determination for contigs are performed in flowspace."

    So the questions are:

    How is the consensus sequence created?
    (How) could the mapping be improved by using flowspace, and does any software do that?
    Homepage: Dan Bolser
    MetaBase the database of biological databases.


    • #3
      Mapping 454 reads to a reference genome

      I am also trying to find an answer to this question. There are many assemblers out there, but only one mapper?


      • #4
        We've been using the gsMapper for mapping human DNA to the hg18 reference genome, mainly because of its ease of use and the fact that most of the available mappers are for short reads. All I can say is that we got good results from the mapping. We sometimes did a BLAST search on specific reads to check, and these all checked out.
        We compared it briefly with the mapping from CLC bio and saw no major differences, but we didn't look into it that much.

        In any run we can usually map about 99% of the bases.


        • #5
          454 mapping benchmark

          Some benchmarks of how the CLC bio assembler maps 454 reads can be found in the white paper at http://www.clcbio.com/files/whitepap...C_NGS_Cell.pdf

          Cheers

          Roald Forsberg

          Disclaimer: I work at CLC bio


          • #6
            454 mapping Lastz

            I've used Galaxy a little bit for read mapping (search around here for more explanation of what Galaxy is, if you don't know).

            http://main.g2.bx.psu.edu/

            Within Galaxy, 454 reads are mapped with a tool called Lastz.

            I don't know much about Lastz except that it was not particularly useful for my data, which is mRNA (splicing results in alignment gaps).


            • #7
              I've tested a few algorithms, and the best one so far for long 454 reads has been bwa bwasw (the long-read aligner). I would suggest using z = 10 or 100 if you have a big enough computer. There is another one that is highly recommended (by the maker of BWA) called Novoalign, but I haven't tried it. BWA is easy to run, it's fast, and it does a pretty good job as far as accuracy and specificity go. It will NOT map reads shorter than 30bp; for those you have to use the bwa short aligner. Since my 454 run had reads of many sizes, and in particular short reads, I had to use both.


              • #8
                On the downside, bwasw does not use paired-end information and it does not report more than one hit. The 1000 Genomes Project is using ssaha2, as speed is not a big concern given the relatively small number of 454 reads in the project. I have not used novoalign for 454. Its 454 mapping is a new function added not long ago, but I trust the novoalign developer to deliver a good product.

                @aleferna

                Thanks for your patches. I have not had time to commit them. I will definitely do so the next time I modify BWA.
                Last edited by lh3; 08-17-2010, 02:06 PM.


                • #9
                  The first time I ran BWA with the long aligner I didn't realize that there was a short/long option, and since I have both read types in my library I was very disappointed with BWA. I started testing algorithm after algorithm and finally reviewed BWA again. This time I made a small script that just joins two SAM files, one from the short aligner and one from the long aligner. It chooses the alignment from the short aligner if it cannot find the read in the long aligner's output. This was the winning combination.
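The joining step described above could look roughly like the sketch below. This is not the original script: the function names `read_sam` and `merge_long_short` are made up for illustration, and it assumes that "cannot find it" means the read is absent or flagged as unmapped (SAM flag 0x4) in the long-aligner output, and that only the first alignment per read name is kept.

```python
def read_sam(lines):
    """Split SAM text into header lines and a {read_name: alignment_line} dict.

    Assumption: only mapped reads are kept (flag bit 0x4 not set), and only
    the first alignment line seen for each read name.
    """
    header, hits = [], {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        fields = line.split("\t")
        if int(fields[1]) & 0x4:  # read is unmapped, treat as "not found"
            continue
        hits.setdefault(fields[0], line)
    return header, hits


def merge_long_short(long_lines, short_lines):
    """Prefer the long-aligner hit; fall back to the short-aligner hit."""
    header, long_hits = read_sam(long_lines)
    _, short_hits = read_sam(short_lines)
    merged = dict(short_hits)
    merged.update(long_hits)  # long-aligner alignments overwrite short ones
    return header, merged
```

Because the functions only iterate over lines, open file handles work as input, for example `merge_long_short(open("long.sam"), open("short.sam"))` with hypothetical file names.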

                  I've mentioned this chart in another thread, but here you can see that BWA is the only one that can cover the full range of read sizes in 454 datasets (or in 100bp Solexa data after you remove the paired-end adapters!)



                  Moreover, I know using Z=100 seems like a bit of overkill, but with 454 data and a decent computer BWA takes just a few minutes, and I did measure Z=1, 10, 25, 50, 100, 250 and even 500. Z=100 seems to be the peak: beyond it I cannot squeeze any more specificity out of the algorithm, but you do see a change from Z=10 to Z=100.
                  Last edited by aleferna; 08-17-2010, 09:52 PM.


                  • #10
                    Has anyone compared MIRA3 in mapping mode? That can map 454 reads onto a reference.


                    • #11
                      @aleferna

                      Do you simulate indels? When I tried blat for the bwa-sw paper, it was far less accurate on 200bp simulated reads with a 5% error rate. The table in the paper has -fastMap switched on, but the default is not much better. I do not want to claim anything bad about blat if the fault is mine!

                      EDIT: This is the table:



                      EDIT2: I think I might have figured out the difference. BLAT outputs multiple hits. Which hit are you choosing? Do you count a read as correctly mapped by BLAT if any one of the hits is correct, or only if the top-scored hit is correct? I use the latter. The script for converting PSL to SAM and calculating the BLAT alignment score (SAM AS tag):

                      Download SAM tools for free. SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignment. SAMtools provide efficient utilities on manipulating alignments in the SAM format.
                      Last edited by lh3; 08-18-2010, 05:49 AM.


                      • #12
                        @lh3

                        No, I didn't simulate indels in my study, and that might be the difference, but the proportions of deletions and insertions versus mutations that I used also simulate the characteristics of my own dataset. This dataset has a lot of deletions and insertions that are actually caused by the homopolymer problem in 454. I think, though, that blat should be as sensitive to a 5bp gap as to a 1bp gap. I mean, if you have even a 1bp gap, none of the tiles that overlap it will spawn an alignment. Therefore I don't think that indels should make that much difference in blat. The other difference is that I use a different mapping threshold for each read size.

                        I'm just a humble master's student, but if you ask me the -fastMap option is a big, big game changer. If you run with default options it is much better at higher error rates, while low error rates remain pretty much the same. Nevertheless, Blat can be as sensitive as you want. You can even use stepSize=1 and tileSize=8 to get amazing accuracy, but of course it did take about 30,000 CPU hours to get through one of my libraries (I had to use over 600 nodes on that one). Actually, Blat with high sensitivity is what I use to validate BWA.

                        So I think Blat IS more sensitive than BWA if you have a 75 TFlop cluster. I read your BWA paper and I think you cannot really compare BWA and Blat because they operate on very different time scales. What you can say is that BWA gives you 99% accuracy / 90% sensitivity N times faster.
                        Last edited by aleferna; 08-18-2010, 06:19 AM.


                        • #13
                          How do you deal with multiple hits from the BLAT output? Would you say blat gives the correct answer as long as one of the hits is correct?


                          • #14
                            No, I implemented a MapQ score: BestScore - 2ndBestScore (I don't divide by the total score like you do; this seems to improve the accuracy). I then optimized the minimum MapQ score (you use a fixed one of 10, I think) per read length and error rate by finding the lowest threshold that assures 99% specificity, and I report the sensitivity at that threshold.
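A minimal sketch of that scoring and threshold search, under stated assumptions: the names `score_gap` and `pick_threshold` are hypothetical, a read with a single hit is given a second-best score of 0, "specificity" is taken to mean the fraction of reported reads that are correct, and "sensitivity" the fraction of all simulated reads that are reported and correct. The post does not spell out these definitions, so treat them as one possible reading.

```python
def score_gap(scores):
    """MapQ-like value: best alignment score minus second-best score.

    Assumption: a read with only one hit has a second-best score of 0.
    Returns None for a read with no hits.
    """
    ordered = sorted(scores, reverse=True)
    if not ordered:
        return None
    second = ordered[1] if len(ordered) > 1 else 0
    return ordered[0] - second


def pick_threshold(reads, min_specificity=0.99):
    """Find the lowest gap threshold that reaches the target specificity.

    reads: list of (gap, is_correct) pairs from simulated reads whose true
    position is known. Returns (threshold, sensitivity), or (None, 0.0) if
    no threshold reaches the target.
    """
    for threshold in sorted({gap for gap, _ in reads}):
        kept = [ok for gap, ok in reads if gap >= threshold]
        correct = sum(kept)
        if kept and correct / len(kept) >= min_specificity:
            return threshold, correct / len(reads)
    return None, 0.0
```

To mirror the per-read-length optimization described in the post, `pick_threshold` would be called once per read-length and error-rate bin, each with its own list of simulated reads.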


                            • #15
                              Which score are you using? Is it the first column in the PSL (the number of matching bases)?
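For readers unfamiliar with PSL, the sketch below shows what ranking by the first column would look like. It is only an illustration of the "top-scored hit" criterion using the number of matching bases as the score, which is the option raised in the question; the thread does not confirm which score was actually used. The function name `top_hits_by_matches` is hypothetical, and the column positions follow the standard 21-column PSL layout (matches first, qName 10th, tName 14th, tStart 16th).

```python
def top_hits_by_matches(psl_lines):
    """Return {qName: (matches, tName, tStart)} for the best hit per read.

    Assumption: the hit with the most matching bases (PSL column 1) is the
    top-scored hit; the first such hit wins on ties.
    """
    best = {}
    for raw in psl_lines:
        fields = raw.rstrip("\n").split("\t")
        # Skip the optional PSL header and any malformed line.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])
        q_name, t_name, t_start = fields[9], fields[13], int(fields[15])
        if q_name not in best or matches > best[q_name][0]:
            best[q_name] = (matches, t_name, t_start)
    return best
```

Under the stricter criterion discussed above, a simulated read counts as correctly mapped only if the entry returned for it matches its true position, not if some lower-ranked hit does.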
