  • Short Read Micro re-Aligner (beta release)

    We are pleased to announce the beta release of a new tool called SRMA: the short read micro re-aligner. We have tested this method on human cancer resequencing datasets and validated it with simulations. We are looking for beta testers to provide feedback and suggest new features for the tool.

    Link:


    Short description:
    Sequence alignment algorithms examine each read independently. When indels occur towards the ends of reads, the alignment can lead to false SNPs as well as improperly placed indels. This tool aims to perform a re-alignment of each read to a graphical representation of all alignments within a local region to provide a better overall base-resolution consensus.

    Features:
    - The input is a BAM, the output is BAM.
    - Specify a co-ordinate range for large-scale parallelism or local regions of interest.
    - SOLiD data is re-aligned using the original color space reads and qualities to maximally use all information available (SAM CS/CQ tags must be present).
    - A base correction mode for Illumina/454 data automatically recalls bases in the reads based on all alignments, removing spurious variants and adjusting their respective base qualities.
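The coordinate-range feature above lends itself to splitting a chromosome into chunks that can be processed as separate jobs. Below is a minimal sketch of generating such chunks. The `contig:start-end` string format (1-based, inclusive) and the helper name `make_ranges` are assumptions for illustration, not taken from the SRMA documentation, so check the tool's help output for the exact syntax it expects.

```python
def make_ranges(contig, length, chunk_size):
    """Split a contig into consecutive 1-based, inclusive coordinate ranges.

    The "contig:start-end" format is an assumption for illustration only.
    """
    if length < 1 or chunk_size < 1:
        raise ValueError("length and chunk_size must be positive")
    ranges = []
    start = 1
    while start <= length:
        # The final chunk is truncated at the end of the contig.
        end = min(start + chunk_size - 1, length)
        ranges.append(f"{contig}:{start}-{end}")
        start = end + 1
    return ranges


# Hypothetical example: a 2.5 Mb contig split into 1 Mb chunks.
for r in make_ranges("chr1", 2_500_000, 1_000_000):
    print(r)
```

Each printed range could then be handed to one job on a cluster. How reads that span a chunk boundary are treated depends on the tool and is not addressed by this sketch.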

    Acknowledgments:
    Thanks to the Picard team for their fast responses to questions about the SAM/BAM Picard API. We would also like to thank the members of the Nelson Lab at UCLA.

    Sincerely,
    Nils Homer

  • #2
    Thanks, that looks like a nice bit of software.
    Just a question: how does it handle CIGAR stretches of N in split reads from RNA-seq?

    Thanks
    Tim


    • #3
      Thanks, this looks like a very useful piece of software!
      I tried this on BFAST output for a SOLiD run and got the following error:
      ctr:200 AL:1:38:8_15_1136 50b aligned read. java.lang.Exception: Error: could not understand the base
      at srma.SRMAUtil.colorSpaceNextBase(SRMAUtil.java:77)
      at srma.SRMAUtil.normalizeColorSpaceRead(SRMAUtil.java:150)
      at srma.Align.align(Align.java:84)
      at srma.SRMA.processList(SRMA.java:378)
      at srma.SRMA.doWork(SRMA.java:254)
      at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
      at srma.SRMA.main(SRMA.java:81)
      If it would help I can probably look for the "ctr:200 AL:1:38:8_15_1136 50b" read in the SAM file and output it.

      Meanwhile, I'm trying this on a BWA output for the same run which interestingly enough got me very bad pileup output (maybe due to samse output being in color-space like was reported here), there is no error yet but it will probably take a while.

      Thanks
      Eyal


      • #4
        Originally posted by tcezard
        Thanks, that looks like a nice bit of software.
        Just a question: how does it handle CIGAR stretches of N in split reads from RNA-seq?

        Thanks
        Tim
        It will treat them as though they are any other base, although I did not consider RNA seq as part of its application, which makes me think...
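For anyone wanting to check whether their alignments contain N operations (skipped reference regions, as produced by spliced RNA-seq aligners) before trying the re-aligner, here is a minimal sketch of a CIGAR parser. It follows the CIGAR operation letters defined in the SAM specification. The function names are hypothetical and this is not part of SRMA.

```python
import re

# CIGAR operations defined by the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, op) tuples for a SAM CIGAR string."""
    if cigar == "*":  # CIGAR unavailable
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure the regex consumed every character.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def has_skipped_region(cigar):
    """True if the alignment contains at least one N (skipped region) operation."""
    return any(op == "N" for _, op in parse_cigar(cigar))


print(has_skipped_region("50M"))          # contiguous alignment
print(has_skipped_region("20M1000N30M"))  # spliced alignment
```

Counting how many reads in a file return `True` gives a quick idea of how much of a data set would be affected by the N handling described above.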

        Originally posted by eyalbd
        Thanks, this looks like a very useful piece of software!
        I tried this on BFAST output for a SOLiD run and got the following error:

        If it would help I can probably look for the "ctr:200 AL:1:38:8_15_1136 50b" read in the SAM file and output it.

        Meanwhile, I'm trying this on a BWA output for the same run which interestingly enough got me very bad pileup output (maybe due to samse output being in color-space like was reported here), there is no error yet but it will probably take a while.

        Thanks
        Eyal
        I am sorry that you encountered an error, but this is exactly what I am looking for (bugs). Could you send me the first 500 reads in the BFAST BAM and the reference? I tested it on both BWA and BFAST with Illumina and SOLiD data and for good coverage (30x: ~1 billion mapped reads on human) it performed quite well. Hopefully we can get you to the same point with good results (i.e. pileup).


        Also, you can post questions to the SRMA help mailing list ([email protected]) or even become a developer!
        Nils
        Last edited by nilshomer; 04-22-2010, 08:47 AM. Reason: more information


        • #5
          Thanks Nils, I'll send you the first 500 lines (could you PM me an e-mail address to send them to?). Are you sure you want data from the binary file and not the SAM? Again, I have no problem with taking the time to find the troublesome read itself and append it as well. Good practice for my dwindling Python skills, and it might be more help to you.

          I'll sign up for the mailing list as well, thanks.
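Pulling a single troublesome read out of a SAM file, as offered above, can be sketched in a few lines of plain Python. This assumes an uncompressed, tab-delimited SAM text file. The function name, the sample records, and the read names below are hypothetical.

```python
def find_reads(sam_lines, read_name):
    """Yield SAM alignment lines whose QNAME (first column) equals read_name."""
    for line in sam_lines:
        # Header lines start with '@'; a QNAME cannot start with that character.
        if line.startswith("@"):
            continue
        qname = line.split("\t", 1)[0]
        if qname == read_name:
            yield line.rstrip("\n")


# Hypothetical in-memory example; for a real file, pass an open file handle:
#     with open("alignments.sam") as handle:
#         for hit in find_reads(handle, "read_2"):
#             print(hit)
sample = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "read_1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\n",
    "read_2\t16\tchr1\t200\t255\t50M\t*\t0\t0\t*\t*\n",
]
for hit in find_reads(sample, "read_2"):
    print(hit)
```

Because the function is a generator, it streams through the file line by line and does not need to hold a large SAM file in memory.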


          • #6
            Originally posted by eyalbd
            Thanks Nils, I'll send you the first 500 lines (could you PM me an e-mail address to send them to?). Are you sure you want data from the binary file and not the SAM? Again, I have no problem with taking the time to find the troublesome read itself and append it as well. Good practice for my dwindling Python skills, and it might be more help to you.

            I'll sign up for the mailing list as well, thanks.
            Thank you for sending me a great example to debug from. I have released a new version (0.1.3) that fixes this bug and provides an overall speed improvement. For more information, see http://srma.sf.net.


            • #7
              Thanks to those who have tested SRMA so far. It has now been run successfully on three human genome re-sequencing experiments. A manuscript is in final preparation, so users who want to read what's under the hood should shoot me a PM.

              Version 0.1.5 has been released, which includes a number of cosmetic changes as well as feature requests. As always, please post questions here or via email to [email protected].
              Last edited by nilshomer; 05-14-2010, 12:47 PM.


              • #8
                Realignment is important. I am looking forward to the publication. I am particularly interested in how realignment may improve the variant calls (I guess a lot) and how it compares to GATK. Kees from Sanger also has a sort of realigner.


                • #9
                  Originally posted by lh3
                  Realignment is important. I am looking forward to the publication. I am particularly interested in how realignment may improve the variant calls (I guess a lot) and how it compares to GATK. Kees from Sanger also has a sort of realigner.
                  My understanding of GATK is that it samples from possible consensuses and that the user first must identify the regions of interest as whole genome re-alignment is not possible (yet?) with GATK. SRMA can be applied to specific regions as desired but currently it is fast enough to apply to the whole genome as it treats the alignments as priors within a variant graph.

                  I have tested the results using both BFAST and BWA on separate experiments (so I don't get into an aligner shoot-out), and the re-aligner helps reduce the false-positive rate significantly for both, especially for indels and color space data.


                  • #10
                    well, i know what i'll be using at 8am on Monday. this is very apropos for me at the moment, looking forward to testing this Nils.


                    • #11
                      Nils, how much flexibility do I have with MINIMUM_ALLELE_PROBABILITY? Itching to give this a try with some high-coverage pathogen re-sequencing data, but we are particularly interested in rare variation. Just how much of a problem am I causing by lowering this to 0.5 or 1%?

                      Also, can it handle more than two alleles at a given position?


                      • #12
                        Originally posted by ohofmann
                        Nils, how much flexibility do I have with MINIMUM_ALLELE_PROBABILITY? Itching to give this a try with some high-coverage pathogen re-sequencing data, but we are particularly interested in rare variation. Just how much of a problem am I causing by lowering this to 0.5 or 1%?
                        For high coverage data, you probably want to raise both the "MINIMUM_ALLELE_COVERAGE" and "MINIMUM_ALLELE_PROBABILITY", since spurious coverage on variant alleles (or the reference allele) is more likely. If you have 1000x coverage, then with a 1% error rate you will see each possible allele many times. I don't know about the ploidy of your pathogen or if you are sequencing a mixture. The basic idea is to set minimum thresholds on what to include as a prior variant in your new re-alignment. I have not explored high coverage data on non-diploid genomes (cancer works well too), but would be happy to help tune the parameters with you.

                        Originally posted by ohofmann
                        Also, can it handle more than two alleles at a given position?
                        It can handle at most 4 plus a missing base.
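The point about spurious coverage at high depth can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope estimate only. It assumes sequencing errors are independent and spread evenly over the three alternative bases, which is a simplification and not a description of how SRMA models errors. The function name is hypothetical.

```python
def expected_error_observations(coverage, error_rate, n_alternatives=3):
    """Expected number of erroneous base observations at one position.

    Returns (total, per_allele), where per_allele assumes errors are spread
    evenly over n_alternatives alternative bases (an illustrative assumption).
    """
    if coverage < 0 or not 0.0 <= error_rate <= 1.0:
        raise ValueError("coverage must be >= 0 and error_rate within [0, 1]")
    total = coverage * error_rate
    per_allele = total / n_alternatives
    return total, per_allele


# The figures from the post above: 1000x coverage with a 1% error rate.
total, per_allele = expected_error_observations(1000, 0.01)
print(f"expected erroneous observations: {total:.1f}")
print(f"expected per alternative allele: {per_allele:.1f}")
```

At 1000x and a 1% error rate this comes to roughly 10 erroneous observations per position, so a minimum allele coverage of one or two reads would admit error-derived alleles at most positions. That is the reason for raising the thresholds as coverage grows.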


                        • #13
                          Thanks for the swift reply ;-) Going to give it a try, we've got a nice test set to measure improvements right away. Will report back once I've had a chance to tinker.


                          • #14
                            SRMA version 0.1.6 is now available!

                            This version has the following additions:

                            - the option MAXIMUM_TOTAL_COVERAGE will cause SRMA to ignore regions of high coverage.
                            - utilizes latest Picard release (v1-23) to traverse the reference FASTA while executing. This can dramatically improve initial start time of SRMA when processing later chromosomes since SRMA can jump straight to the reference sequence in question.
                            - adds support for soft-clipping within the SAM file (now fully compatible with the SAM spec and BWA, I hope).
                            - RANGE/RANGES options are supported in the submission script.
                            - the NUM_THREADS option will now allow for multi-threading. This option is experimental, and may decrease performance depending on application and system architecture. On my Mac it scales as expected (linearly), but on our CentOS 5 cluster it does not help at all.

                            Please let me know of any horrible failures or wonderful successes. I am always looking to debug and for good examples for the website. Thank you to all those who have helped by sending examples and test cases.

                            Sincerely,

                            Nils Homer


                            • #15
                              Originally posted by nilshomer
                              For high coverage data, you probably want to raise both the "MINIMUM_ALLELE_COVERAGE" and "MINIMUM_ALLELE_PROBABILITY", since spurious coverage on variant alleles (or the reference allele) is more likely. If you have 1000x coverage, then with a 1% error rate you will see each possible allele many times. I don't know about the ploidy of your pathogen or if you are sequencing a mixture. The basic idea is to set minimum thresholds on what to include as a prior variant in your new re-alignment. I have not explored high coverage data on non-diploid genomes (cancer works well too), but would be happy to help tune the parameters with you.
                              Nils, congratulations on getting the publication out!

                              I'm about to give this a try on an odd data set: 2 kb of genomic sequence at an average (but far from uniform) coverage of around 100,000x. It's a sequencing mixture, and the lower cutoff of variation we'd like to be able to detect is around 0.5% (after error correction), or 500 observations.

                              Other than the biological samples we also have a mix of known genomic frequencies and defined indel regions to optimize parameters. Can you think of a realistic set of starting parameters?
