  • Hydra: SV discovery in unique and duplicated sequence

    Hi all,

    I recently released HYDRA, a new algorithm I've developed for detecting structural variation (SV) breakpoints using paired-end mapping. Similar to other algorithms, Hydra detects SV breakpoints by clustering discordant paired-end alignments whose "signatures" corroborate the same putative breakpoint. Hydra can detect breakpoints caused by all classes of structural variation. It was also designed to detect variation in both unique and duplicated genomic regions (e.g., mutations in segmental duplications and transposon insertions); therefore, it will examine paired-end reads having multiple discordant alignments.
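A toy sketch of the clustering idea described above, in Python. This is not Hydra's actual implementation: the tuple layout, the `max_gap` tolerance, and the greedy grouping rule are all assumptions made purely for illustration.

```python
from collections import defaultdict


def cluster_discordant_pairs(pairs, max_gap=500):
    """Group discordant pairs that plausibly support the same breakpoint.

    pairs:   iterable of (chrom1, pos1, strand1, chrom2, pos2, strand2)
    max_gap: hypothetical tolerance (bp) for how far apart the ends of two
             pairs may lie and still be placed in the same cluster
    """
    # Pairs can only corroborate each other if they share the same
    # chromosomes and orientations (the same "signature").
    by_signature = defaultdict(list)
    for pair in pairs:
        chrom1, _, strand1, chrom2, _, strand2 = pair
        by_signature[(chrom1, strand1, chrom2, strand2)].append(pair)

    clusters = []
    for members in by_signature.values():
        members.sort(key=lambda p: (p[1], p[4]))
        current = [members[0]]
        for pair in members[1:]:
            previous = current[-1]
            close_left = abs(pair[1] - previous[1]) <= max_gap
            close_right = abs(pair[4] - previous[4]) <= max_gap
            if close_left and close_right:
                current.append(pair)
            else:
                clusters.append(current)
                current = [pair]
        clusters.append(current)
    return clusters
```

Each returned cluster is a list of pairs; the number of pairs in a cluster is the support for that putative breakpoint.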

    Hydra does not align paired-end data itself. It relies upon very sensitive alignment in order to eliminate all concordant pairs prior to SV discovery. Accordingly, please look at the suggested workflow.

    This algorithm is under constant development and evolves as the aligners and sequencing technologies mature, so please check the site for updates.

    If interested, HYDRA was used in a recent study published in Genome Research.



    Best,
    Aaron

  • #2
    Originally posted by quinlana View Post
    Hi all,

    I recently released HYDRA, a new algorithm I've developed for detecting structural variation (SV) breakpoints using paired-end mapping. Similar to other algorithms, Hydra detects SV breakpoints by clustering discordant paired-end alignments whose "signatures" corroborate the same putative breakpoint. Hydra can detect breakpoints caused by all classes of structural variation. It was also designed to detect variation in both unique and duplicated genomic regions (e.g., mutations in segmental duplications and transposon insertions); therefore, it will examine paired-end reads having multiple discordant alignments.

    Hydra does not align paired-end data itself. It relies upon very sensitive alignment in order to eliminate all concordant pairs prior to SV discovery. Accordingly, please look at the suggested workflow.

    This algorithm is under constant development and evolves as the aligners and sequencing technologies mature, so please check the site for updates.

    If interested, HYDRA was used in a recent study published in Genome Research.



    Best,
    Aaron
    Are there plans for a paper? It would be very useful for the community to have a detailed explanation. Great work!


    • #3
      Hi Nils,

      The general method is covered in the Methods/SuppMeth section of the GR paper. That said, a proper paper is in the works but will probably be held up until some of the newer approaches I am working on are complete. I'd also like to add direct BAM support, as well as a split-read component.

      The major bottleneck with the approach is sensitive alignment. If the alignments are crap the calls will be too. Perhaps not surprisingly (not surprising to you, given that you have apparently written a sensitive aligner), 99% of the time spent in the outlined workflow is on alignments screening for concordant pairs that are difficult to find owing to polymorphism and sequence error.

      Take care,
      Aaron
      Last edited by quinlana; 03-30-2010, 04:42 PM. Reason: typo/clarity of meaning


      • #4
        Hi Aaron,

        In your documentation, the first step seems to be to determine the median insert size and M.A.D. How do you do this? Should I run a preliminary alignment with, say, MAQ and take the summary from the .log file? Then again, that only gives the mean and S.D., not the median.


        Cheers

        Bryan


        • #5
          Hi Aaron,

          I am very impressed by how Hydra deals with finding CNVs in duplicated regions. You made some very cool novel observations in your Genome Research paper.

          Can Hydra also work with Illumina's mate-pair reads? If not, would this be something you would try to implement in the future?

          Thanks!

          P.S. Huge fan of BEDTools...


          • #6
            Originally posted by bawee View Post
            Hi Aaron,

            In your documentation, the first step seems to be to determine the median insert size and M.A.D. How do you do this? Should I run a preliminary alignment with, say, MAQ and take the summary from the .log file? Then again, that only gives the mean and S.D., not the median.


            Cheers

            Bryan
            Hi Bryan,
            Yes, I gloss over this step though it is important. The main reason for the vagueness is that each lab seems to have different opinions about how this should be done. One should extract the insert sizes from a BAM file, export file or MAQ output and then use R, Matlab, Perl, or what have you to calculate the median and MAD. I use slightly different criteria for which pairs to include in this calculation, depending on the experiment. In my mind, the main point here is that standard deviation can often grossly overstate the variability in a library owing to large outliers (stdev is the square root of the variance, which squares each deviation from the mean, so a few outliers dominate). MAD is more stable in this regard.
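A minimal Python sketch of that calculation, assuming the insert sizes are taken from the TLEN column (field 9) of SAM-formatted text, e.g. the output of `samtools view`. The function names are hypothetical, and which pairs to include is left to the caller, since that choice depends on the experiment.

```python
import statistics


def insert_sizes_from_sam(lines):
    """Collect positive TLEN values (SAM field 9) from SAM text lines.

    Keeping only positive values counts each pair once, because the
    leftmost mate of a pair carries the positive template length.
    """
    sizes = []
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        tlen = int(fields[8])
        if tlen > 0:
            sizes.append(tlen)
    return sizes


def median_and_mad(sizes):
    """Return (median, median absolute deviation) of the insert sizes."""
    med = statistics.median(sizes)
    mad = statistics.median([abs(size - med) for size in sizes])
    return med, mad
```

For a library such as `[480, 490, 500, 510, 520, 25000]`, the single 25 kb outlier barely moves the median and MAD, whereas `statistics.stdev` on the same list is in the thousands.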

            I hope this helps.
            Aaron


            • #7
              Originally posted by Margarida View Post
              Hi Aaron,

              I am very impressed by how Hydra deals with finding CNVs in duplicated regions. You made some very cool novel observations in your Genome Research paper.

              Can Hydra also work with Illumina's mate-pair reads? If not, would this be something you would try to implement in the future?

              Thanks!

              P.S. Huge fan of BEDTools...

              Hi Margarida,
              Hydra makes no assumptions about the protocol used to generate your sequence data. Hydra requires that you exclude all concordant paired-ends or matepairs prior to execution. What should be input to Hydra are all of the non-duplicate discordant mappings: from these mappings Hydra will cluster similarly discordant mappings and report aberrant breakpoints. The onus then falls on the user to determine if a given breakpoint suggests a deletion (i.e., excessive distance with +/- for paired-end, -/+ for matepair), duplication, etc.

              That said, currently the hard part with matepairs is alignment. Some aligners handle matepairs, some don't. The approach I have been using is to reverse complement the FASTQ files (and reverse the qualities) prior to alignment. Upon doing so, all aligners are fair game as the reads now smell like paired-end. In the case of Illumina reads, the downside is that the lower quality bases are now the "seed" for Burrows-Wheeler (BW) aligners. To get around this, I use a tiered approach as described on the Hydra website. First align with BWA or Bowtie and assume that some truly concordant pairs will be missed for the reason above. Then take all remaining discordant matepairs and align them with more sensitive aligners that examine all possible kmer seeds (e.g., Mosaik, Novoalign, BFAST, others).
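The reverse-complement step can be sketched in a few lines of Python. This is only an illustration, assuming plain four-line FASTQ records; the function name is hypothetical and no particular tool's behavior is implied.

```python
# Complement table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_fastq(lines):
    """Yield FASTQ lines with each read reverse complemented.

    lines: iterable of FASTQ lines, exactly four per record.
    The quality string is reversed so every score stays with its base.
    """
    records = zip(*[iter(lines)] * 4)  # group the lines four at a time
    for header, seq, plus, qual in records:
        yield header.rstrip("\n")
        yield seq.rstrip("\n").translate(_COMPLEMENT)[::-1]
        yield plus.rstrip("\n")
        yield qual.rstrip("\n")[::-1]
```

Usage would be along the lines of writing `"\n".join(reverse_complement_fastq(open(path)))` to a new file for each end of the library before alignment.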

              I hope this helps.
              Aaron


              • #8
                Originally posted by quinlana View Post
                Hi Bryan,
                Yes, I gloss over this step though it is important. The main reason for the vagueness is that each lab seems to have different opinions about how this should be done. One should extract the insert sizes from a BAM file, export file or MAQ output and then use R, Matlab, Perl, or what have you to calculate the median and MAD. I use slightly different criteria for which pairs to include in this calculation, depending on the experiment. In my mind, the main point here is that standard deviation can often grossly overstate the variability in a library owing to large outliers (stdev is the square root of the variance, which squares each deviation from the mean, so a few outliers dominate). MAD is more stable in this regard.

                I hope this helps.
                Aaron
                Do you think 99% threshold cut-offs (or whatever value you choose) rather than MAD would be even better?

                That was my feeling, at least. Neither MAD nor stdev is going to do a particularly good job with non-normal skewed/bimodal distributions, which is what LMP SOLiD datasets, at least, usually look like (though MAD should do better than stdev). I'm not sure what your test sets have been with Hydra, but MAD was probably fine if the distribution of paired-end distances approached a normal distribution/wasn't heavily skewed or bimodal in those sets. However, in LMP data like that I've dealt with, I think you're really going to hurt your accuracy using MAD (or stdev, for that matter). Food for thought.
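A rough Python sketch of the percentile-cutoff alternative, assuming a nearest-rank percentile and a symmetric split of the excluded tails. Both choices, and the function names, are assumptions; the coverage value is whatever you choose.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def central_interval(insert_sizes, coverage=99.0):
    """Return (low, high) cutoffs that bracket the central `coverage`
    percent of the observed insert sizes.

    No distribution shape is assumed, so skewed or bimodal libraries
    are described by their empirical tails rather than by a spread
    around a single center.
    """
    values = sorted(insert_sizes)
    tail = (100.0 - coverage) / 2
    return percentile(values, tail), percentile(values, 100.0 - tail)
```

Pairs whose span falls outside the returned interval would then be the candidates treated as discordant by size.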
                Last edited by Michael.James.Clark; 05-12-2010, 06:30 AM.

                • #9
                  Hi all,

                  I just posted version 0.5.3 of Hydra (http://code.google.com/p/hydra-sv/). This release corrects a couple of bugs and provides an improved tool for removing duplicate mappings --- especially those arising from repetitive sequence such as segmental duplications and transposons. As a result, there are fewer false positive breakpoint calls.

                  I am now focusing on new features such as support for multiple sequence libraries/samples and split-read breakpoint detection.

                  Best,
                  Aaron


                  • #10
                    Hi all,
                    I have been using Hydra, and I am stuck at this step of the workflow: "Tier 2 alignment. Grab the discordant alignments from the tier 1 BAM files and create FASTQ files for the discordant pairs. Align the tier 1 discordant pairs with a more sensitive aligner such as Novoalign or Mosaik."
                    I use Mosaik, but when I try to build the human genome index with MosaikBuild, I can't get it to run. I would like to see whether Novoalign can handle the human genome instead. If anyone has the Novoalign software, could you send me a copy?


                    • #11
                      Hi Everybody,

                      Just a heads up that Novoalign, as of V2.07.00, now handles Illumina mate pairs, so you won't need to reverse complement those FASTQ files. Get the latest version at www.novocraft.com


                      • #12
                        Hi ZEE,
                        I want to download the Novoalign software from www.novocraft.com, but I can't open the website. If you have a local copy, could you send it to me? Thanks very much. My e-mail: "[email protected]"


                        • #13
                          Hi,
                          I followed the workflow at http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow and ran Hydra. I am stuck at the last step, which uses BEDTools to annotate the structural variation. I really hope somebody can help me: which BEDTools program should I use to annotate my Hydra results if I only want to find deletions and insertions?


                          • #14
                            Originally posted by tinacai View Post
                            Hi,
                            I followed the workflow at http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow and ran Hydra. I am stuck at the last step, which uses BEDTools to annotate the structural variation. I really hope somebody can help me: which BEDTools program should I use to annotate my Hydra results if I only want to find deletions and insertions?
                            Unfortunately, there is not a simple answer to this. The basics are as follows (assuming paired-end Illumina reads):


                            1. Intra-chromosomal Hydra breakpoints with +/- orientation and a size >> library size are typically deletions in the test genome or transposon insertions that occurred in the reference genome. One can discern the two by using the pairToBed program and asking for substantial overlap between the inner span of a breakpoint and existing transposon annotations.


                            2. Intra-chromosomal Hydra breakpoints with +/- orientation and a size << library size are typically insertions in the test genome or transposon insertions that occurred in the reference genome.



                            3. Intra-chromosomal Hydra breakpoints with -/- or +/+ orientation are typically inversion breakpoints in the test genome.


                            4. Inter-chromosomal Hydra breakpoints are typically either transposon insertions in the test genome, translocations in the test genome, chimeric molecules, or mapping artifacts. Transposon insertions can be identified with pairToBed while asking that one end of the breakpoint overlaps with an annotated and recent transposon. Reciprocal translocations should have symmetric breakpoints between the two chromosomes involved. Chimeras are unavoidable, but most folks require at least two supporting readpairs to eliminate them under the assumption that they occur randomly (yet they don't always). Mapping artifacts also happen, but this is why we suggest aligning with the most sensitive settings possible to avoid pairs that appear to be discordant solely because the aligner could not find the concordant alignment.
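The four rules above can be turned into a first-pass triage function. The Python sketch below is only an approximation under stated assumptions: the argument names are hypothetical, `strand1` belongs to the leftmost end, and `tolerance` is an arbitrary stand-in for "much larger" or "much smaller" than the library size, which the rules do not quantify. It does not separate transposon-related calls; that still needs annotation overlap with pairToBed.

```python
def classify_breakpoint(chrom1, strand1, chrom2, strand2, span,
                        library_size, tolerance):
    """Return a coarse label for one breakpoint (paired-end Illumina).

    span:         distance between the two ends of the breakpoint (bp)
    library_size: expected insert size of the library (bp)
    tolerance:    hypothetical margin (bp) around library_size
    """
    if chrom1 != chrom2:
        # Rule 4: transposon insertion, translocation, chimera or artifact.
        return "inter-chromosomal"
    if strand1 == strand2:
        # Rule 3: +/+ or -/- orientation.
        return "inversion"
    if (strand1, strand2) == ("+", "-"):
        if span > library_size + tolerance:
            # Rule 1: deletion, or transposon insertion in the reference.
            return "deletion-like"
        if span < library_size - tolerance:
            # Rule 2: insertion in the test genome.
            return "insertion-like"
        return "near-expected-size"
    # -/+ orientation is not covered by the rules above.
    return "unclassified"
```

Labels are deliberately coarse: a "deletion-like" call is a candidate to inspect, not a final annotation, since many loci are more complex than these rules allow.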


                            I hope this helps. I am planning to write a script this fall that will help with this process. The difficulty is that the structure of many SV loci is not easily described by the rules above (see the Hydra GR paper for examples). In short, many SV loci are incredibly complex.

                            Aaron


                            • #15
                              Hi Quinlana,
                              I still have no idea how to do the last step. The workflow in your paper includes BWA, Novoalign, Hydra and BEDTools, but I don't think the page at http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow is a good enough guide for it.
