SEQanswers

Old 03-30-2010, 09:47 AM   #1
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default Hydra: SV discovery in unique and duplicated sequence

Hi all,

I recently released HYDRA, a new algorithm I've developed for detecting structural variation (SV) breakpoints using paired-end mapping. Similar to other algorithms, Hydra detects SV breakpoints by clustering discordant paired-end alignments whose "signatures" corroborate the same putative breakpoint. Hydra can detect breakpoints caused by all classes of structural variation. It was also designed to detect variation in both unique and duplicated genomic regions (e.g., mutations in segmental duplications and transposon insertions); therefore, it will examine paired-end reads having multiple discordant alignments.

Hydra does not align paired-end data itself. It relies upon very sensitive alignment in order to eliminate all concordant pairs prior to SV discovery. Accordingly, please look at the suggested workflow.

This algorithm is under constant development and evolves as the aligners and sequencing technologies mature, so please check the site for updates.

If you are interested, Hydra was used in a recent study published in Genome Research.

http://code.google.com/p/hydra-sv/

Best,
Aaron
Old 03-30-2010, 11:51 AM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Are there plans for a paper? It would be very useful for the community to have a detailed explanation. Great work!
Old 03-30-2010, 01:10 PM   #3
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Hi Nils,

The general method is covered in the Methods/SuppMeth section of the GR paper. That said, a proper paper is in the works but will probably be held up until some of the newer approaches I am working on are complete. I'd also like to add direct BAM support, as well as a split-read component.

The major bottleneck with the approach is sensitive alignment. If the alignments are crap the calls will be too. Perhaps not surprisingly (not surprising to you, given that you have apparently written a sensitive aligner), 99% of the time spent in the outlined workflow is on alignments screening for concordant pairs that are difficult to find owing to polymorphism and sequence error.

Take care,
Aaron

Last edited by quinlana; 03-30-2010 at 04:42 PM. Reason: typo/clarity of meaning
Old 05-10-2010, 09:00 PM   #4
bawee
Junior Member
 
Location: Australia

Join Date: Feb 2010
Posts: 4
Default

Hi Aaron,

In your documentation, the first step seems to be to determine the median insert size and M.A.D. How do you do this? Should I run a preliminary alignment with, say, MAQ and take the summary from the .log file? Then again, that only gives the mean and S.D., not the median.


Cheers

Bryan
Old 05-11-2010, 01:38 PM   #5
Margarida
Junior Member
 
Location: Ithaca, NY

Join Date: Jan 2010
Posts: 6
Default

Hi Aaron,

I am very impressed by how Hydra deals with finding CNVs in duplicated regions. You made some very cool novel observations in your Genome Research paper.

Can Hydra also work with Illumina's mate-pair reads? If not, would this be something you would try to implement in the future?

Thanks!

P.S. Huge fan of BEDTools...
Old 05-11-2010, 04:25 PM   #6
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Hi Bryan,
Yes, I gloss over this step even though it is important. The main reason for the vagueness is that each lab seems to have a different opinion about how this should be done. One should extract the insert sizes from a BAM file, export file or MAQ output and then use R, Matlab, Perl, or what have you to calculate the median and MAD. I use slightly different criteria for which pairs to include in this calculation, depending on the experiment. In my mind, the main point here is that standard deviation can often grossly overstate the variability in a library owing to large outliers (stdev is the square root of the variance, which squares each deviation, so a few extreme insert sizes dominate it). MAD is more stable in this regard.
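
A minimal sketch of the median/MAD calculation described above, in Python rather than R or Perl. The function name `median_and_mad` and the example insert sizes are made up for illustration, and extracting the insert sizes from a BAM, export or MAQ file is left out.

```python
from statistics import median, stdev


def median_and_mad(insert_sizes):
    """Return (median, median absolute deviation) of the insert sizes."""
    med = median(insert_sizes)
    # MAD: the median of the absolute deviations from the median.
    mad = median(abs(size - med) for size in insert_sizes)
    return med, mad


# Hypothetical library: a tight cluster around 500 bp plus one large outlier.
sizes = [480, 490, 495, 500, 505, 510, 520, 25000]
med, mad = median_and_mad(sizes)
print("median:", med, "MAD:", mad)
print("stdev :", stdev(sizes))  # inflated by the single outlier
```

With these made-up values the single 25 kb pair inflates the standard deviation to thousands of bases, while the MAD stays close to the spread of the main cluster.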

I hope this helps.
Aaron
Old 05-11-2010, 04:34 PM   #7
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Hi Margarida,
Hydra makes no assumptions about the protocol used to generate your sequence data. Hydra requires that you exclude all concordant paired-ends or matepairs prior to execution. What should be input to Hydra are all of the non-duplicate discordant mappings: from these mappings Hydra will cluster similarly discordant mappings and report aberrant breakpoints. The onus then falls on the user to determine if a given breakpoint suggests a deletion (i.e., excessive distance with +/- for paired-end, -/+ for matepair), duplication, etc.

That said, currently the hard part with matepairs is alignment. Some aligners handle matepairs, some don't. The approach I have been using is to reverse complement the FASTQ files (reversing the qualities as well) prior to alignment. Upon doing so, all aligners are fair game as the reads now smell like paired-end. In the case of Illumina reads, the downside is that the lower-quality bases are now the "seed" for Burrows-Wheeler (BW) aligners. To get around this, I use a tiered approach as described on the Hydra website. First align with BWA or Bowtie and assume that some truly concordant pairs will be missed for the reason above. Then take all remaining discordant matepairs and align them with more sensitive aligners that examine all possible k-mer seeds (e.g., Mosaik, Novoalign, BFAST, others).
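
The reverse-complement step above can be sketched as follows. This is only an illustration: the function name `revcomp_fastq` is hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines) containing only A, C, G, T and N.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp_fastq(lines):
    """Yield the lines of a FASTQ file with every read reverse complemented.

    The quality string is reversed so that each quality value stays
    attached to its base. An incomplete trailing record is dropped.
    """
    records = zip(*[iter(lines)] * 4)  # group the lines four at a time
    for header, seq, plus, qual in records:
        yield header.strip()
        yield seq.strip().translate(COMPLEMENT)[::-1]
        yield plus.strip()
        yield qual.strip()[::-1]


example = ["@read1\n", "ACGTN\n", "+\n", "IIHG#\n"]
for line in revcomp_fastq(example):
    print(line)
```

For a real file, the same generator can be fed an open file handle and its output written to a new FASTQ, once for each end of the matepair.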

I hope this helps.
Aaron
Old 05-12-2010, 06:26 AM   #8
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Do you think 99% threshold cut-offs (or whatever value you choose) rather than MAD would be even better?

That was my feeling, at least. Neither MAD nor stdev is going to do a particularly good job with non-normal skewed/bimodal distributions, which is what LMP SOLiD datasets, at least, usually look like (though MAD should do better than stdev). I'm not sure what your test sets have been with Hydra, but MAD was probably fine if the distribution of paired-end distances in those sets approached a normal distribution and wasn't heavily skewed or bimodal. However, in LMP data like what I've dealt with, I think you're really going to hurt your accuracy using MAD (or stdev, for that matter). Food for thought.
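
A minimal sketch of the percentile-based cut-off idea, for comparison with median +/- MAD. The names `percentile` and `concordant_range`, the nearest-rank definition, and the choice of the central 99% (0.5th to 99.5th percentile) are assumptions made for illustration; the thread does not specify them.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def concordant_range(insert_sizes, lower_pct=0.5, upper_pct=99.5):
    """Return (low, high) empirical cut-offs covering the central mass.

    Pairs with an insert size outside this range would be treated as
    discordant, whatever the shape of the distribution.
    """
    ordered = sorted(insert_sizes)
    return percentile(ordered, lower_pct), percentile(ordered, upper_pct)


# Hypothetical insert sizes: 1..200, so the cut-offs are easy to check by eye.
low, high = concordant_range(range(1, 201))
print(low, high)
```

Because the cut-offs come straight from the empirical distribution, they make no assumption of symmetry, which is the point being raised for skewed or bimodal libraries.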
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]

Last edited by Michael.James.Clark; 05-12-2010 at 06:30 AM.
Old 08-20-2010, 10:03 AM   #9
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Hi all,

I just posted version 0.5.3 of Hydra (http://code.google.com/p/hydra-sv/). This release corrects a couple of bugs and provides an improved tool for removing duplicate mappings --- especially those arising from repetitive sequence such as segmental duplications and transposons. As a result, there are fewer false positive breakpoint calls.

I am now focusing on new features such as support for multiple sequence libraries/samples and split-read breakpoint detection.

Best,
Aaron
Old 08-24-2010, 07:21 PM   #10
tinacai
Member
 
Location: china

Join Date: Apr 2010
Posts: 18
Default

Hi all,
I have been using Hydra and am at the tier 2 alignment step: "Grab the discordant alignments from the tier 1 BAM files and create FASTQ files for the discordant pairs. Align the tier 1 discordant pairs with a more sensitive aligner such as Novoalign or Mosaik." I am using Mosaik, but I cannot get MosaikBuild to build the human genome index. Has anyone managed to run it on the human genome? Also, if anyone has the Novoalign software, could you send me a copy?
Old 08-25-2010, 02:29 AM   #11
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Hi Everybody,

Just a heads up that Novoalign, as of V2.07.00, handles Illumina mate pairs, so you won't need to reverse complement those FASTQ files. Get the latest version at www.novocraft.com

Old 08-25-2010, 03:02 AM   #12
tinacai
Member
 
Location: china

Join Date: Apr 2010
Posts: 18
Default

Hi Zee,
I tried to download Novoalign from www.novocraft.com, but I can't open the website. If you have a local copy, could you send it to me? Thanks very much. My e-mail is "wzyxy2012@yahoo.cn".
Old 08-25-2010, 04:27 AM   #13
tinacai
Member
 
Location: china

Join Date: Apr 2010
Posts: 18
Default

Hi,
I followed the workflow at http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow and ran Hydra. I am stuck at the last step, which uses BEDTools to annotate the structural variants. I really hope somebody can help me: which BEDTools program should I use to annotate my Hydra results if I only want to find deletions and insertions?
Old 08-25-2010, 04:59 AM   #14
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Unfortunately, there is not a simple answer to this. The basics are as follows (assuming paired-end Illumina reads):

1. Intra-chromosomal Hydra breakpoints with +/- orientation and a size >> library size are typically deletions in the test genome or transposon insertions that occurred in the reference genome. One can discern the two by using the pairToBed program and asking for substantial overlap between the inner span of a breakpoint and existing transposon annotations.

2. Intra-chromosomal Hydra breakpoints with +/- orientation and a size << library size are typically insertions in the test genome or transposon insertions that occurred in the reference genome.

3. Intra-chromosomal Hydra breakpoints with -/- or +/+ orientation are typically inversion breakpoints in the test genome.

4. Inter-chromosomal Hydra breakpoints are typically either transposon insertions in the test genome, translocations in the test genome, chimeric molecules, or mapping artifacts. Transposon insertions can be identified with pairToBed while asking that one end of the breakpoint overlaps with an annotated and recent transposon. Reciprocal translocations should have symmetric breakpoints between the two chromosomes involved. Chimeras are unavoidable, but most folks require at least two supporting readpairs to eliminate them under the assumption that they occur randomly (yet they don't always). Mapping artifacts also happen, but this is why we suggest aligning with the most sensitive settings possible to avoid pairs that appear to be discordant solely because the aligner could not find the concordant alignment.
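
The four rules above can be sketched as a first-pass labelling function. This is a rough illustration, not part of Hydra: the function name, its arguments and the returned labels are hypothetical, and the `lower` and `upper` size cut-offs (standing in for "<< library size" and ">> library size") must be chosen by the user. The transposon checks with pairToBed are not reproduced here.

```python
def classify_breakpoint(chrom1, chrom2, strand1, strand2, size, lower, upper):
    """Return a coarse label for one breakpoint (paired-end Illumina reads).

    `size` is the distance spanned by the breakpoint; `lower` and `upper`
    are user-chosen cut-offs around the expected library insert size.
    """
    if chrom1 != chrom2:
        # Rule 4: several explanations are possible; further checks needed.
        return "inter-chromosomal (transposon, translocation, chimera or artifact)"
    orientation = (strand1, strand2)
    if orientation in {("+", "+"), ("-", "-")}:
        return "inversion breakpoint"  # rule 3
    if orientation == ("+", "-"):
        if size > upper:
            return "deletion or reference transposon insertion"  # rule 1
        if size < lower:
            return "insertion"  # rule 2
    # Anything else is not covered by the four rules above.
    return "unclassified"


print(classify_breakpoint("chr1", "chr1", "+", "-", 12000, lower=300, upper=700))
print(classify_breakpoint("chr1", "chr5", "+", "-", 0, lower=300, upper=700))
```

Any label produced this way is only a starting point, since complex loci will not fit these categories cleanly.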


I hope this helps. I am planning to write a script this fall that will help with this process. The difficulty is that the structures of many SV loci are not easily described by the rules above (see the Hydra GR paper for examples). In short, many SV loci are incredibly complex.

Aaron
Old 08-25-2010, 06:24 PM   #15
tinacai
Member
 
Location: china

Join Date: Apr 2010
Posts: 18
Default

Hi Quinlana,
I still do not understand the last step. The workflow in your paper includes BWA, Novoalign, Hydra and BEDTools, but I do not think the page at http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow is a good guide to it.
Old 08-26-2010, 05:04 AM   #16
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Thank you for letting me know. When I find some time I will try to make the approach more intuitive.
Old 08-30-2010, 09:05 AM   #17
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Perhaps a python/shell script or a Makefile would simplify things. Any plans for this in the future? Perhaps I could help out on this task since we are working on improving use cases for our aligner.
Old 08-30-2010, 09:18 AM   #18
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Hi Zee,
Thanks for the suggestion. However, I am reluctant to automate anything about the process, as it very much depends on the quality and variability of the user data, the organism, etc. I think it's best for me to merely improve the documentation and give thorough explanations of why things are done the way they are.

Aaron
Old 09-16-2010, 09:08 AM   #19
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

Hi all,
I just posted a script on the Hydra site that allows one to convert the Hydra breakpoint calls (in BEDPE format) to BED12 for visualization on IGV, UCSC, etc. I've had several requests for such a tool and finally got around to doing it.

http://code.google.com/p/hydra-sv/do...edpeToBed12.py

Best,
Aaron
Old 02-08-2011, 09:08 AM   #20
wdt
Member
 
Location: Southwest

Join Date: Oct 2009
Posts: 19
Default

Aaron,

This is a samtools question, but since the output is going into Hydra, I thought I would post it here.

Following the workflow on http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow, for extracting the discordant reads using

samtools view -uF 2 sample.tier1.bam | \
bamToFastq -bam stdin \
-fq1 sample.tier1.disc.1.fq \
-fq2 sample.tier1.disc.2.fq

Should we see exactly the same number of reads (and identical pairwise read IDs) in the 1.fq and 2.fq files?

For the 1.fq and 2.fq files I have, the read IDs do not match up. Do you require the BAM to be sorted by read ID?


Thanks in advance.

Last edited by wdt; 02-08-2011 at 11:17 AM.
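
A quick way to check whether two FASTQ files are properly paired is sketched below. The function names `read_ids` and `compare_pairs` are hypothetical, and the sketch assumes plain four-line FASTQ records whose read names may end in /1 or /2. If the IDs do not line up, one possible cause (an assumption here, not something confirmed in this thread) is that the BAM is coordinate-sorted, so mates are not adjacent; grouping the BAM by read name with `samtools sort -n` before conversion would be worth trying.

```python
def read_ids(lines):
    """Return the read IDs of 4-line FASTQ records, without a trailing /1 or /2."""
    ids = []
    for index, line in enumerate(lines):
        if index % 4 == 0:  # header line of each record
            name = line.strip()[1:].split()[0]  # drop '@' and any description
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            ids.append(name)
    return ids


def compare_pairs(fq1_lines, fq2_lines):
    """Return (reads in file 1, reads in file 2, index of first ID mismatch or None)."""
    ids1, ids2 = read_ids(fq1_lines), read_ids(fq2_lines)
    mismatch = next(
        (i for i, (a, b) in enumerate(zip(ids1, ids2)) if a != b), None
    )
    return len(ids1), len(ids2), mismatch


# Made-up records: the second pair does not match.
fq1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "ACGT", "+", "IIII"]
fq2 = ["@a/2", "TTTT", "+", "IIII", "@c/2", "TTTT", "+", "IIII"]
print(compare_pairs(fq1, fq2))
```

For real data, pass the two open file handles instead of the example lists; equal counts and a mismatch index of `None` indicate that the files pair up record by record.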
Tags
hydra, structural-variation