SEQanswers

Old 03-09-2012, 07:11 AM   #1
CNVboy
Member
 
Location: boston

Join Date: Jun 2011
Posts: 27
Default General question: human CNV/Structural variants algorithms using next-gen data cannot

This is a general question about the human CNV/structural variant field (with next-gen data, NOT arrays).

As shown in the 1000 Genomes Project, groups have developed different algorithmic approaches to identify structural variants (mainly three: paired-end, read-depth and split-read).

However, results from these approaches barely overlap with each other (of course they have different preferences; split-read, say, is powerful for small indels), and the false positive rate seems quite high (or we simply don't know the false positive rate, because we have no alternative approach to validate small structural variants the way we use array CGH for large ones).

Or, in simple words, I don't trust even the mainstream, widely used tools like BreakDancer and CNVnator (I only have relative confidence in Pindel, because it provides nucleotide-resolution breakpoints). Do you trust them?

If not, then what should we do? Carry out some post-processing or filtering to reduce the potential false positives? For example, adjust the read-depth threshold for read-depth-based approaches, or limit our attention to calls supported by uniquely mapping discordant read pairs for paired-end-based approaches?

Or do we need to develop our own code for our specific research? What software do you guys use? (say CNVnator, BreakDancer)

Personally I would say that when sequencing someday becomes powerful enough to accurately produce long-enough reads, we can say goodbye to these mapping-based methods, because we can simply assemble all the reads without the problems caused by repetitive sequences in the human genome.
Old 03-10-2012, 06:20 AM   #2
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

I usually take a 3-tiered approach, using CNVnator (read depth), BreakDancer (read pair) and CREST (split read). However, I too see a lot of false positives from each tool. It would be great if we could get a consensus from the group on how to remove these issues.

One approach I take is that I created an SV_BLacklist file. It combines the Gaps, Segmental Duplications, and RepeatMasker tracks from UCSC. If either end of the SV intersects one of these features, I remove it. Undoubtedly, this removes some true positives, but if I don't, my Circos plots are full of intrachromosomal events.
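The blacklist filter described above could be sketched roughly as follows. This is only an illustration, not the poster's actual script: the function names, the `(chrom1, pos1, chrom2, pos2)` call layout and the toy coordinates are all assumptions, and the blacklist is taken to be BED-style half-open intervals already loaded into memory.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(blacklist):
    """Merge (chrom, start, end) intervals per chromosome into sorted lists.

    Coordinates are assumed to be 0-based and half-open, as in BED files.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def hits_blacklist(index, chrom, pos):
    """Return True if position `pos` lies inside a blacklisted interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < merged[i][1]


def filter_calls(calls, index):
    """Keep only calls whose two breakpoints both fall outside the blacklist."""
    return [c for c in calls
            if not hits_blacklist(index, c[0], c[1])
            and not hits_blacklist(index, c[2], c[3])]


# Toy data (hypothetical coordinates, not taken from any UCSC track).
blacklist = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
calls = [
    ("chr1", 50, "chr1", 500),   # both ends clean
    ("chr1", 250, "chr1", 900),  # first end inside chr1:100-300
    ("chr1", 400, "chr2", 10),   # second end inside chr2:0-50
    ("chr3", 5, "chr3", 100),    # chromosome absent from the blacklist
]
kept = filter_calls(calls, build_index(blacklist))
```

Dropping a call when either end is blacklisted is the strict variant described in the post; flagging such calls instead of removing them would keep the true positives visible for later review.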

Anyone else want to post how they filter their results?
Old 03-10-2012, 09:37 AM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default "self chain" and "mapability" data for filtering problem regions ...

For further filtering, or for flagging reads as problematic, you might try the "self chain" track:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chainSelf.sql

There are also "mapability" tracks available at the same place, for example:
wgEncodeCrgMapabilityAlign75mer.sql
Old 03-12-2012, 02:49 AM   #4
KaiYe
Senior Member
 
Location: amsterdam

Join Date: Jun 2009
Posts: 133
Default

The main reason those tools don't overlap is their different size ranges. Split-read methods like Pindel are more sensitive to small variants, from 1 bp to several kb, while being less sensitive and having a high FDR for larger variants. Read-depth methods like CNVnator start to be useful for variants larger than hundreds of bp, and the larger the variant, the more sensitive and reliable they are. Read-pair methods like BreakDancer work well for variants larger than dozens of bp.

All methods suffer in repetitive regions, where indels and SVs occur frequently. If you care more about FDR, remove all calls that overlap repetitive regions. If you want to understand the biology and know all the changes in samples of interest, such as cancer, you may not wish to filter calls based on repetitiveness alone, even though your validation experiments may fail.
Old 03-12-2012, 03:24 AM   #5
henry.wood
Member
 
Location: Leeds, UK

Join Date: Apr 2010
Posts: 63
Default

One thing I was playing around with a little while back was to try to assemble the predicted breakpoints. I took all the read pairs in which either mate mapped near the breakpoint and put them into Velvet. When it worked, it made a contig from either side of the breakpoint and one for the actual breakpoint. I didn't have time to pursue it further, but it tended to mostly agree with CREST.
Old 03-17-2012, 10:06 AM   #6
CNVboy
Member
 
Location: boston

Join Date: Jun 2011
Posts: 27
Default

Hi Kai, you are definitely right. Pindel is quite special in that it finds small indels that the others cannot. So I never compare Pindel results with CNVnator/BreakDancer/VariationHunter. But the problem is that when you compare CNVnator with BreakDancer/VH, which identify SVs of similar length, very few calls overlap. This is quite frustrating.
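How "overlap" is defined matters a lot for this kind of comparison. A minimal sketch of one common convention, reciprocal overlap, is shown here. The 50% threshold, the function names and the toy coordinates are assumptions made for illustration; they are not taken from CNVnator, BreakDancer or VariationHunter.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions of intervals a and b.

    Each interval is (chrom, start, end) with half-open coordinates.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def concordant(calls_a, calls_b, min_frac=0.5):
    """Return the calls in calls_a matched by at least one call in calls_b."""
    return [a for a in calls_a
            if any(reciprocal_overlap(a, b) >= min_frac for b in calls_b)]


# Toy data (hypothetical): read-depth calls versus read-pair calls.
depth_calls = [("chr1", 1000, 3000), ("chr1", 10000, 20000)]
pair_calls = [("chr1", 1200, 3100), ("chr1", 10000, 11000)]
shared_calls = concordant(depth_calls, pair_calls)
```

In the toy data the second read-depth call shares only 10% of its length with the read-pair call, so it counts as discordant even though the two start at the same position. Read-depth boundaries are typically much coarser than read-pair breakpoints, so a strict threshold can by itself make the overlap look small.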



Quote:
Originally Posted by KaiYe View Post
The main reason those tools don't overlap is their different size ranges. Split-read methods like Pindel are more sensitive to small variants, from 1 bp to several kb, while being less sensitive and having a high FDR for larger variants. Read-depth methods like CNVnator start to be useful for variants larger than hundreds of bp, and the larger the variant, the more sensitive and reliable they are. Read-pair methods like BreakDancer work well for variants larger than dozens of bp.

All methods suffer in repetitive regions, where indels and SVs occur frequently. If you care more about FDR, remove all calls that overlap repetitive regions. If you want to understand the biology and know all the changes in samples of interest, such as cancer, you may not wish to filter calls based on repetitiveness alone, even though your validation experiments may fail.
Old 03-17-2012, 04:22 PM   #7
CNVboy
Member
 
Location: boston

Join Date: Jun 2011
Posts: 27
Default

Quote:
Originally Posted by henry.wood View Post
One thing I was playing around with a little while back was to try to assemble the predicted breakpoints. I took all the read pairs in which either mate mapped near the breakpoint and put them into Velvet. When it worked, it made a contig from either side of the breakpoint and one for the actual breakpoint. I didn't have time to pursue it further, but it tended to mostly agree with CREST.

Sounds interesting. Can you explain a little bit more?
One problem for assembly: what if, say, the deletion is heterozygous, which means there will still be some reads mapping to the deleted part?
So maybe you mean we assemble all reads "outside" the calls? Since there could be soft-clipped reads, could we then assemble them into a contig that represents the real genome structure of our sample?
thx
Old 03-18-2012, 02:35 AM   #8
Zam
Member
 
Location: Oxford

Join Date: Apr 2010
Posts: 51
Default

Speaking of assembly of SVs, you could also try Cortex (sorry for the plug, I am an author)
http://www.nature.com/ng/journal/v44...l/ng.1028.html
http://cortexassembler.sourceforge.n...ortex_var.html
Sensitivity drops with variant length (increased chance of a coverage gap, plus graph complexity), so it won't assemble the very large CNVs or segdups. Roughly speaking, you can call hets up to kb's in size and homs up to tens or hundreds of kb (depending on species/genome/read length). There's a lot of detail in the supp info about what you will be able to assemble for a given experiment.
Old 03-18-2012, 02:45 AM   #9
Zam
Member
 
Location: Oxford

Join Date: Apr 2010
Posts: 51
Default

I should have said this explicitly: there is a trade-off. Cortex assembles full alleles, giving you flank, allele1, allele2, flank, rather than just breakpoints. That's the advantage over other SV callers: it is more precise (in our paper we validate the exact sequence of our alleles against finished fosmids). HOWEVER, it does not have the power to detect very large events (with current read lengths). So it depends on what you want to be able to detect. Don't waste your time with Cortex if you want to find 200 kb het duplications or segdups etc.
Old 03-19-2012, 02:04 AM   #10
henry.wood
Member
 
Location: Leeds, UK

Join Date: Apr 2010
Posts: 63
Default

Quote:
Originally Posted by CNVboy View Post
Sounds interesting. Can you explain a little bit more?
One problem for assembly: what if, say, the deletion is heterozygous, which means there will still be some reads mapping to the deleted part?
So maybe you mean we assemble all reads "outside" the calls? Since there could be soft-clipped reads, could we then assemble them into a contig that represents the real genome structure of our sample?
thx
You're right. It's a while since I did it and I've forgotten the details. I didn't use all the reads; I only used the reads where there wasn't a perfect alignment. So I kept the pairs where one mate aligned and the other didn't, as well as the soft-clipped reads. I fiddled around with it for a little while, but then I realised I wasn't meant to be writing breakpoint algorithms, and I was only doing it in order to put off writing a talk. It cut down the list from BreakDancer quite a bit, but I never got it to outperform CREST.
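The read selection described above (pairs with only one mate aligned, plus soft-clipped reads) might look something like this on plain SAM text. This is a guess at the idea, not the original script: the function names and the toy records are made up, and only the flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped) and the `S` CIGAR operation come from the SAM format itself.

```python
import re

PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
SOFT_CLIP = re.compile(r"\d+S")


def keep_for_assembly(sam_line):
    """Return True for reads that hint at a breakpoint.

    Kept: either mate of a pair in which exactly one mate aligned,
    and any aligned read whose CIGAR string contains a soft clip.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    paired = bool(flag & PAIRED)
    read_unmapped = bool(flag & UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if paired and read_unmapped != mate_unmapped:
        return True
    return not read_unmapped and SOFT_CLIP.search(cigar) is not None


def select_reads(sam_lines):
    """Filter SAM records, skipping header lines that start with '@'."""
    return [line for line in sam_lines
            if not line.startswith("@") and keep_for_assembly(line)]


def record(name, flag, cigar):
    """Build a minimal 11-column SAM record for the toy example."""
    return "\t".join([name, str(flag), "chr1", "100", "60", cigar,
                      "=", "300", "250", "*", "*"])


# Toy records (hypothetical).
sam = [
    "@HD\tVN:1.6",
    record("r1", 99, "50M"),     # properly paired, full match: dropped
    record("r2", 73, "50M"),     # aligned, mate unmapped: kept
    record("r3", 133, "*"),      # unmapped, mate aligned: kept
    record("r4", 99, "30M20S"),  # soft-clipped: kept
]
selected = [line.split("\t")[0] for line in select_reads(sam)]
```

The kept reads would then still need to be restricted to the neighbourhood of each predicted breakpoint before being handed to an assembler such as Velvet.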
Old 04-03-2012, 02:07 PM   #11
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535
Default

Do all of these algorithms work on capture data? I'm having a hard time figuring this out.
Old 04-10-2012, 04:35 AM   #12
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41
Default

@Heisman
Speaking from my own experience, which might be incorrect: read-depth algorithms designed for WGS data tend to be very noisy on exome data because of an additional source of noise from the capture step. Especially in regions with low capture efficiency and low read depth, the variation in log ratios is considerable. I'd try the few exome-specific read-depth algorithms published recently. BreakDancer-type read-pair algorithms don't work on capture data: capture data only spans 1% or whatever portion of the genome, so simplistically speaking a given SV breakpoint is spanned by your reads with just a 1% chance. Algorithms designed for finding short indels (<50 bp) should, however, work as they are for detecting variants within your target regions.

@RockChalkJayhawk and others who have used BreakDancer
I'm running BreakDancer on some 100 GB whole-genome Illumina sequencing files. The program has been running for 2 weeks, has called 500 variants and has made it to chromosome 3. Have any of you encountered such slow running times?
Old 04-10-2012, 04:56 AM   #13
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Quote:
Originally Posted by Yilong Li View Post
@RockChalkJayhawk and others who have used BreakDancer
I'm running BreakDancer on some 100 GB whole-genome Illumina sequencing files. The program has been running for 2 weeks, has called 500 variants and has made it to chromosome 3. Have any of you encountered such slow running times?

We parallelize our runs by chromosome first. That should speed things up for you.
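A rough sketch of what per-chromosome parallelization could look like from Python follows. The command template is a placeholder: `my_sv_caller`, `--region` and `sample.bam` are hypothetical and must be replaced with whatever per-chromosome invocation your caller actually supports. The helper names are also made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(template, chromosomes):
    """Fill the {chrom} placeholder of a command template, once per chromosome."""
    return [[arg.format(chrom=c) for arg in template] for c in chromosomes]


def run_all(commands, max_workers=4, runner=subprocess.run):
    """Run the commands concurrently; return exit codes in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda cmd: runner(cmd, check=False), commands)
        return [r.returncode for r in results]


chromosomes = [f"chr{n}" for n in list(range(1, 23)) + ["X", "Y"]]

# Hypothetical template: substitute the real command line of your SV caller.
template = ["my_sv_caller", "--region", "{chrom}", "sample.bam"]
commands = build_commands(template, chromosomes)

# exit_codes = run_all(commands)  # uncomment once the template is real
```

Threads are enough here because each worker only waits on an external process. Note that splitting by chromosome may hide interchromosomal events unless the caller still sees both ends of those pairs, so that part may need a separate pass.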