SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
variant caller that can accomodate mosaicism? ymc Bioinformatics 0 08-21-2012 05:46 PM
any recommandation on variant caller for SOLiD data ? pandafengye Genomic Resequencing 1 04-25-2012 08:22 AM
[NGS - analysis of gene expression data] Machine Learning + RNAseq data Chuckytah Bioinformatics 7 03-05-2012 04:16 AM
PhD position in machine learning and oncoviral genomics at CBS, Denmark tsp Academic/Non-Profit Jobs 0 12-10-2011 01:53 AM
stanford's machine learning applied in bioinformatics delinquentme Bioinformatics 6 11-23-2011 06:19 AM

Reply
 
Thread Tools
Old 12-18-2012, 11:25 AM   #1
brofallon
Member
 
Location: United States

Join Date: May 2011
Posts: 26
Default SNPSVM : An accurate, machine-learning based variant caller

Howdy all,
I've been working on a support-vector machine based variant caller, and it's finally at the point where others may find it useful. It calls variants from .BAM or .SAM files in a manner similar to samtools / mpileup or the GATK, and does so with greater accuracy than most other variants callers.
Here's a few quick facts:
  • More accurate than GATK, samtools, or FreeBayes
  • SNPS only right now
  • Parallelizes easily
  • Uses a support-vector machine to separate true from false positive variants
  • About as fast as the GATK, very low memory requirements
  • Can 'learn' from additional training data and become more accurate over time

Source code and downloads available from github

Feel free to check out an early version of the manuscript

One requirement : It depends on libsvm. (libsvm is also available in many linux repositories, try your package manager first)

Happy SNP calling.
brofallon is offline   Reply With Quote
Old 12-19-2012, 04:01 AM   #2
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

Can SNPSVM be used to call SNPs in multiple samples? If so, I'll include it in the comparisons I've developed for multi-sample variant detection.

When you say "more accurate" do you mean lower FP rate? Is the sensitivity comparable to these other methods?
ekg is offline   Reply With Quote
Old 12-19-2012, 07:12 AM   #3
brofallon
Member
 
Location: United States

Join Date: May 2011
Posts: 26
Default

SNPSVM has fewer false positives per true positive call than other tools - about 50% fewer for the data sets I've been examining (mostly NA12878). For pretty much any level of sensitivity it has greater specificity. The manuscript has some ROC curves I've constructed for SNPSVM vs. GATK, samtools, and FreeBayes that provide a nice visual comparison.
brofallon is offline   Reply With Quote
Old 12-19-2012, 07:34 AM   #4
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

Oh! I see. I should have examined the manuscript. Nice work!

A few comments, mostly focused on the comparison in the draft:

When comparing the results of the various variant callers, how did you handle variants that weren't called as SNPs (such as MNPs, or small complex variants)? As these considered false negatives?

I think the y-axis should be "sensitivity," not "specificity." Is that right?

If you are going to use freebayes in the research, please cite it using the current preprint version (http://arxiv.org/abs/1207.3907).
ekg is offline   Reply With Quote
Old 12-19-2012, 07:56 AM   #5
brofallon
Member
 
Location: United States

Join Date: May 2011
Posts: 26
Default

Thanks - indeed the labels on that figure should be switched (I always reverse 'em...), and I'll be sure to cite FreeBayes.
I should have mentioned there's a default 'model' file on the github site that seems to do a fine job for Illumina data, so there's no need to generate a new model if you just want to try it. The model is based on about 35,000 true / false calls across a handful of Illumina exome samples and probably yields somewhat more accurate results than those given in the manuscript, since it includes NA12878 in the training data.
brofallon is offline   Reply With Quote
Old 12-19-2012, 10:00 AM   #6
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default rna-seq data?

do you think the calling will work also with RNA-seq data, perhaps after building a own model from RNA-seq data.

it seems the program need also an fa-index (which i prepared by samtools faidx).
how should the BAM file be organized (i.e. sorted)?

Nevertheless, I got following error:
Code:
java -Xmx10g -jar $SNPVSM/snpsvm.jar predict -R $fa -B $BAM -M $model -V $outdir/WBC_rum2.vcf
Loading module class snpsvm.app.Predictor
Scanning contigs in fasta file....done, found 83 contigs
Exception in thread "main" java.lang.IllegalArgumentException: Could not find contig 2 in reference
at snpsvm.app.Predictor.validateIntervals(Predictor.java:158)
at snpsvm.app.Predictor.callSNPs(Predictor.java:178)
at snpsvm.app.Predictor.performOperation(Predictor.java:132)
at snpsvm.app.CommandLineApp.main(CommandLineApp.java:55)

the fasta.fai looks like this:
GL000207.1 4262 71 60 61
GL000226.1 15008 4477 60 61
GL000229.1 19913 19808 60 61
GL000231.1 27386 40125 60 61
GL000210.1 27682 68040 60 61
GL000239.1 33824 96256 60 61
GL000235.1 34474 130716 60 61
GL000201.1 36148 165837 60 61
GL000247.1 36422 202660 60 61
GL000245.1 36651 239762 60 61
GL000197.1 37175 277096 60 61
GL000203.1 37498 314963 60 61
GL000246.1 38154 353158 60 61
GL000249.1 38502 392020 60 61
GL000196.1 38914 431236 60 61
GL000248.1 39786 470871 60 61
GL000244.1 39929 511393 60 61
GL000238.1 39939 552060 60 61
GL000202.1 40103 592737 60 61
GL000234.1 40531 633581 60 61
GL000232.1 40652 674860 60 61
GL000206.1 41001 716262 60 61
GL000240.1 41933 758019 60 61
GL000236.1 41934 800723 60 61
GL000241.1 42152 843428 60 61
GL000243.1 43341 886355 60 61
GL000242.1 43523 930491 60 61
GL000230.1 43691 974812 60 61
GL000237.1 45867 1019304 60 61
GL000233.1 45941 1066008 60 61
GL000204.1 81310 1112787 60 61
GL000198.1 90085 1195525 60 61
GL000208.1 92689 1287184 60 61
GL000191.1 106433 1381491 60 61
GL000227.1 128374 1489771 60 61
GL000228.1 129120 1620358 60 61
GL000214.1 137718 1751703 60 61
GL000221.1 155397 1891790 60 61
GL000209.1 159169 2049850 60 61
GL000218.1 161147 2211745 60 61
GL000220.1 161802 2375651 60 61
GL000213.1 164239 2540223 60 61
GL000211.1 166566 2707273 60 61
GL000199.1 169874 2876689 60 61
GL000217.1 172149 3049468 60 61
GL000216.1 172294 3224560 60 61
GL000215.1 172545 3399799 60 61
GL000205.1 174588 3575293 60 61
GL000219.1 179198 3752864 60 61
GL000224.1 179693 3935122 60 61
GL000223.1 180455 4117883 60 61
GL000195.1 182896 4301419 60 61
GL000212.1 186858 4487437 60 61
GL000222.1 186861 4677483 60 61
GL000200.1 187035 4867532 60 61
GL000193.1 189789 5057758 60 61
GL000194.1 191469 5250784 60 61
GL000225.1 211173 5445518 60 61
GL000192.1 547496 5660284 60 61
MT 16569 6216959 60 61
Y 59373566 6233866 60 61
22 51304566 66597049 60 61
21 48129895 118756749 60 61
19 59128983 167688866 60 61
20 63025520 227803390 60 61
18 78077248 291879393 60 61
17 81195210 371257986 60 61
16 90354753 453806507 60 61
15 102531392 545667231 60 61
14 107349540 649907538 60 61
13 115169878 759046295 60 61
9 141213431 876135727 60 61
12 133851895 1019702774 60 61
11 135006516 1155785592 60 61
10 135534747 1293042275 60 61
8 146364022 1430835991 60 61
X 155270560 1579639470 60 61
7 159138663 1737497929 60 61
6 171115067 1899288960 60 61
5 180915260 2073256001 60 61
4 191154276 2257186572 60 61
3 198022430 2451526809 60 61
1 249250621 2652849669 60 61
2 243199373 2906254524 60 61
dietmar13 is offline   Reply With Quote
Old 12-19-2012, 02:29 PM   #7
brofallon
Member
 
Location: United States

Join Date: May 2011
Posts: 26
Default

Hi there,
I haven't tried to do anything with RNA-seq yet, so I'm not sure how well it would work. I think you're right that training on a RNA-seq specific dataset would probably be the best idea.
Looks like there was a bug in the interval-validation stuff that cropped up in runs without intervals specified. I think I fixed it - try the new version on github and it should work ok now.
brofallon is offline   Reply With Quote
Old 12-20-2012, 01:55 AM   #8
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

Another thought about the figure in the paper. I've done a lot of testing with GATK, freebayes, and samtools, and I've never seen samtools generate a curve like that in the figure.

Also, I'm certain you'll get drop-outs from MNPs and small haplotypes at a rate that basically explains the difference between freebayes and GATK in sensitivity (2-3%). If you ran the freebayes output through vcfallelicprimitives you'll get the decomposed SNP in a format consistent with the other callers. The different callers have different default definitions of what a variant is.
ekg is offline   Reply With Quote
Old 12-21-2012, 09:46 AM   #9
brofallon
Member
 
Location: United States

Join Date: May 2011
Posts: 26
Default

Thanks again, ekg. The funkiness in the samtools ROC curve was due to indels that I thought I had removed, but actually hadn't. It's corrected now and overall the curve seems very similar to FreeBayes and the UnifiedGenotyper - funny how little variation there is in those tools for single-sample SNP calling.
brofallon is offline   Reply With Quote
Reply

Tags
machine learning, snp, support vector machine, variant calling

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:25 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO