Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SNPSVM : An accurate, machine-learning based variant caller

    Howdy all,
    I've been working on a support-vector machine based variant caller, and it's finally at the point where others may find it useful. It calls variants from .BAM or .SAM files in a manner similar to samtools / mpileup or the GATK, and does so with greater accuracy than most other variants callers.
    Here's a few quick facts:
    • More accurate than GATK, samtools, or FreeBayes
    • SNPS only right now
    • Parallelizes easily
    • Uses a support-vector machine to separate true from false positive variants
    • About as fast as the GATK, very low memory requirements
    • Can 'learn' from additional training data and become more accurate over time


    Source code and downloads available from github

    Feel free to check out an early version of the manuscript

    One requirement : It depends on libsvm. (libsvm is also available in many linux repositories, try your package manager first)

    Happy SNP calling.

  • #2
    Can SNPSVM be used to call SNPs in multiple samples? If so, I'll include it in the comparisons I've developed for multi-sample variant detection.

    When you say "more accurate" do you mean lower FP rate? Is the sensitivity comparable to these other methods?

    Comment


    • #3
      SNPSVM has fewer false positives per true positive call than other tools - about 50% fewer for the data sets I've been examining (mostly NA12878). For pretty much any level of sensitivity it has greater specificity. The manuscript has some ROC curves I've constructed for SNPSVM vs. GATK, samtools, and FreeBayes that provide a nice visual comparison.

      Comment


      • #4
        Oh! I see. I should have examined the manuscript. Nice work!

        A few comments, mostly focused on the comparison in the draft:

        When comparing the results of the various variant callers, how did you handle variants that weren't called as SNPs (such as MNPs, or small complex variants)? As these considered false negatives?

        I think the y-axis should be "sensitivity," not "specificity." Is that right?

        If you are going to use freebayes in the research, please cite it using the current preprint version (http://arxiv.org/abs/1207.3907).

        Comment


        • #5
          Thanks - indeed the labels on that figure should be switched (I always reverse 'em...), and I'll be sure to cite FreeBayes.
          I should have mentioned there's a default 'model' file on the github site that seems to do a fine job for Illumina data, so there's no need to generate a new model if you just want to try it. The model is based on about 35,000 true / false calls across a handful of Illumina exome samples and probably yields somewhat more accurate results than those given in the manuscript, since it includes NA12878 in the training data.

          Comment


          • #6
            rna-seq data?

            do you think the calling will work also with RNA-seq data, perhaps after building a own model from RNA-seq data.

            it seems the program need also an fa-index (which i prepared by samtools faidx).
            how should the BAM file be organized (i.e. sorted)?

            Nevertheless, I got following error:
            Code:
            java -Xmx10g -jar $SNPVSM/snpsvm.jar predict -R $fa -B $BAM -M $model -V $outdir/WBC_rum2.vcf
            Loading module class snpsvm.app.Predictor
            Scanning contigs in fasta file....done, found 83 contigs
            Exception in thread "main" java.lang.IllegalArgumentException: Could not find contig 2 in reference
            at snpsvm.app.Predictor.validateIntervals(Predictor.java:158)
            at snpsvm.app.Predictor.callSNPs(Predictor.java:178)
            at snpsvm.app.Predictor.performOperation(Predictor.java:132)
            at snpsvm.app.CommandLineApp.main(CommandLineApp.java:55)

            the fasta.fai looks like this:
            GL000207.1 4262 71 60 61
            GL000226.1 15008 4477 60 61
            GL000229.1 19913 19808 60 61
            GL000231.1 27386 40125 60 61
            GL000210.1 27682 68040 60 61
            GL000239.1 33824 96256 60 61
            GL000235.1 34474 130716 60 61
            GL000201.1 36148 165837 60 61
            GL000247.1 36422 202660 60 61
            GL000245.1 36651 239762 60 61
            GL000197.1 37175 277096 60 61
            GL000203.1 37498 314963 60 61
            GL000246.1 38154 353158 60 61
            GL000249.1 38502 392020 60 61
            GL000196.1 38914 431236 60 61
            GL000248.1 39786 470871 60 61
            GL000244.1 39929 511393 60 61
            GL000238.1 39939 552060 60 61
            GL000202.1 40103 592737 60 61
            GL000234.1 40531 633581 60 61
            GL000232.1 40652 674860 60 61
            GL000206.1 41001 716262 60 61
            GL000240.1 41933 758019 60 61
            GL000236.1 41934 800723 60 61
            GL000241.1 42152 843428 60 61
            GL000243.1 43341 886355 60 61
            GL000242.1 43523 930491 60 61
            GL000230.1 43691 974812 60 61
            GL000237.1 45867 1019304 60 61
            GL000233.1 45941 1066008 60 61
            GL000204.1 81310 1112787 60 61
            GL000198.1 90085 1195525 60 61
            GL000208.1 92689 1287184 60 61
            GL000191.1 106433 1381491 60 61
            GL000227.1 128374 1489771 60 61
            GL000228.1 129120 1620358 60 61
            GL000214.1 137718 1751703 60 61
            GL000221.1 155397 1891790 60 61
            GL000209.1 159169 2049850 60 61
            GL000218.1 161147 2211745 60 61
            GL000220.1 161802 2375651 60 61
            GL000213.1 164239 2540223 60 61
            GL000211.1 166566 2707273 60 61
            GL000199.1 169874 2876689 60 61
            GL000217.1 172149 3049468 60 61
            GL000216.1 172294 3224560 60 61
            GL000215.1 172545 3399799 60 61
            GL000205.1 174588 3575293 60 61
            GL000219.1 179198 3752864 60 61
            GL000224.1 179693 3935122 60 61
            GL000223.1 180455 4117883 60 61
            GL000195.1 182896 4301419 60 61
            GL000212.1 186858 4487437 60 61
            GL000222.1 186861 4677483 60 61
            GL000200.1 187035 4867532 60 61
            GL000193.1 189789 5057758 60 61
            GL000194.1 191469 5250784 60 61
            GL000225.1 211173 5445518 60 61
            GL000192.1 547496 5660284 60 61
            MT 16569 6216959 60 61
            Y 59373566 6233866 60 61
            22 51304566 66597049 60 61
            21 48129895 118756749 60 61
            19 59128983 167688866 60 61
            20 63025520 227803390 60 61
            18 78077248 291879393 60 61
            17 81195210 371257986 60 61
            16 90354753 453806507 60 61
            15 102531392 545667231 60 61
            14 107349540 649907538 60 61
            13 115169878 759046295 60 61
            9 141213431 876135727 60 61
            12 133851895 1019702774 60 61
            11 135006516 1155785592 60 61
            10 135534747 1293042275 60 61
            8 146364022 1430835991 60 61
            X 155270560 1579639470 60 61
            7 159138663 1737497929 60 61
            6 171115067 1899288960 60 61
            5 180915260 2073256001 60 61
            4 191154276 2257186572 60 61
            3 198022430 2451526809 60 61
            1 249250621 2652849669 60 61
            2 243199373 2906254524 60 61

            Comment


            • #7
              Hi there,
              I haven't tried to do anything with RNA-seq yet, so I'm not sure how well it would work. I think you're right that training on a RNA-seq specific dataset would probably be the best idea.
              Looks like there was a bug in the interval-validation stuff that cropped up in runs without intervals specified. I think I fixed it - try the new version on github and it should work ok now.

              Comment


              • #8
                Another thought about the figure in the paper. I've done a lot of testing with GATK, freebayes, and samtools, and I've never seen samtools generate a curve like that in the figure.

                Also, I'm certain you'll get drop-outs from MNPs and small haplotypes at a rate that basically explains the difference between freebayes and GATK in sensitivity (2-3%). If you ran the freebayes output through vcfallelicprimitives you'll get the decomposed SNP in a format consistent with the other callers. The different callers have different default definitions of what a variant is.

                Comment


                • #9
                  Thanks again, ekg. The funkiness in the samtools ROC curve was due to indels that I thought I had removed, but actually hadn't. It's corrected now and overall the curve seems very similar to FreeBayes and the UnifiedGenotyper - funny how little variation there is in those tools for single-sample SNP calling.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  18 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  22 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  17 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  48 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X