  • Asking for vcf file when calling SNP using GATK

    Hi all,

    I sequenced the transcriptomes of 7 samples, 3 from one environment and 4 from another. I did a de novo assembly and want to call SNPs using GATK. I merged the unigenes to use as a reference, and now I plan to call SNPs in each sample. There is neither a reference genome nor a set of known SNP sites (knownSites).


    The command lines I used are listed below:


    1. java -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -R mergeunigene_ref.fa -T RealignerTargetCreator -I sample1_dedup.bam -o sample1.intervals

    2. java -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -R mergeunigene_ref.fa -T IndelRealigner -targetIntervals sample1.intervals -I sample1_dedup.bam -o sample1_deduprealn.bam

    It runs well up to here, but when I run BaseRecalibrator I get the error below:

    3. java -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -R mergeunigene_ref.fa -T BaseRecalibrator -I sample1_deduprealn7.bam -o sample1.grp
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.

    Has anybody run into this problem? Any comments are appreciated.

  • #2
    When working with a non-model organism for which little or no known variant information is available, the GATK developers recommend that you "bootstrap" your own list of known variants. Their FAQ covers this question as follows:

    I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

    The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

    However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

    • First do an initial round of SNP calling on your original, unrecalibrated data.

    • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.

    • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
    Taken from http://gatkforums.broadinstitute.org...libration-bqsr
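
As a rough illustration of the second step above (choosing the highest-confidence SNPs), here is a minimal shell sketch that keeps the VCF header plus only those records marked PASS with a QUAL at or above a cutoff. The function name `select_confident_sites` and the default cutoff of 50 are assumptions made for illustration, not values recommended by the GATK developers; choose a cutoff that suits your data.

```shell
# Keep all header lines, plus records whose FILTER column (7) is PASS
# and whose QUAL column (6) is numeric and >= the cutoff.
# The cutoff (default 50) is an arbitrary placeholder, not a GATK default.
select_confident_sites() {
    min_qual="${1:-50}"
    awk -F'\t' -v min="$min_qual" '
        /^#/ { print; next }                       # header lines must survive
        $7 == "PASS" && $6 != "." && $6 + 0 >= min + 0 { print }
    '
}

# Hypothetical usage (file names are placeholders):
#   select_confident_sites 50 < filtered_calls.vcf > known_sites.vcf
#
# The resulting file would then be passed to the recalibrator, e.g.
#   ... -T BaseRecalibrator -knownSites known_sites.vcf -o recal.grp
# followed by a second round of SNP calling on the recalibrated reads.
```

Keeping the `#` header lines matters: a file that contains only the data records is no longer a valid VCF. Note also that FILTER is only set to PASS once filters have been applied (for example with VariantFiltration); unfiltered records carry `.` and are dropped by this sketch.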

    • #3
      Thank you very much, Kmcarr!

      I read the page you recommended and found the following suggestions:

      First do an initial round of SNP calling on your original, unrecalibrated data.
      Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
      Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

      So I did the first round of SNP calling using the following command lines:
      java -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mergedunigene.fa -I sample1_indelrealn7.bam -l INFO -o sample1.vcf -stand_call_conf 10 -stand_emit_conf 30
      java -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T VariantFiltration -R mergedunigen.fa -V sample1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o samples_final.vcf

      I did get a VCF file with SNPs, but how should I use it? Should I keep only the SNPs marked PASS, or apply some other criterion? Thank you, Kmcarr!

      • #4
        Hi again,

        I am calling SNPs on 7 transcriptomes and am now running BaseRecalibrator in GATK. As there is no reference genome and no knownSites for SNPs, I called SNPs directly after IndelRealigner. After filtering, the VCF file (with only PASS sites) was used as knownSites. The command line is:

        java -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -R Acomyref.fa -T BaseRecalibrator -I sample1_indelrealn7.bam -knownSites sample1_filter.vcf -o sample1.grp

        The error is:
        ##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
        ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
        ##### ERROR Please do not post this error to the GATK forum
        ##### ERROR
        ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
        ##### ERROR Visit our website and forum for extensive documentation and answers to
        ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
        ##### ERROR
        ##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
        ##### ERROR Name FeatureType Documentation
        ##### ERROR BCF2 VariantContext http://www.broadinstitute.org/gatk/g...BCF2Codec.html
        ##### ERROR BEAGLE BeagleFeature http://www.broadinstitute.org/gatk/g...agleCodec.html
        ##### ERROR BED BEDFeature http://www.broadinstitute.org/gatk/g..._BEDCodec.html
        ##### ERROR BEDTABLE TableFeature http://www.broadinstitute.org/gatk/g...ableCodec.html
        ##### ERROR EXAMPLEBINARY Feature http://www.broadinstitute.org/gatk/g...naryCodec.html
        ##### ERROR GELITEXT GeliTextFeature http://www.broadinstitute.org/gatk/g...TextCodec.html
        ##### ERROR OLDDBSNP OldDbSNPFeature http://www.broadinstitute.org/gatk/g...bSNPCodec.html
        ##### ERROR RAWHAPMAP RawHapMapFeature http://www.broadinstitute.org/gatk/g...pMapCodec.html
        ##### ERROR REFSEQ RefSeqFeature http://www.broadinstitute.org/gatk/g...fSeqCodec.html
        ##### ERROR SAMPILEUP SAMPileupFeature http://www.broadinstitute.org/gatk/g...leupCodec.html
        ##### ERROR SAMREAD SAMReadFeature http://www.broadinstitute.org/gatk/g...ReadCodec.html
        ##### ERROR TABLE TableFeature http://www.broadinstitute.org/gatk/g...ableCodec.html
        ##### ERROR VCF VariantContext http://www.broadinstitute.org/gatk/g..._VCFCodec.html
        ##### ERROR VCF3 VariantContext http://www.broadinstitute.org/gatk/g...VCF3Codec.html

        If anybody has run into this problem, please show me how to fix it. Thanks!
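
The error message above points to two things worth checking. One is whether the file still looks like a VCF at all: if the PASS records were extracted without the `#` header lines, the type can no longer be detected from the content. The other is the explicit type tag the message asks for. The sketch below is only a guess at a diagnosis, not a confirmed fix; the helper name `has_vcf_header` is hypothetical, and the exact placement of the `:VCF` tag on `-knownSites` is an assumption based on the `:NAME` wording in the error.

```shell
# Succeed only if the first line of the file is a VCF "##fileformat" header.
has_vcf_header() {
    head -n 1 "$1" | grep -q '^##fileformat=VCF'
}

# Hypothetical usage with the file name from the post above:
#   has_vcf_header sample1_filter.vcf || echo "VCF header missing"
#
# Explicit type tag in the ":NAME" form the error message describes
# (assumed syntax, untested here):
#   -knownSites:VCF sample1_filter.vcf
```

If the header turns out to be missing, regenerating the filtered file so that the header lines are kept is likely a cleaner route than forcing the type with a tag.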
