
  • GATK with non-model organism (Help with making SNP VCF file)

    Hi

    Has anyone tried using GATK with a non-model organism? If so, could you kindly tell me the format of the VCF files you are using? I was reading a few posts on the GATK forum and got some details about making a VCF file. Below is a snapshot of my VCF file.

    ##fileformat=VCFv4.0
    #CHROM POS ID REF ALT QUAL FILTER INFO
    1 11613 . C A . PASS .
    1 12971 . T G . PASS .
    1 13003 . T A . PASS .
    1 13032 . A G . PASS .

    GATK does make the Tribble index file, but at the end it throws an error saying that "The provided VCF file is malformed at approximately line number 56981: The VCF specification does not allow for whitespace in the INFO"

    The INFO column has ".", which I found others had used when making a VCF file. Do I have to run the file through a VCF validator? I do know VCFtools, but I think this file will not be able to pass it.
    I would appreciate any help, as I am stuck and need to complete this.

    Thanks
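The error quoted above points at a specific line (56981), not at the "." placeholder itself, so one way to narrow it down is to scan the file for data lines that are not cleanly tab-separated or whose INFO field contains whitespace. The following is a minimal sketch under that assumption; the function name `find_malformed_lines` and the example data are made up for illustration, and it is not a substitute for a full validator.

```python
import io


def find_malformed_lines(handle, expected_columns=8):
    """Yield (line_number, reason) for VCF data lines that look malformed.

    Only checks two things: that each data line has at least
    `expected_columns` tab-separated fields, and that the INFO field
    (column 8) is non-empty and free of whitespace.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            continue  # meta-information and header lines
        if not line.strip():
            yield line_number, "blank line"
            continue
        fields = line.split("\t")
        if len(fields) < expected_columns:
            yield line_number, (
                f"expected at least {expected_columns} tab-separated "
                f"columns, found {len(fields)}"
            )
            continue
        info = fields[7]
        if info == "" or any(ch.isspace() for ch in info):
            yield line_number, "INFO is empty or contains whitespace"


# Illustrative input: line 4 has a trailing space after the INFO ".",
# and line 5 is separated by spaces instead of tabs.
example = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t11613\t.\tC\tA\t.\tPASS\t.\n"
    "1\t12971\t.\tT\tG\t.\tPASS\t. \n"
    "1 13003 . T A . PASS .\n"
)
problems = list(find_malformed_lines(example))
for line_number, reason in problems:
    print(line_number, reason)
```

To check a real file, the same function can be given an open file handle, for example `with open("your.vcf") as fh: ...`.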

  • #2
    What are you trying to accomplish by writing a VCF file manually? What is your goal with GATK?



    • #3
      I need to call SNPs from a population to look at ASE. I have used GATK before, but with the changes in format and input files, it's tough to recreate it...



      • #4
        So first you can call SNPs without specifying a VCF with known mutations. Then you don't need to write a VCF file.



        • #5
          But I was looking into the protocol, and it advises against doing that unless you are an expert. Moreover, does the known-sites file not help in the validation and recalibration steps?
          Last edited by newbietonextgen; 07-09-2012, 04:32 PM.



          • #6
            Well, you can use a known list of mutations during the VQSR step (variant quality score recalibration). But if you do not have a known list, it doesn't help to make one up.

            There are two suggestions I have (and others on SEQanswers might have more):

            1. If you only have one genome, you might try calling without a known SNP file, then filter the output VCF for just the highest-quality calls, and then re-call the SNPs using the filtered list as your known list. Not perfect, I know.

            2. If you have a population of genomes you might consider calling again without a known list and then take a set of high quality calls found in many genomes from the population and make that your known set.
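Both suggestions above boil down to simple filtering of the first-pass VCF output, which can be sketched roughly as follows. This is only an illustration: the function names, the QUAL values in the example records, and the threshold of 30 are assumptions chosen for the demo, not values recommended by GATK or by anyone in this thread.

```python
from collections import Counter


def keep_high_quality(lines, min_qual):
    """Return header lines plus records whose QUAL is at least min_qual.

    Records with a missing QUAL (".") are dropped, since they cannot be ranked.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 6 or fields[5] == ".":
            continue
        if float(fields[5]) >= min_qual:
            kept.append(line)
    return kept


def shared_sites(callsets, min_samples):
    """Return sorted (chrom, pos, ref, alt) sites seen in >= min_samples call sets."""
    counts = Counter()
    for lines in callsets:
        seen = set()
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\r\n").split("\t")[:5]
            seen.add((chrom, int(pos), ref, alt))
        counts.update(seen)  # each call set counts a site at most once
    return sorted(site for site, n in counts.items() if n >= min_samples)


header = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
# Hypothetical first-pass calls from two individuals (QUAL values invented).
sample_a = header + [
    "1\t11613\t.\tC\tA\t50.0\tPASS\t.\n",
    "1\t12971\t.\tT\tG\t12.5\tPASS\t.\n",
]
sample_b = header + [
    "1\t11613\t.\tC\tA\t61.2\tPASS\t.\n",
    "1\t13003\t.\tT\tA\t70.0\tPASS\t.\n",
]

# Suggestion 1: keep only the high-quality calls from each call set.
filtered_a = keep_high_quality(sample_a, min_qual=30.0)
filtered_b = keep_high_quality(sample_b, min_qual=30.0)

# Suggestion 2: keep sites that pass in at least two individuals.
known = shared_sites([filtered_a, filtered_b], min_samples=2)
print(known)
```

The resulting site list would still need to be written back out as a sorted, tab-separated VCF before GATK could use it as a known-sites file.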

            Does anyone else have better ideas? You might also get on the GetSatisfaction page for GATK and see if they have a better suggestion.

            Hope this helps.



            • #7
              Thanks for the help. Yes, I do have two populations that I need to call SNPs for. I am sure your method works, but I will keep it as a last resort. It is such a shame that such a useful tool is limited by just one file, the VCF-format file. I did post at the GetSatisfaction forum and I was told to validate my VCF file. I am not sure how this is going to work, as most of the columns are empty... but I will try to validate my VCF. Any other suggestions, please...



              • #8
                I've been wrestling with this for a few days. You aren't actually limited to just VCF; you can use BED and others (http://gatkforums.broadinstitute.org...n/1349/tribble see part 4), but I couldn't get a BED that was used in v1 to work in v2, so I made my own VCF.

                @adaptivegenome I tried doing as you suggest, but the whole point of realigning and recalibrating is to use known SNPs to inform the process (well, that is the point for me). I have called SNPs with mpileup->VarScan, with fairly strict definitions of what a SNP is, and got good results. But the 'gold standard' of GATK seems to be called for when publishing... Also, it's a great tool theoretically.

                Ok, so here is my workflow as a hint for @newbie...

                1. Make your VCF (I have "." for INFO and it works fine).

                2. Sort and validate it: vcf-sort your.vcf > your.sorted.vcf; vcf-validator your.sorted.vcf

                3. Take your reference.fasta and remove any chromosomes/scaffolds not found in your VCF, otherwise GATK will throw an ERROR (again!).

                4. Sort the new FASTA exactly as your VCF is sorted (by chromosome/scaffold); no other order will do!

                5. Index it: igvtools index your.sorted.fasta > your.sorted.fasta.fai

                6. Try running GATK now; it should make your.sorted.dict for you and build the Tribble index.

                This seems to take a while (hence I am searching "tribble index" and am here =)
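The FASTA subsetting and reordering described in the workflow above could be sketched along these lines. This is a rough illustration only: the function names and the scaffold names in the example are hypothetical, sequences are held in memory (so it suits small references), and any description after the sequence name in a FASTA header is dropped.

```python
def contig_order_from_vcf(vcf_lines):
    """Return contig names in the order they first appear in the VCF records."""
    order = []
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.add(chrom)
            order.append(chrom)
    return order


def read_fasta(fasta_lines):
    """Return {name: [sequence lines]}; name is the first word after '>'."""
    records = {}
    name = None
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return records


def subset_and_sort_fasta(fasta_lines, vcf_lines):
    """Keep only contigs present in the VCF, emitted in the VCF's contig order."""
    records = read_fasta(fasta_lines)
    out = []
    for name in contig_order_from_vcf(vcf_lines):
        if name not in records:
            raise KeyError(f"contig {name!r} is in the VCF but not in the FASTA")
        out.append(f">{name}")
        out.extend(records[name])
    return out


# Hypothetical reference with three scaffolds, one of which has no SNPs.
fasta = [
    ">scaffold_2 some description\n", "GGCC\n",
    ">scaffold_1\n", "ACGT\n",
    ">scaffold_3\n", "TTTT\n",
]
vcf = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "scaffold_1\t2\t.\tC\tA\t.\tPASS\t.\n",
    "scaffold_2\t1\t.\tG\tT\t.\tPASS\t.\n",
    "scaffold_2\t3\t.\tC\tA\t.\tPASS\t.\n",
]
sorted_fasta = subset_and_sort_fasta(fasta, vcf)
print("\n".join(sorted_fasta))
```

The output lines can then be written to your.sorted.fasta before indexing it as in the workflow above.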

                Good luck,

                Bruce.
