#1 |
Member
Location: USA Join Date: Nov 2010
Posts: 56
Hi
Has anyone tried using GATK with a non-model organism? If so, could you kindly tell me the format of the VCF files you are using? I was reading a few posts on the GATK forum and got some details about making a VCF file. Below is a snapshot of my VCF file.

##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO
1 11613 . C A . PASS .
1 12971 . T G . PASS .
1 13003 . T A . PASS .
1 13032 . A G . PASS .

GATK does make the Tribble index file, but at the end it throws an error saying: "The provided VCF file is malformed at approximately line number 56981: The VCF specification does not allow for whitespace in the INFO". The INFO column contains ".", which I found others had used when making a VCF file. Do I have to run a VCF validator on it? I do know VCFtools, but I think this file will not pass it. I would appreciate any help, as I am stuck and need to complete this.

Thanks
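One way to narrow down an error like the one quoted above is to scan the file for data lines that are not cleanly tab-delimited or that carry stray whitespace in the INFO column. The following is only a rough sketch, not anything GATK itself provides: the function name `find_whitespace_problems` is made up for illustration, and it assumes the usual eight fixed, tab-separated VCF columns with INFO as the eighth.

```python
def find_whitespace_problems(lines):
    """Return (line_number, reason) pairs for data lines that look suspicious.

    Hypothetical helper: assumes tab-separated columns with INFO in column 8.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")  # keep any "\r" so Windows line endings are caught
        if line.startswith("#"):
            continue  # header lines are not checked here
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            problems.append(
                (number, f"expected at least 8 tab-separated columns, found {len(fields)}")
            )
            continue
        if any(ch.isspace() for ch in fields[7]):
            problems.append((number, "whitespace inside the INFO column"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_vcf.py your.vcf
    with open(sys.argv[1]) as handle:
        for number, reason in find_whitespace_problems(handle):
            print(f"line {number}: {reason}")
```

A line separated by spaces instead of tabs shows up as a column-count problem, and a trailing space or carriage return after the "." shows up as whitespace in INFO.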
#2 |
Super Moderator
Location: US Join Date: Nov 2009
Posts: 437
What are you trying to accomplish by writing a VCF file manually? What is your goal with GATK?
#3 |
Member
Location: USA Join Date: Nov 2010
Posts: 56
I need to call SNPs from a population to look at ASE. I have used GATK before, but with the changes in format and input files, it's tough to recreate what I did...
#4 |
Super Moderator
Location: US Join Date: Nov 2009
Posts: 437
So first you can call SNPs without specifying a VCF with known mutations. Then you don't need to write a VCF file.
#5 |
Member
Location: USA Join Date: Nov 2010
Posts: 56
But I was looking into the protocol, and it advises against doing that unless you are an expert. Moreover, doesn't a known-variants file help in the validation and recalibration steps?
Last edited by newbietonextgen; 07-09-2012 at 05:32 PM.
#6 |
Super Moderator
Location: US Join Date: Nov 2009
Posts: 437
Well, you can use a known list of mutations during the VQSR step (variant quality score recalibration). But if you do not have a known list, it doesn't help to make one up.
There are two suggestions I have (and others on SEQanswers might have more):
1. If you only have one genome, you might try calling without a known SNP file, then filter the output VCF for just the highest-quality calls, and then re-call the SNPs using the filtered list as your known list. Not perfect, I know.
2. If you have a population of genomes, you might consider calling, again without a known list, and then taking a set of high-quality calls found in many genomes from the population and making that your known set.
Does anyone else have better ideas? You might also get on the GetSatisfaction page for GATK and see if they have a better suggestion. Hope this helps.
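The two suggestions above could be sketched roughly as follows. This is an illustration under stated assumptions, not a GATK feature: the function names `filter_by_qual` and `sites_seen_in_many` are hypothetical, QUAL is assumed to be the sixth tab-separated column, and the quality and sample-count cutoffs are placeholders you would have to choose for your own data.

```python
from collections import Counter


def filter_by_qual(vcf_lines, min_qual):
    """Suggestion 1: keep header lines plus records whose QUAL is >= min_qual.

    Records with a missing QUAL (".") are dropped, since they cannot be ranked.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # assumed position of the QUAL column
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


def sites_seen_in_many(per_sample_sites, min_samples):
    """Suggestion 2: return sites called in at least min_samples samples.

    per_sample_sites is a list with one collection of (chrom, pos, ref, alt)
    tuples per sample, for example built from each sample's filtered calls.
    """
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(set(sites))  # count each site at most once per sample
    return sorted(site for site, n in counts.items() if n >= min_samples)
```

For a single genome, something like `filter_by_qual(lines, min_qual=50)` gives the candidate known list (the value 50 is arbitrary). For a population, filter each sample first, collect its sites, and then pass the per-sample collections to `sites_seen_in_many`.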
#7 |
Member
Location: USA Join Date: Nov 2010
Posts: 56
Thanks for the help. Yes, I do have two populations for which I need to call SNPs. I am sure your method works, but I will keep it as a last resort. It's such a shame that such a useful tool is limited by just one file, the VCF file. I did post at the GetSatisfaction forum and was told to validate my VCF file. I am not sure how this is going to work, as most of the columns are empty, but I will try to validate my VCF. Any other suggestions, please?
#8 |
Senior Member
Location: . Join Date: Mar 2011
Posts: 157
I've been wrestling with this for a few days. You aren't actually limited to just VCF; you can use BED and others (http://gatkforums.broadinstitute.org...n/1349/tribble see part 4), but I couldn't get a BED that was used in v1 to work in v2, so I made my own VCF.
@adaptivegenome I tried doing as you suggest, but the whole point of realigning and recalibrating is to use known SNPs to inform the process (well, that is the point for me). I have called SNPs with mpileup->VarScan, with fairly strict definitions of what a SNP is, and got good results. The 'gold standard' of GATK seems to be called for when publishing... Also, it's a great tool theoretically.
OK, so my workflow, as a hint for @newbie...
1. Make your VCF (I have "." for INFO and it works fine).
2. vcf-sort your.vcf > your.sorted.vcf
3. vcf-validator your.sorted.vcf
4. Take your reference.fasta and remove any chromosomes/scaffolds not found in your VCF, otherwise it will throw an ERROR (again!).
5. Sort the new FASTA exactly as your VCF is sorted (by chromosome/scaffold); no other order will do!
6. igvtools index your.sorted.fasta > your.sorted.fasta.fai
7. Try to run this now; GATK should make your.sorted.dict for you and make the Tribble index.
This seems to take a while (hence I am searching "tribble index" and am here =).
Good luck, Bruce.
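The FASTA steps in the workflow above (dropping contigs that are absent from the VCF and putting the rest in the VCF's order) might look something like this in Python. It is a minimal sketch only: the function names are invented for illustration, the whole reference is held in memory, contig names are taken as the first word after ">", and any description text on the FASTA header lines is not carried over.

```python
def contig_order_from_vcf(vcf_lines):
    """Return contig names in the order they first appear in the VCF records."""
    order = []
    seen = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.add(chrom)
            order.append(chrom)
    return order


def read_fasta(fasta_lines):
    """Map each contig name to its list of sequence lines."""
    records = {}
    name = None
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return records


def reorder_fasta(fasta_lines, vcf_lines):
    """Keep only contigs present in the VCF, written in the VCF's contig order."""
    records = read_fasta(fasta_lines)
    out = []
    for chrom in contig_order_from_vcf(vcf_lines):
        if chrom not in records:
            raise KeyError(f"contig {chrom!r} is in the VCF but not in the FASTA")
        out.append(f">{chrom}")
        out.extend(records[chrom])
    return out
```

Writing the result with `"\n".join(reorder_fasta(fasta_lines, vcf_lines))` would give the file that the workflow calls your.sorted.fasta, ready for the indexing step.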