SEQanswers

Old 07-09-2012, 03:35 PM   #1
newbietonextgen
Member
 
Location: USA

Join Date: Nov 2010
Posts: 56
GATK with non-model organism (Help with making SNP VCF file)

Hi

Has anyone tried using GATK with a non-model organism? If so, would you kindly tell me the format of the VCF files you are using? I was reading a few posts at the GATK forum and got details about making a VCF file. Below is a snapshot of my VCF file.

##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO
1 11613 . C A . PASS .
1 12971 . T G . PASS .
1 13003 . T A . PASS .
1 13032 . A G . PASS .

GATK does make the Tribble index file, but at the end it throws an error saying: "The provided VCF file is malformed at approximately line number 56981: The VCF specification does not allow for whitespace in the INFO"

The INFO column has ".", which I found others have used when making a VCF file. Do I have to validate it with a VCF validator? I know of VCFtools, but I think this file will not be able to pass it.
I would appreciate any help, as I am stuck and need to complete this.
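
Since the error complains about whitespace, one thing worth ruling out is that some lines are separated by spaces rather than tabs. The following is only a minimal sketch of such a check, not part of GATK or VCFtools: the function name find_whitespace_problems is made up for illustration, and it assumes the eight fixed VCF columns shown above and only inspects data lines, not the header.

```python
# Names of the eight fixed VCF columns, in order.
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def find_whitespace_problems(lines):
    """Yield (line_number, message) for data lines that are not cleanly tab-delimited.

    Hypothetical helper for illustration; `lines` is any iterable of text lines,
    for example an open file handle.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        if len(fields) < len(VCF_COLUMNS):
            # Typical symptom of a space-delimited line: everything lands in one field.
            yield number, "expected at least 8 tab-separated columns, found %d" % len(fields)
            continue
        for name, value in zip(VCF_COLUMNS, fields):
            if any(ch.isspace() for ch in value):
                yield number, "whitespace inside %s column: %r" % (name, value)


# Small in-memory demonstration; a real file handle works the same way.
sample = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t11613\t.\tC\tA\t.\tPASS\t.\n",  # tab-delimited
    "1 12971 . T G . PASS .\n",          # space-delimited
]
for number, message in find_whitespace_problems(sample):
    print("line %d: %s" % (number, message))
```

To run it on a real file, pass an open handle instead of the sample list, e.g. find_whitespace_problems(open("your.vcf")), and look at the lines it reports near the line number GATK mentions.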

Thanks
Old 07-09-2012, 03:49 PM   #2
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

What are you trying to accomplish by writing a VCF file manually? What is your goal with GATK?
Old 07-09-2012, 03:51 PM   #3
newbietonextgen
Member
 
Location: USA

Join Date: Nov 2010
Posts: 56

I need to call SNPs from a population to look at ASE. I have used GATK before, but with the changes in format and input files, it's tough to recreate what I did...
Old 07-09-2012, 03:59 PM   #4
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

So first you can call SNPs without specifying a VCF with known mutations. Then you don't need to write a VCF file.
Old 07-09-2012, 04:29 PM   #5
newbietonextgen
Member
 
Location: USA

Join Date: Nov 2010
Posts: 56

But I was looking into the protocol, and it advises against doing that unless you are an expert. Moreover, does a known-SNP file not help in the validation and recalibration steps?

Last edited by newbietonextgen; 07-09-2012 at 04:32 PM.
Old 07-09-2012, 04:34 PM   #6
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Well, you can use a known list of mutations during the VQSR step (variant quality score recalibration). But if you do not have a known list, it doesn't help to make one up.

There are two suggestions I have (and others on SEQanswers might have more):

1. If you only have one genome, you might try calling without a known SNP file, then filter the output VCF for just the highest-quality calls, and then re-call the SNPs using the filtered list as your known list. Not perfect, I know.

2. If you have a population of genomes, you might likewise call without a known list, then take a set of high-quality calls found in many genomes from the population and make that your known set.
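
Both suggestions boil down to filtering first-pass calls. The sketch below illustrates the idea only; it is not a GATK feature. The function names high_quality_sites and shared_sites and the thresholds used in the example are made up, and it assumes tab-delimited VCF records with a numeric QUAL column.

```python
from collections import Counter


def high_quality_sites(vcf_lines, min_qual):
    """Return the set of (chrom, pos, ref, alt) whose QUAL is at least min_qual.

    Hypothetical helper: records with QUAL "." are skipped because there is
    nothing to judge them by.
    """
    sites = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/meta lines and blank lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\r\n").split("\t")[:6]
        if qual == ".":
            continue
        if float(qual) >= min_qual:
            sites.add((chrom, int(pos), ref, alt))
    return sites


def shared_sites(per_sample_sites, min_samples):
    """Return sites seen in at least min_samples of the given per-sample sets."""
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(sites)  # each set contributes at most 1 per site
    # Sorted by contig name (as text) and position; re-sort to match your
    # reference order before giving the result to any tool that requires it.
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Example with two tiny in-memory call sets; the cutoffs 30 and 2 are arbitrary.
sample_a = ["1\t100\t.\tC\tA\t50\tPASS\t.\n", "1\t200\t.\tT\tG\t10\tPASS\t.\n"]
sample_b = ["1\t100\t.\tC\tA\t99.5\tPASS\t.\n", "1\t300\t.\tA\tG\t.\tPASS\t.\n"]
candidates = shared_sites(
    [high_quality_sites(sample_a, 30), high_quality_sites(sample_b, 30)],
    min_samples=2,
)
print(candidates)
```

For a single genome, high_quality_sites alone covers suggestion 1. What counts as "high quality" and "many genomes" depends on your data, so treat the cutoffs as placeholders.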

Does anyone else have better ideas? You might also get on the GetSatisfaction page for GATK and see if they have a better suggestion.

Hope this helps.
Old 07-09-2012, 04:43 PM   #7
newbietonextgen
Member
 
Location: USA

Join Date: Nov 2010
Posts: 56

Thanks for the help. Yes, I do have two populations for which I need to call SNPs. I am sure your method works, but I will keep it as a last resort. It's such a shame that such a useful tool is limited by just one file, the VCF-format file. I did post at the GetSatisfaction forum and was told to validate my VCF file. I am not sure how this is going to work, as most of the columns are empty, but I will try to validate my VCF. Any other suggestions, please?
Old 09-10-2012, 07:59 AM   #8
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157

I've been wrestling with this for a few days. You aren't actually limited to just VCF; you can use BED and others (http://gatkforums.broadinstitute.org...n/1349/tribble see part 4), but I couldn't get a BED that was used in v1 to work in v2, so I made my own VCF.

@adaptivegenome I tried doing as you suggest, but the whole point of realigning and recalibrating is to use known SNPs to inform the process (well, that is the point for me). I have called SNPs with mpileup->VarScan, with fairly strict definitions of what a SNP is, and got good results. The 'gold standard' of GATK seems to be called for when publishing... Also, it's a great tool theoretically.

OK, so here is my workflow as a hint for @newbie:

1. Make your VCF (I have "." for INFO and it works fine).
2. vcf-sort your.vcf > your.sorted.vcf
3. vcf-validator your.sorted.vcf
4. Take your reference.fasta and remove any chromosomes/scaffolds not found in your VCF, otherwise GATK will throw an ERROR (again!).
5. Sort the new FASTA exactly as your VCF is sorted (by chromosome/scaffold); no other order will do!
6. igvtools index your.sorted.fasta > your.sorted.fasta.fai
7. Try running GATK now; it should make your.sorted.dict for you and make the Tribble index.

Making the index seems to take a while (hence I am searching "tribble index" and am here =)
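
The two FASTA steps in this workflow (dropping sequences that are not in the VCF and putting the rest in the VCF's order) can be scripted. What follows is a rough sketch under stated assumptions, not something shipped with GATK or VCFtools: the function names are made up, the VCF is assumed to be already sorted so that each contig first appears in the wanted order, and the whole FASTA is held in memory, which may not suit a large genome.

```python
def contig_order_from_vcf(vcf_lines):
    """Return contig names in the order they first appear in the VCF data lines."""
    order, seen = [], set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/meta lines and blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.add(chrom)
            order.append(chrom)
    return order


def read_fasta(fasta_lines):
    """Map each sequence name (first word after '>') to its lines, header included."""
    records, name = {}, None
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = [line]
        elif name is not None and line:
            records[name].append(line)
    return records


def subset_and_reorder(fasta_lines, vcf_lines):
    """Return FASTA lines for only the contigs in the VCF, in the VCF's order."""
    records = read_fasta(fasta_lines)
    out = []
    for chrom in contig_order_from_vcf(vcf_lines):
        if chrom not in records:
            raise KeyError("contig %r is in the VCF but not in the FASTA" % chrom)
        out.extend(records[chrom])
    return out


# Tiny in-memory example; chrUn has no variants, so it is dropped.
fasta = [">chr2 desc\n", "ACGT\n", ">chr1\n", "GGCC\n", "TT\n", ">chrUn\n", "NNNN\n"]
vcf = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t5\t.\tC\tA\t.\tPASS\t.\n",
    "chr2\t7\t.\tT\tG\t.\tPASS\t.\n",
]
print("\n".join(subset_and_reorder(fasta, vcf)))
```

Write the returned lines to a new file (your.sorted.fasta in the naming above) and index that file afterwards.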

Good luck,

Bruce.
All times are GMT -8.

