![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Help: cufflinks stopped in one loci | lewewoo | Bioinformatics | 7 | 09-08-2014 02:38 AM |
SNP calling and allele specific expression by RNAseq | lewewoo | Bioinformatics | 12 | 11-25-2012 11:50 PM |
GATK UnifiedGenotyper not calling any variants | krobasky | Bioinformatics | 7 | 10-25-2012 11:19 PM |
GATK UnifiedGenotyper calling way too many SNPs in vcf | swbarnes2 | Bioinformatics | 0 | 08-17-2011 02:33 PM |
GATK UnifiedGenotyper SNP calling with Agilent 50Mb target enrichment | nullalleles | Genomic Resequencing | 1 | 07-18-2011 07:28 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Salt Lake City, UT Join Date: Jul 2011
Posts: 12
|
![]()
Hi, I've been having a little trouble (and confusion) trying to get the right set of calls out of the GATK's UnifiedGenotyper; I'm doing analysis of of known and novel SNPs, and I'd like to get calls for both:
- observed variants across my 50 samples (including novel ones) - known variant sites (even though all samples may be homozygous reference) for which there is sufficient read coverage, etc. The no-call vs homozygous reference difference is a really a big deal for my analysis. When I run with -out_mode EMIT_VARIANTS_ONLY, I only get calls for sites where my samples vary, and when I run with -out_mode EMIT_ALL_CONFIDENT_SITES, I get a huge .vcf file with tons of calls everywhere I had decent coverage. Is there a way to tell UnifiedGenotyper to call specific sites, e.g. the ones listed in the --dbsnp rod file? Or do I need to run with EMIT_ALL_CONFIDENT_SITES, and filter by hand (which I'd like to avoid)? |
![]() |
![]() |
![]() |
#2 |
Member
Location: Salt Lake City, UT Join Date: Jul 2011
Posts: 12
|
![]()
I wound up writing a Python script to filter the results of EMIT_ALL_CONFIDENT_SITES with the dbSNP .vcf file - as it's bundled with a bunch of my still-unfinished-helper-scripts I'm not going to post it publicly, but PM me if you would like a copy.
|
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
-L "chr4:21942-34289"
The -L flag in combination with EMIT_ALL_SITES should get what you want. |
![]() |
![]() |
![]() |
#4 |
Member
Location: Salt Lake City, UT Join Date: Jul 2011
Posts: 12
|
![]()
Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference. |
![]() |
![]() |
![]() |
#5 | |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]() Quote:
I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote: http://bochet.gcc.biostat.washington...agle_phase1_v3 but this script might have more stringent filtering criteria than what you have in mind, so you might need to tweak it a little bit. Good luck! |
|
![]() |
![]() |
![]() |
#6 | ||
Member
Location: Salt Lake City, UT Join Date: Jul 2011
Posts: 12
|
![]() Quote:
Quote:
|
||
![]() |
![]() |
![]() |
#7 | |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]() Quote:
Of course, some imputation can also be helpful if some of your sites have very poor coverage. |
|
![]() |
![]() |
![]() |
#8 |
Member
Location: Boston, USA Join Date: Jan 2011
Posts: 20
|
![]()
hi yasashiku,
I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience. Thanks Rama |
![]() |
![]() |
![]() |
#9 | |
Member
Location: Salt Lake City, UT Join Date: Jul 2011
Posts: 12
|
![]() Quote:
I've actually moved on to another research area entirely, so I'm probably not going to be a ton of help - I know some of these parameter names have probably already changed. My best advice is to try two calling passes; one that picks up things that vary in your data, and another EMIT_ALL_CONFIDENT_SITES run with -L someBedFile.bed, where someBedFile.bed has a ton of 1-bp-long regions for every known variant. Maybe some tool out there will convert the .vcf file containing known sites into a .bed file? Finally, merge the two .vcf files, and be warned that some of the lines will be redundant. A cleaner idea would be to run once with EMIT_ALL_CONFIDENT_SITES if you have enough space on your hard drive (the result will be massive), and filter it immediately after using your own script. In case it's useful, I've posted the scripts I wrote back when I was playing with NGS data here. None of them will help you directly, but if you feel like hacking my python code, filterVCF.py might get you close. If you're uncomfortable writing/hacking scripts, Data Wrangler or OpenRefine (formerly Google Refine) might be able to help. |
|
![]() |
![]() |
![]() |
#10 |
Member
Location: Boston, USA Join Date: Jan 2011
Posts: 20
|
![]()
Hi yasashiku,
Thanks a bunch for your kind reply. I am going to try your 1st suggestion as this seem to be the most common according to the GATK forums. Rama |
![]() |
![]() |
![]() |
Tags |
gatk, unifiedgenotyper |
Thread Tools | |
|
|