SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Help: cufflinks stopped in one loci lewewoo Bioinformatics 7 09-08-2014 01:38 AM
SNP calling and allele specific expression by RNAseq lewewoo Bioinformatics 12 11-25-2012 10:50 PM
GATK UnifiedGenotyper not calling any variants krobasky Bioinformatics 7 10-25-2012 10:19 PM
GATK UnifiedGenotyper calling way too many SNPs in vcf swbarnes2 Bioinformatics 0 08-17-2011 01:33 PM
GATK UnifiedGenotyper SNP calling with Agilent 50Mb target enrichment nullalleles Genomic Resequencing 1 07-18-2011 06:28 AM

Reply
 
Thread Tools
Old 03-26-2012, 12:31 PM   #1
yasashiku
Member
 
Location: Salt Lake City, UT

Join Date: Jul 2011
Posts: 12
Default Calling specific loci with UnifiedGenotyper

Hi, I've been having a little trouble (and confusion) trying to get the right set of calls out of the GATK's UnifiedGenotyper; I'm doing analysis of of known and novel SNPs, and I'd like to get calls for both:
- observed variants across my 50 samples (including novel ones)
- known variant sites (even though all samples may be homozygous reference) for which there is sufficient read coverage, etc.
The no-call vs homozygous reference difference is a really a big deal for my analysis. When I run with -out_mode EMIT_VARIANTS_ONLY, I only get calls for sites where my samples vary, and when I run with -out_mode EMIT_ALL_CONFIDENT_SITES, I get a huge .vcf file with tons of calls everywhere I had decent coverage. Is there a way to tell UnifiedGenotyper to call specific sites, e.g. the ones listed in the --dbsnp rod file? Or do I need to run with EMIT_ALL_CONFIDENT_SITES, and filter by hand (which I'd like to avoid)?
yasashiku is offline   Reply With Quote
Old 05-10-2012, 06:11 AM   #2
yasashiku
Member
 
Location: Salt Lake City, UT

Join Date: Jul 2011
Posts: 12
Default

I wound up writing a Python script to filter the results of EMIT_ALL_CONFIDENT_SITES with the dbSNP .vcf file - as it's bundled with a bunch of my still-unfinished-helper-scripts I'm not going to post it publicly, but PM me if you would like a copy.
yasashiku is offline   Reply With Quote
Old 05-11-2012, 02:36 AM   #3
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

-L "chr4:21942-34289"

The -L flag in combination with EMIT_ALL_SITES should get what you want.
ymc is offline   Reply With Quote
Old 05-11-2012, 06:13 AM   #4
yasashiku
Member
 
Location: Salt Lake City, UT

Join Date: Jul 2011
Posts: 12
Default

Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.
yasashiku is offline   Reply With Quote
Old 05-14-2012, 02:30 PM   #5
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Quote:
Originally Posted by yasashiku View Post
Thanks for the response, but that just filters to a specific region (which I don't care about) - what I want are calls for every locus that is variant in my data, as well as calls for every locus that is known to be variant, irrelevant of position. These could be anywhere across the genome.
Perhaps it would make sense to phrase it differently: when I query a specific rs number to see what my samples had at that location, very often there isn't even a line for it in the .vcf file. If I check the coverage at that location, it's more than ample (and running with EMIT_ALL_CONFIDENT_SITES does give me a call for the variant with high quality scores), but it wasn't included in the .vcf file using EMIT_VARIANTS_ONLY because all my samples are homozygous reference for that locus. It's really important for me to be able to query a known variant and know the difference between a true no-call (where there isn't enough data) and something that wasn't called because everyone was homozygous reference.
Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?

I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote:

http://bochet.gcc.biostat.washington...agle_phase1_v3

but this script might have more stringent filtering criteria than what you have in mind, so you might need to tweak it a little bit. Good luck!
ymc is offline   Reply With Quote
Old 05-15-2012, 01:45 PM   #6
yasashiku
Member
 
Location: Salt Lake City, UT

Join Date: Jul 2011
Posts: 12
Default

Quote:
Originally Posted by ymc View Post
Do you mean if you want to print every site that has a >0 minor allele frequency among your samples?
Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.

Quote:
I don't know if GATK has native support for this or not. But BEAGLE author did exactly that with this script he wrote...
What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.
yasashiku is offline   Reply With Quote
Old 05-15-2012, 06:08 PM   #7
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Quote:
Originally Posted by yasashiku View Post
Not exactly; I want every site that has a >0 minor allele frequency among my samples AND every site in my DBSNP rod file that has =0 minor allele frequency among my samples, but enough coverage to make a call.


What's the point of imputation when I can get the calls directly? GATK just suppresses them in the output by default because none of my samples are variant - this is the behavior I'm trying to get around.
I am not telling you tell do an imputation. I am telling you that the BEAGLE author wrote a script to convert 1000genomes data into his BEAGLE format. In the process, I think he is doing approximately what you want.

Of course, some imputation can also be helpful if some of your sites have very poor coverage.
ymc is offline   Reply With Quote
Old 09-23-2013, 11:52 AM   #8
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

hi yasashiku,

I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

Thanks
Rama
rama is offline   Reply With Quote
Old 09-23-2013, 02:43 PM   #9
yasashiku
Member
 
Location: Salt Lake City, UT

Join Date: Jul 2011
Posts: 12
Default

Quote:
Originally Posted by rama View Post
hi yasashiku,

I am dealing with the same issue and came across your post hear while searching for help. so wondering if you ever found the way to address this problem. if so, could you please share your experience.

Thanks
Rama
Hi Rama,
I've actually moved on to another research area entirely, so I'm probably not going to be a ton of help - I know some of these parameter names have probably already changed. My best advice is to try two calling passes; one that picks up things that vary in your data, and another EMIT_ALL_CONFIDENT_SITES run with -L someBedFile.bed, where someBedFile.bed has a ton of 1-bp-long regions for every known variant. Maybe some tool out there will convert the .vcf file containing known sites into a .bed file? Finally, merge the two .vcf files, and be warned that some of the lines will be redundant.

A cleaner idea would be to run once with EMIT_ALL_CONFIDENT_SITES if you have enough space on your hard drive (the result will be massive), and filter it immediately after using your own script. In case it's useful, I've posted the scripts I wrote back when I was playing with NGS data here. None of them will help you directly, but if you feel like hacking my python code, filterVCF.py might get you close. If you're uncomfortable writing/hacking scripts, Data Wrangler or OpenRefine (formerly Google Refine) might be able to help.
yasashiku is offline   Reply With Quote
Old 09-24-2013, 09:16 AM   #10
rama
Member
 
Location: Boston, USA

Join Date: Jan 2011
Posts: 20
Default

Hi yasashiku,

Thanks a bunch for your kind reply. I am going to try your 1st suggestion as this seem to be the most common according to the GATK forums.

Rama
rama is offline   Reply With Quote
Reply

Tags
gatk, unifiedgenotyper

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:12 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO