SEQanswers

Old 04-05-2011, 07:18 AM   #1
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41
"Obvious variant" missed by GATK UnifiedGenotyper

Hi all,

We have noticed in our lab that GATK UnifiedGenotyper misses some variants (such as the one shown in the IGV figure below) that seem reasonably obvious, with good alt allele counts.

I ran GATK UG -all_bases, and this site was called as
Code:
1       43308444        .       A       .       35.64   .       AC=0;AF=0.00;AN=2;DP=38;Dels=0.00;MQ=57.26;MQ0=0        GT:DP:GQ:PL     0/0:24:5.65:0,6,677
Does anybody have any ideas why GATK UG called this site as ref (with such high confidence!) and how to loosen up the GATK UG parameters to have this site called as a variant?
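To make the genotype fields in that record easier to read, here is a minimal Python sketch that splits the sample column and converts the phred-scaled PL values into normalized probabilities. It assumes the usual VCF ordering of PL (hom-ref, het, hom-alt) and a flat prior over genotypes, which is a simplification and not necessarily what GATK UG does internally. The helper names `parse_sample` and `pl_to_probs` are made up for illustration.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the corresponding sample value."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def pl_to_probs(pl_string):
    """Convert phred-scaled likelihoods (PL) to normalized probabilities.

    Assumption: flat prior over genotypes, so this is only a rough guide.
    """
    pls = [int(x) for x in pl_string.split(",")]
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


# The record from the post above, with tabs between columns.
record = (
    "1\t43308444\t.\tA\t.\t35.64\t.\t"
    "AC=0;AF=0.00;AN=2;DP=38;Dels=0.00;MQ=57.26;MQ0=0\t"
    "GT:DP:GQ:PL\t0/0:24:5.65:0,6,677"
)

fields = record.split("\t")
sample = parse_sample(fields[8], fields[9])
probs = pl_to_probs(sample["PL"])

# Assumed genotype order for a biallelic site: 0/0, 0/1, 1/1.
for genotype, p in zip(("0/0", "0/1", "1/1"), probs):
    print(f"{genotype}: {p:.3f}")
```

Under these assumptions the het genotype is only 6 phred units behind hom-ref (PL 0,6,677), so it keeps roughly a fifth of the probability mass, which is consistent with the low GQ of 5.65.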

Yilong


Last edited by Yilong Li; 04-05-2011 at 07:23 AM. Reason: updated image url
Old 04-05-2011, 09:34 AM   #2
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Two things to observe:

1. The "C" bases are shaded or greyed out, meaning they are of low quality.
2. They are towards the ends of reads (SOLiD?), and are therefore more ambiguous. Also, the top C read has another improper C to the left (same for some of the others), so how much should we trust these reads?

I think the genotyper did the absolutely correct thing. Have you tried validating this with Sanger sequencing?

I would not trust a C here.
Old 04-05-2011, 09:58 AM   #3
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41

Sorry, I didn't provide enough information regarding the situation. I ran GATK UG using --min_base_quality 1 (so all the grey C bases should have been included in the calling). Data is from Illumina.

There are several C bases at the position with decent quality, although I must say those reads do seem a bit dubious. I agree that this could be a false positive (I haven't Sanger sequenced it), but I am still interested in what happens internally in GATK UG at this position and how I could get this or similar positions called as variant loci. I am trying to find recurrent mutations within a set of samples and therefore want to achieve relatively high sensitivity for single-sample variant calls.
Old 04-05-2011, 11:15 AM   #4
jstjohn
Member
 
Location: San Francisco, CA

Join Date: Jun 2010
Posts: 35

Hi Yilong,
Did you follow the full GATK protocol with realignment and base quality recalibration prior to genotyping? I agree that a "C" call does not seem very likely in that position though.
-John
Old 04-05-2011, 12:24 PM   #5
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41

I have done realignment but not base quality recalibration.

If I understood correctly though, GATK UG calls variants from a BAM file and does not know whether realignment or recalibration has been done, so even if the base qualities I use were recalibrated, GATK UG wouldn't call variants any differently.
Old 04-05-2011, 12:50 PM   #6
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Also observe that all the Cs are from reads on the same strand. Not a good thing!
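One common way to put a number on this kind of strand imbalance is Fisher's exact test on a 2x2 table of (ref, alt) by (forward, reverse) read counts. Below is a minimal standard-library sketch. The counts are hypothetical, since the real numbers are only visible in the IGV figure, and the function name `fisher_exact_two_sided` is just for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: ref / alt allele. Columns: forward / reverse strand.
    Returns the summed probability of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: ref reads split evenly, all alt reads on one strand.
ref_fwd, ref_rev = 9, 9
alt_fwd, alt_rev = 6, 0
p_value = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
print(f"strand bias p-value: {p_value:.4f}")
```

A small p-value suggests the alt allele is not distributed across strands the way the ref allele is. With only a handful of alt reads the test has little power, so it is best read alongside the other warning signs rather than on its own.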
Old 04-05-2011, 01:07 PM   #7
Yilong Li
Member
 
Location: WTSI

Join Date: Dec 2010
Posts: 41

Yes, I fully agree that this particular position has so many dubious factors that it is unlikely to be a true variant.

However, currently we would like to find recurrent variants present in multiple samples. For example, if a high confidence variant was called at this position in another sample, we would like to know whether there is even the weakest evidence that a variant exists also in the current sample. That's why we want to get the not-so-likely variants called as well.
Old 04-05-2011, 01:39 PM   #8
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

You need to be careful about calling so aggressively that the "not-so-likely" variants are called. What will most likely happen is that you will end up with a bunch of recurrent variants that are all noise, as some regions will come up again and again with such relaxed settings, and you will waste your time and money validating false positives. If your goal is what you say, try screening samples with a stringent filter, then generate a pileup of the individual positions with samtools to see if the same variant exists in the other samples. Assuming the variant is frequent, it should show up with a stringent filter in at least one sample, if not a couple or most of the samples. This should limit the number of false positives detected and spare you the horrible realization that your bioinformatics filtering strategy resulted in a validation rate of 2%. That tends to make PIs and pipet jockeys chase you around the institute with anger in their eyes.
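The "check the pileup in the other samples" step could be sketched roughly as follows. This assumes you already have the read-bases column of a samtools pileup line for the position of interest in each sample. The sample names and pileup strings here are invented for illustration, and `count_pileup_bases` is a hypothetical helper, not part of samtools.

```python
import re
from collections import Counter


def count_pileup_bases(read_bases, ref_base):
    """Count (base, strand) observations in a pileup read-bases string."""
    # Drop read-start markers ('^' plus one mapping-quality character)
    # and read-end markers ('$').
    s = re.sub(r"\^.", "", read_bases).replace("$", "")
    ref = ref_base.upper()
    counts = Counter()
    i = 0
    while i < len(s):
        ch = s[i]
        if ch in "+-":
            # Indel: sign, length digits, then that many bases to skip.
            j = i + 1
            while j < len(s) and s[j].isdigit():
                j += 1
            i = j + int(s[i + 1:j])
            continue
        if ch == ".":
            counts[(ref, "+")] += 1          # ref match, forward strand
        elif ch == ",":
            counts[(ref, "-")] += 1          # ref match, reverse strand
        elif ch in "ACGTN":
            counts[(ch, "+")] += 1           # mismatch, forward strand
        elif ch in "acgtn":
            counts[(ch.upper(), "-")] += 1   # mismatch, reverse strand
        # '*' (base deleted in this read) and other symbols are ignored.
        i += 1
    return counts


# Hypothetical read-bases strings at one position (ref = A) in two samples.
pileups = {
    "sample_a": ".,.,CCC^F.",
    "sample_b": ",,..$+2AG.c",
}

results = {name: count_pileup_bases(bases, "A") for name, bases in pileups.items()}
for name, counts in results.items():
    fwd, rev = counts[("C", "+")], counts[("C", "-")]
    print(f"{name}: C forward={fwd} reverse={rev}")
```

Keeping the forward and reverse counts separate makes it easy to spot a candidate that is supported on only one strand in every sample, which is the kind of recurrent noise warned about above.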