I have a bioinformatics query on the exome project we are running. We are using a NimbleGenV2 exome capture kit for target capture.
It's an unusual sort of question. It has been nagging me for more than a week now, and nobody has been able to provide a good answer yet:
Let's say I have processed raw reads from a tumor-normal paired exome experiment and made them fit for mutation calling. I have two BAM files (one each for tumor and normal) that I feed into a mutation caller, and since it's an exome experiment,
Case 1: I restrict variant calling to the target regions only, by passing the .bed file from the NimbleGen website as the interval parameter.
Now, theoretically all the mutation calls made by the caller are exonic or splicing. I have 2100 SNVs.
I run these calls through annotation software and annotate them against a refgene set (Annovar, which uses the UCSC refgene set downloaded directly). More than 92% of the SNVs are annotated as "exonic" or "splicing", as expected.
Case 2: I restrict variant calling to exons + 10 bases only, by generating a .bed file of refgenes from the UCSC Table Browser and passing it as the interval parameter.
Now, once again, theoretically all the mutation calls made by the caller are exonic or splicing. I have 2700 SNVs.
But when I run these calls through annotation software and annotate them against a refgene set (Annovar again), only approximately 65%-75% of the calls are exonic or splicing. The rest are annotated as intronic, upstream, downstream and a zillion other things.
(1) My understanding is that the 2100 vs 2700 difference comes from misalignment of a fraction of the reads to non-target regions, so the extra 600 SNVs are, for the most part, false positive mutation calls (correct me if I am wrong).
(2) The 92% vs 65-75%, on the other hand, is quite inexplicable. In both cases the caller was asked to call variants in exonic regions only: in the former case these were the capture target regions, and in the latter case the refgene set of exons obtained from the Table Browser. I would have expected >90% exonic variants in Case 2 as well.
Have you noticed this before? Is there an explanation as to why (2) is happening?
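In case it helps anyone reproduce the comparison, here is a minimal sketch of a sanity check I could run on my side: it counts what fraction of the called positions actually fall inside a given .bed file. The function names are made up for illustration, and it assumes standard BED coordinates (0-based, half-open) and 1-based variant positions as in VCF. It does not reproduce what Annovar does; it only checks the intervals themselves.

```python
from bisect import bisect_right
from collections import defaultdict


def load_intervals(bed_lines):
    """Parse BED lines into sorted, merged (start, end) lists per chromosome.

    Coordinates are kept as in BED: 0-based start, exclusive end.
    """
    by_chrom = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and UCSC header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        out = [ivs[0]]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], e))
            else:
                out.append((s, e))
        merged[chrom] = out
    return merged


def in_intervals(merged, chrom, pos1):
    """Return True if the 1-based position pos1 lies inside any interval."""
    ivs = merged.get(chrom, [])
    p0 = pos1 - 1  # convert the 1-based position to 0-based
    # Index of the last interval whose start is <= p0.
    i = bisect_right(ivs, (p0, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= p0 < ivs[i][1]


def fraction_inside(merged, variants):
    """Fraction of (chrom, pos1) variants that fall inside the intervals."""
    if not variants:
        return 0.0
    hits = sum(in_intervals(merged, c, p) for c, p in variants)
    return hits / len(variants)
```

Running this once with the NimbleGen .bed and once with the Table Browser .bed against the same list of SNV positions would at least show whether the two interval sets cover the calls the way I assume they do.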