Hi,
I’m working with some clinical exome sequencing data and I have a few questions.
I followed the GATK Best Practices pipeline to obtain my VCF files. I ran it twice in parallel, once with hg19 and once with hg38 as the reference genome, in order to compare the results. From what I have read, most of the differences between the two builds are in non-coding regions, but I’m seeing many differences when comparing lists of candidate genes. When I apply the same filtering criteria to both call sets, the resulting lists are quite different. I’m also running into issues with the dbSNP IDs (the rsXXXXX).
Some examples:
The ID rs4940595 appears in both call sets, but with different alleles. With hg19, this position has T as the reference allele and G as the alternate allele, and when I annotate it (VEP, SnpEff) the consequence of this SNP is stop lost. With hg38, the same position has G as the reference allele and T as the alternate allele, and the annotated consequence is stop gained. This really surprises me.
The ID rs855581 also appears in both call sets. This time the annotation is not the problem; the genotypes are. With hg19, some individuals are called homozygous and others heterozygous. With hg38, all of them are called homozygous.
I ran the UCSC liftOver tool to check that the IDs map to the corresponding positions in both builds, and they do.
These are just two examples. In general, I also find more variants with hg38 than with hg19. Even so, although my hg19 lists are smaller, they contain some variants that I don’t see with hg38. They are just gone.
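To make the comparison concrete, here is a minimal sketch of how the two call sets could be compared by rsID. It is not part of my actual pipeline: the function names (`read_alleles`, `compare_builds`) are made up for illustration, the chromosome and position values are placeholders, and `rsOnly19` / `rsOnly38` are invented IDs. Only the REF/ALT alleles of rs4940595 come from my real data. It assumes biallelic sites and alleles reported on the same strand.

```python
from typing import Dict, Iterable, List, Tuple

Alleles = Tuple[str, str]  # (REF, ALT)


def read_alleles(vcf_lines: Iterable[str]) -> Dict[str, Alleles]:
    """Map each rsID to its (REF, ALT) pair, skipping headers and records without an ID."""
    alleles: Dict[str, Alleles] = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rsid, ref, alt = fields[2], fields[3], fields[4]
        if rsid != ".":
            alleles[rsid] = (ref, alt)
    return alleles


def compare_builds(hg19: Dict[str, Alleles], hg38: Dict[str, Alleles]) -> Dict[str, List[str]]:
    """Classify every rsID by how it appears in the two call sets."""
    report: Dict[str, List[str]] = {
        "same": [], "swapped": [], "different": [], "only_hg19": [], "only_hg38": [],
    }
    for rsid in sorted(set(hg19) | set(hg38)):
        if rsid not in hg38:
            report["only_hg19"].append(rsid)
        elif rsid not in hg19:
            report["only_hg38"].append(rsid)
        elif hg19[rsid] == hg38[rsid]:
            report["same"].append(rsid)
        elif hg19[rsid] == hg38[rsid][::-1]:
            # REF and ALT are exchanged between the two builds
            report["swapped"].append(rsid)
        else:
            report["different"].append(rsid)
    return report


# Toy records: CHROM and POS are placeholders, not real coordinates.
hg19_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chrN\t100\trs4940595\tT\tG",
    "chrN\t200\trsOnly19\tA\tC",   # invented ID
]
hg38_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chrN\t150\trs4940595\tG\tT",
    "chrN\t250\trsOnly38\tC\tT",   # invented ID
]

report = compare_builds(read_alleles(hg19_lines), read_alleles(hg38_lines))
for category, ids in report.items():
    print(category, ids)
```

With real files, the lists of lines would come from the two VCFs. The "swapped" category is the rs4940595 situation, and "only_hg19" / "only_hg38" correspond to the variants that appear with one reference but not the other.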
Any thoughts on that?
Thank you so much.