SEQanswers

Old 04-20-2017, 06:27 AM   #1
ManoloS7
Junior Member
 
Location: Spain

Join Date: Apr 2017
Posts: 2
hg19 vs hg38. Notable differences.

Hi,

I'm working with some clinical exome sequencing data and I have a few questions.

I followed the GATK Best Practices pipeline to obtain my VCF files. I used both hg19 and hg38 as the reference genome (in parallel) in order to compare the results. From what I have read, most of the differences between the two builds are in non-coding regions of the genome, but I'm seeing a lot of differences when comparing lists of candidate genes. When I apply my filtering conditions to the data, the lists generated with one build or the other are quite different. Besides, I'm facing some issues with the dbSNP IDs (the rsXXXXX).

Some examples:

The ID rs4940595 appears in both call sets, but differently. With hg19, this position has T as the reference allele and G as the alternate. When I annotate it (VEP, SnpEff), the consequence of this SNP is stop lost. With hg38, on the other hand, this position has G as the reference allele and T as the alternate, and the annotated consequence is stop gained. This really surprises me.

The ID rs855581 also appears in both call sets. This time the annotation is not the problem; the genotypes are. With hg19, some individuals are called homozygous and others heterozygous. With hg38, all of them are called homozygous.

I ran a liftOver with the UCSC tool to make sure that the IDs are annotated at the corresponding positions in both builds, and they are.

These are just two examples. In general, I also find more variants with hg38 than with hg19. And even though my hg19 lists are smaller, they contain some variants that I don't see with hg38. They are just gone.
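A quick way to put numbers on "more variants" and "just gone" is to compare the two call sets by dbSNP ID. The following is only a minimal sketch: it assumes plain-text, tab-separated VCF lines, and the function names and toy records are made up for illustration (the positions are not real coordinates).

```python
def read_ids(vcf_lines):
    """Collect dbSNP IDs (VCF column 3), skipping headers and records without an ID."""
    ids = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        id_field = line.rstrip("\n").split("\t")[2]
        # The ID column may hold several semicolon-separated IDs, or '.' for none.
        for rsid in id_field.split(";"):
            if rsid != ".":
                ids.add(rsid)
    return ids


def compare_call_sets(hg19_lines, hg38_lines):
    """Split the IDs into shared, hg19-only and hg38-only sets."""
    a, b = read_ids(hg19_lines), read_ids(hg38_lines)
    return {"shared": a & b, "only_hg19": a - b, "only_hg38": b - a}


# Toy records, invented for this example.
header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
hg19_demo = header + [
    "chr1\t100\trs1\tA\tG\t50\tPASS\t.",
    "chr1\t200\trs2\tC\tT\t50\tPASS\t.",
]
hg38_demo = header + [
    "chr1\t150\trs2\tC\tT\t50\tPASS\t.",
    "chr1\t300\trs3\tG\tA\t50\tPASS\t.",
]

result = compare_call_sets(hg19_demo, hg38_demo)
for name, ids in result.items():
    print(name, sorted(ids))
```

Matching by ID rather than by position sidesteps the coordinate shift between builds, but it ignores variants that have no rsID, so the hg19-only and hg38-only lists are a starting point for inspection rather than a final answer.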

Any thoughts on that?

Thank you so much.
Old 04-20-2017, 08:01 AM   #2
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 27

About rs4940595: you're right, the hg19 build carries the (more frequent) T allele, which produces a stop codon, while GRCh38 carries the less frequent G allele. Anyway, considering that both alleles are very frequent in the population (about 65% T and 35% G overall), just make clear which build you used for your analysis and be consistent with it; then there's no problem, in my opinion.
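To make the swap concrete, here is a small sketch that checks whether REF and ALT are exchanged between two builds and re-expresses a genotype against the other reference allele. The function names are hypothetical, and it only handles simple biallelic SNPs on the same strand.

```python
def allele_relationship(rec_a, rec_b):
    """Compare two (REF, ALT) pairs: 'same', 'swapped', or 'different'."""
    if rec_a == rec_b:
        return "same"
    if rec_a == (rec_b[1], rec_b[0]):
        return "swapped"
    return "different"


def flip_genotype(gt):
    """Re-express a biallelic genotype (e.g. '0/1') with the other allele as reference."""
    sep = "|" if "|" in gt else "/"
    # 0 becomes 1 and 1 becomes 0; missing alleles ('.') are left untouched.
    flipped = [{"0": "1", "1": "0"}.get(allele, allele) for allele in gt.split(sep)]
    if sep == "/":
        flipped.sort()  # unphased genotypes: keep the conventional 0/1 ordering
    return sep.join(flipped)


# (REF, ALT) for rs4940595 as reported in this thread.
hg19_alleles = ("T", "G")
hg38_alleles = ("G", "T")

print(allele_relationship(hg19_alleles, hg38_alleles))
for gt in ("0/0", "0/1", "1/1"):
    print("hg19", gt, "-> hg38", flip_genotype(gt))
```

Note what the flip implies: a sample that is homozygous T is 0/0 against hg19, so it usually does not show up as a variant at all, but it is 1/1 against GRCh38. Sites like this may account for part of the variants that appear with one build and vanish with the other.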

About rs855581: that's another story. There could be several reasons, and I might not be the most experienced person to comment, but... It could be that one of the two builds includes a pseudogene, so some reads at that position get a low mapping score because they also match the pseudogene (i.e. they are not uniquely aligned), and as a consequence these reads aren't counted. Apart from that, the sequence there is identical in the two builds, so it's not a question of a difficult region.
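One way to test the pseudogene idea is to count how many reads spanning the site have a low mapping quality in each alignment. This is a rough sketch under assumptions: it reads SAM-format text lines (for example, the output of converting a BAM region to text), the `min_mapq=20` cutoff is an arbitrary illustrative value rather than what any particular caller uses, and the toy reads are invented.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) that an alignment spans on the reference."""
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


def mapq_tally(sam_lines, chrom, position, min_mapq=20):
    """Count reads spanning chrom:position, split by mapping quality."""
    counts = {"kept": 0, "low_mapq": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        mapq, cigar = int(fields[4]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":
            continue  # unmapped read, other contig, or no alignment
        start, end = reference_span(pos, cigar)
        if start <= position <= end:
            counts["kept" if mapq >= min_mapq else "low_mapq"] += 1
    return counts


# Toy alignments, invented for this example (QNAME FLAG RNAME POS MAPQ CIGAR ...).
demo = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t120\t0\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t200\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
tally = mapq_tally(demo, "chr1", 130)
print(tally)
```

If the low-MAPQ share at the site is clearly higher in one build than in the other, that would be consistent with reads being split between the gene and a similar sequence elsewhere in that build.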

About the other questions, I don't have an answer, only a piece of advice: I've learned a lot by viewing the VCF and BAM files of "strange" calls in IGV; you can really see how some parameters affect variant calling.

Last edited by r.rosati; 04-20-2017 at 08:03 AM.
Old 04-21-2017, 02:05 AM   #3
ManoloS7
Junior Member
 
Location: Spain

Join Date: Apr 2017
Posts: 2

Dear r.rosati,

First of all, thank you for your answer.

I'm using IGV to visualize my data around that SNP (rs855581), and I see more SNPs mapped around it in hg19 than in hg38, but I'm not sure whether that means anything at all.

My real problem here is that I don't know whether I can trust hg38 more than hg19, or the other way around.
Tags
dbsnp, differences, hg19, hg38
