SEQanswers





Old 04-20-2017, 06:27 AM   #1
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
hg19 vs hg38. Notable differences.

Hi,

I'm working with some clinical exome sequencing data and I have a few questions.

I followed the GATK Best Practices pipeline to obtain my VCF files. I used both hg19 and hg38 as reference genomes (in parallel) in order to compare the results. From what I have read, most of the differences between the two versions are in non-coding regions of the genome, but I'm seeing a lot of differences when comparing lists of candidate genes. When I apply the same filtering conditions to both call sets, the lists generated with one genome or the other are quite different. Besides, I'm facing some issues with the dbSNP IDs (the rsXXXXX).

Some examples:

The ID rs4940595 appears in both call sets, but differently. With hg19, this position has T as the reference allele and G as the alternative. When I annotate it (VEP, SnpEff), the consequence of this SNP is a stop lost. With hg38, on the other hand, this position has G as the reference allele and T as the alternative and, when I annotate it, the predicted consequence is a stop gained. This really surprises me.

The ID rs855581 also appears in both call sets. This time the annotation is not the problem; the genotypes are. With hg19, some individuals are homozygous and others heterozygous. With hg38, all of them are called as homozygous.

I did a liftOver using the UCSC tool to make sure that the IDs are annotated at the corresponding positions across versions, and they are.

These are just two examples. In general, I also find more variants when using hg38 than with hg19. Besides, even though my lists are smaller with hg19, there are some variants that I don't see with hg38. They are just gone.

Any thoughts on that?

Thank you so much.
Old 04-20-2017, 08:01 AM   #2
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 38
Default

About rs4940595: you're right, the hg19 build includes the (more frequent) T allele, which codes for a stop codon, while GRCh38 includes the less frequent G allele. Anyway, considering that both alleles are very frequent in the population (about 65% T and 35% G overall), just make clear which build you used for your analysis and be consistent with it; then there's no problem, in my opinion.

About rs855581: that's another story. It could be due to several reasons and I might not be the most experienced to comment, but... It could be that one of the two builds includes a pseudogene, and some reads at that position get a low mapping score because they also match the pseudogene (i.e. they are not uniquely aligned); as a consequence these reads aren't counted. Apart from that, the sequence there is identical in the two builds, so it's not a question of a difficult region.

About the other questions, I don't have an answer, just a piece of advice: I've learned a lot by visualizing the VCF and BAM of "strange" calls in IGV; you can really grasp how some parameters affect variant calling.
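
The mapping-quality hypothesis for rs855581 can be illustrated with a small sketch. The reads, base calls, and the MAPQ cutoff of 20 below are invented for illustration; they are not taken from the poster's data or from any particular caller's defaults.

```python
from collections import Counter


def allele_counts(reads, min_mapq=20):
    """Count bases at one site, ignoring reads below a MAPQ cutoff.

    reads is a list of (base, mapq) tuples. The cutoff is an assumption.
    """
    return Counter(base for base, mapq in reads if mapq >= min_mapq)


# Hypothetical pileup at a single position. In one build the reads carrying
# "A" also match a second locus, so the aligner assigns them MAPQ 0.
build_without_paralog = [("G", 60)] * 6 + [("A", 60)] * 5
build_with_paralog = [("G", 60)] * 6 + [("A", 0)] * 5

print(allele_counts(build_without_paralog))  # both alleles are supported
print(allele_counts(build_with_paralog))     # only G survives the filter
```

With the same sample, the first pileup looks heterozygous and the second looks homozygous, which is the kind of genotype change described above.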

Last edited by r.rosati; 04-20-2017 at 08:03 AM.
Old 04-21-2017, 02:05 AM   #3
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by r.rosati View Post
About rs4940595: you're right, the hg19 build includes the (more frequent) T allele, that codes for a stop codon; while GRCh38 includes the less frequent G allele. Anyways, considering that the two alleles are both very frequent in the population (about 65% T and 35% G overall), just leave clear which build you've used for your analysis and be consistent with it; then there's no problem in my opinion.

About rs855581: that's another story. It could be due to several reasons and I might not be the most experienced to comment, but... It could be that one of the two builds includes a pseudogene, and some reads in that position get a low score for matching the pseudogene also (i.e. not being uniquely aligned); and as a consequence these reads aren't counted. Apart of that, the sequence there is identical in the two builds, so it's not a question of difficult regions.

About the other questions, I don't have an answer but an advice: I've learned a lot by visualizing the VCF and BAM of "strange" calls on IGV; you can really grasp how some parameters can affect variant calling.
Dear r.rosati,

First of all, thank you for your answer.

I'm using IGV to visualize my data around that SNP (rs855581), and I'm seeing more SNPs called around it in hg19 than in hg38, but I'm not sure if that means anything at all.

My real problem here is that I don't know whether I can trust hg38 more than hg19 or the other way around.
Old 04-24-2017, 06:00 PM   #4
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 159
Default

hg38 is a corrected and improved version of hg19. You should use the newer and better assembly. You should also specify which version of hg38 you use. The latest version is GRCh38.p10 or in other words hg38 patch 10.
Old 04-25-2017, 12:29 AM   #5
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by Dario1984 View Post
hg38 is a corrected and improved version of hg19. You should use the newer and better assembly. You should also specify which version of hg38 you use. The latest version is GRCh38.p10 or in other words hg38 patch 10.
I was thinking about that when I decided to use both versions. Everybody uses hg19 because it's easier to work with (for exome data), but hg38 is an improved version of it.

Do you suggest that I ignore all the results I obtained with hg19 and use only the ones from hg38?

Thanks.
Old 04-25-2017, 07:00 PM   #6
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 159
Default

I agree that there are a lot of publicly available exome sequencing datasets that use hg19. I recommend using the results based on hg38 and converting them into some gene-based naming format, such as BRAF C296T. Then, you can compare mutations to existing analyses using a different genome more easily. ANNOVAR is a good way to do that.
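
A minimal sketch of that comparison, assuming each call set has already been annotated to (gene, protein change) pairs with the same transcript set on both builds. The gene names and changes below are placeholders, not real findings.

```python
def compare_call_sets(calls_a, calls_b):
    """Compare two call sets keyed by (gene, protein_change) pairs.

    Because the key contains no genomic coordinate, the comparison does
    not depend on which assembly each set was called against.
    """
    a, b = set(calls_a), set(calls_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Placeholder annotations for illustration only.
hg19_calls = [("GENE_A", "p.Arg10Ter"), ("GENE_B", "p.Gly5Asp")]
hg38_calls = [("GENE_A", "p.Arg10Ter"), ("GENE_C", "p.Leu7Pro")]

result = compare_call_sets(hg19_calls, hg38_calls)
for label, variants in result.items():
    print(label, sorted(variants))
```

The "only_a" and "only_b" groups are the ones worth inspecting in IGV, since they are the calls that do not survive the change of reference.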
Old 04-26-2017, 02:38 AM   #7
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by Dario1984 View Post
I agree that there are a lot of publicly available exome sequencing datasets that use hg19. I recommend using the results based on hg38 and converting them into some gene-based naming format, such as BRAF C296T. Then, you can compare mutations to existing analyses using a different genome more easily. ANNOVAR is a good way to do that.
I am currently using SnpEff to do that, and I find it a very useful tool.
The thing is that I'm getting more variants when applying my filtering with hg38 than with hg19, but a lot of them arise because, for instance, what was A → G in hg19 is now G → A in hg38. This way, a lot of positions that were 0/0 are now 1/1. I mean, what used to be nothing is now a mutation. Which should I trust?

Thanks again.
Old 04-26-2017, 06:45 AM   #8
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 38
Default

Quote:
Originally Posted by ManoloS7 View Post
I am currently using SnpEff to do that, and I find it a very usefull tool.
The thing is that I'm getting more variants when applying my filtering using hg38 than hg19, but a lot of them are because, for instance, what was A → G in hg19 is now G → A in hg38. This way, a lot of positions that were 0/0 are now 1/1. I mean, what back then was nothing now it's a mutation. What should I trust?

Thanks again.
Well, specifically on this issue, on one hand I would say that 99.99% of the time (not always) this happens when a very common variant in hg19 becomes the reference in GRCh38, or vice versa.
So if you are looking for disease-causing mutations, either the variant is no longer called at all, because it is so common that it entered the reference, or it is called as homozygous but you end up filtering it out for being very common. If you are comparing genotypes between samples, you can still apply your algorithm as long as all samples were aligned to the same reference version. With this in mind, if you have to choose, it is better to align to the more current version.
On the other hand, now you've piqued my curiosity about the 0.01% of the time when this is not the case, and I'd like to check which positions in hg19 actually represented a rare allele that you wouldn't normally filter out of an analysis. And indeed, there are some. Amazing.
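
The reference/alternative swap can be made concrete with a short sketch. It assumes a biallelic site and VCF-style GT strings; the function names are made up for this illustration and do not come from any tool mentioned in the thread.

```python
def flip_genotype(gt):
    """Re-express a biallelic genotype after REF and ALT trade places."""
    sep = "|" if "|" in gt else "/"
    swap = {"0": "1", "1": "0", ".": "."}
    alleles = [swap[a] for a in gt.split(sep)]
    if sep == "/":
        # Unphased genotypes are conventionally written low index first.
        alleles.sort(key=lambda a: (a == ".", a))
    return sep.join(alleles)


def alleles_carried(ref, alt, gt):
    """Return the bases a sample carries, independent of the reference."""
    bases = {"0": ref, "1": alt}
    sep = "|" if "|" in gt else "/"
    return tuple(sorted(bases[a] for a in gt.split(sep)))


print(flip_genotype("0/0"))  # homozygous reference becomes homozygous alt
print(alleles_carried("T", "G", "1/1"))
print(alleles_carried("G", "T", "0/0"))
```

The last two calls describe the same sample against two references: the bases carried are identical, and only the label (variant or not) changes.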

Last edited by r.rosati; 04-26-2017 at 06:48 AM.
Old 04-26-2017, 07:49 AM   #9
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by r.rosati View Post
Well, specifically on this issue, on one side I would say that 99.99% of the times (not always) this happens when a very common variation in hg19 becomes reference on GRCh38 or vice-versa.
So in a scenario of looking for disease-causing mutations, either you don't get the variant called at all anymore, because it's so common it entered reference, or you got it called as homozygous, but you'll end up filtering it out due to being very common; and in a scenario of comparing genotypes between samples, you can still apply your algorithm if all samples were aligned to the same reference version. And with this in mind, if you had to choose, better align to the more current version.
On the other side, now you've tickled my curiosity about the 0.01% of times when this is not the case, and I'd like to check which positions on hg19 actually represented a rare allele you wouldn't normally filter out of an analysis. And indeed, there are some. Amazing.
I didn't know about that paper and it's, indeed, very interesting.

One explanation for the frequency problem is that I annotate frequencies using ExAC and, as you may know, this database is in hg19 coordinates. I did a liftOver of the VCF file they provide in order to annotate my hg38 calls, but I lost some positions. The mutations at those positions now seem to be "new", so I keep them when filtering.
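
One way to keep lift-over losses from looking like novel variants is to record why a frequency is missing. The sketch below assumes the hg38 calls have been lifted back to hg19 beforehand and that the frequency table is a plain dictionary; both structures and all values are hypothetical, and reference/alternative swaps and strand changes are not handled.

```python
def frequency_status(call, back_lifted, freqs_hg19):
    """Look up an allele frequency for one hg38 call.

    call: (chrom, pos, alt) in hg38 coordinates.
    back_lifted: maps (chrom, pos) in hg38 to (chrom, pos) in hg19,
        with no entry (or None) when the lift-over failed.
    freqs_hg19: maps (chrom, pos, alt) in hg19 to a frequency.
    Returns (status, frequency_or_None).
    """
    chrom, pos, alt = call
    hg19_site = back_lifted.get((chrom, pos))
    if hg19_site is None:
        return ("unliftable", None)  # unknown, not evidence of novelty
    freq = freqs_hg19.get(hg19_site + (alt,))
    if freq is None:
        return ("absent_from_database", None)
    return ("known", freq)


# Invented coordinates and frequencies, for illustration only.
back_lifted = {("chr1", 1000): ("chr1", 900), ("chr2", 500): ("chr2", 450)}
freqs_hg19 = {("chr1", 900, "G"): 0.35}

print(frequency_status(("chr1", 1000, "G"), back_lifted, freqs_hg19))
print(frequency_status(("chr2", 500, "T"), back_lifted, freqs_hg19))
print(frequency_status(("chr3", 42, "A"), back_lifted, freqs_hg19))
```

Calls with the "unliftable" status can then be reviewed separately instead of passing a rarity filter by default.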

Even knowing that, I still don't have any real clue why I'm obtaining more mutations with hg38. And, as I said in my first message, "even though with hg19 my lists are smaller, there are some variants that I don't see with hg38. They are just gone."

Thank you again for your help.
Old 04-26-2017, 09:46 AM   #10
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 38
Default

Quote:
Originally Posted by ManoloS7 View Post
And, as I said in my first message, "even though with hg19 my lists are smaller, there are some variants that I don’t see with hg38. They are just gone."
I'm definitely not the most knowledgeable around here, but if you would like to provide an example, we could work it out - i.e. an IGV screenshot, using the GRCh38 reference, of a BAM file showing the location corresponding to a variant that was called with hg19 and isn't shown anymore on GRCh38.

Last edited by r.rosati; 04-26-2017 at 09:59 AM.
Old 04-27-2017, 11:18 AM   #11
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 38
Default

...on the bright side, I should mention that we own an Ion Proton, so over time I have actually become very good at finding out why a variant call went wrong.

Last edited by r.rosati; 04-27-2017 at 11:38 AM.
Old 04-28-2017, 04:14 AM   #12
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by r.rosati View Post
...on the good side I should specify that we own an Ion Proton, so over the time I actually became very good at finding out why a variant call went wrong.
OK, I tried to post an IGV screenshot, but I think the message is being reviewed.

I saw that some reads that map to a (let's call it) regular chromosome in hg19 now map to a random chromosome in hg38, and that is why I'm losing them. I don't know why they don't also map to the regular one.
Old 04-30-2017, 05:00 PM   #13
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 159
Default

Quote:
Originally Posted by ManoloS7 View Post
I saw that some reads that map in hg19 to a (let's call it) regular chromosome, now in hg38 map to a random chromosome, and that is why I'm loosing (sic) them.
What do you mean by random chromosome? The reference assembly doesn't contain random sequences.
Old 05-02-2017, 12:57 AM   #14
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by Dario1984 View Post
What do you mean by random chromosome? The reference assembly doesn't contain random sequences.
For example: I have a read that maps to chr9 in hg19, but this same read maps to chr19_GL949750v2_alt in hg38.
Old 05-02-2017, 05:00 PM   #15
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 159
Default

That is not a random chromosome, but an alternative version of part of a chromosome that has lots of variability in the human population. It means that the person you sequenced now has a reference sequence which represents their genome better than hg19 was able to, which only provided one sequence per chromosome without any alternatives.
Old 05-03-2017, 04:39 AM   #16
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by Dario1984 View Post
That is not a random chromosome, but an alternative version of part of a chromosome that has lots of variability in the human population. It means that the person you sequenced now has a reference sequence which represents their genome better than hg19 was able to, which only provided one sequence per chromosome without any alternatives.
I see, thanks.
The problem is that, since those reads are no longer all located at the same locus, there are some variants that are present in hg19 but not in hg38. And that is a problem for my analysis.
Old 05-03-2017, 11:28 AM   #17
r.rosati
Member
 
Location: Brazil

Join Date: Aug 2015
Posts: 38
Default

A post of mine from two days ago got stuck in moderation limbo; it's a pity because it had some good links.
I was commenting that this thread is pretty interesting.
The sequence you mention is an alternate sequence for the highly polymorphic human KIR locus. As I understand it, most alignment algorithms don't handle polymorphic regions represented as alternate scaffolds very well. So if you align against the GRCh38 build plus its alternate contigs, you will see problems like the one you're describing. As I understand it, the best option is indeed to use GRCh38, but omitting these contigs. That is, you might want to use the "analysis set", but not the "full" analysis set version (which contains the alternate scaffolds).
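
To see how many reads are affected, contig names can be grouped by type. The sketch below relies on the UCSC-style hg38 naming pattern as I understand it (suffix `_alt` for alternate scaffolds, `_random` for unlocalized sequences, prefix `chrUn_` for unplaced ones); the list of read contigs is invented, and in practice it would come from the aligned BAM.

```python
import re
from collections import Counter

# Primary assembled chromosomes: chr1..chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def contig_class(name):
    """Classify a UCSC-style hg38 contig name by its naming pattern."""
    if _PRIMARY.match(name):
        return "primary"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"


def tally(read_contigs):
    """Count reads per contig class."""
    return Counter(contig_class(c) for c in read_contigs)


# Hypothetical contig assignments for a handful of reads.
reads = ["chr9", "chr19", "chr19_GL949750v2_alt", "chr19_GL949750v2_alt"]
print(tally(reads))
```

A large "alt" count at a locus of interest suggests re-aligning against a reference without the alternate scaffolds, as proposed above.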

Last edited by r.rosati; 05-03-2017 at 11:38 AM.
Old 05-04-2017, 12:02 AM   #18
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by r.rosati View Post
A post of mine from two days ago got stuck in moderation limbo; it's a pity because it had some good links.
I was commenting that this thread is pretty interesting.
The sequence you mention is an alternate sequence for the highly polymorphic human KIR locus. As I understand, most alignment algorithms don't handle very well polymorphic regions (alternate scaffolds). So if you align vs. the GRCh38 build plus its alternate contigs, you will see problems like the one you're describing. As I understood, the best option is yes to use the GRCh38 version, but omitting these contigs. That is, you might want to use the "analysis set"; but not the "full" analysis set version (which contains the alternate scaffolds).
Oh, that is very interesting and useful information. I will probably redo my analysis to see if now those reads map to the "original" chromosomes.
Thank you so much.
Old 05-08-2017, 05:20 AM   #19
evakoe
Member
 
Location: Italia

Join Date: Jul 2012
Posts: 27
Default

Quote:
Originally Posted by ManoloS7 View Post
The thing is that I'm getting more variants when applying my filtering using hg38 than hg19, but a lot of them are because, for instance, what was A → G in hg19 is now G → A in hg38. This way, a lot of positions that were 0/0 are now 1/1. I mean, what back then was nothing now it's a mutation. What should I trust?
I think that the question you are asking is not really meaningful. When it comes to the reference at a highly variable site, there is not really a right or wrong. Whether the reference is A or G does not matter from a biological perspective. An individual might have an A or a G, which is either the reference or the alternative allele with respect to either hg19 or hg38; that just means it is a variable (i.e. a variant) site. It's just a question of naming.

Quote:
Originally Posted by ManoloS7 View Post
I want to say that I also find, generally, more variants when using hg38 than with hg19. Besides, even though with hg19 my lists are smaller, there are some variants that I don't see with hg38. They are just gone.
Did you check how different hg19 and hg38 are at the sites where you observe these differences? If the references are highly dissimilar, this would explain your vanishing variants. I think that based on the 1000Genomes project, many sites in hg38 were modified to adjust for previous errors.
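
Such a check could be sketched as follows, assuming the (reference, alternative) alleles at corresponding hg19 and hg38 positions have already been collected, for example from the two VCFs after lift-over. The site list is invented; only the first pair mirrors the rs4940595 case from the first post.

```python
from collections import Counter


def classify_site(hg19, hg38):
    """Compare (ref, alt) pairs recorded at corresponding positions."""
    if hg19 == hg38:
        return "unchanged"
    if hg19 == hg38[::-1]:
        return "ref_alt_swapped"
    return "different"


# Each entry: ((ref, alt) in hg19, (ref, alt) in hg38). Illustrative only.
sites = [
    (("T", "G"), ("G", "T")),
    (("A", "C"), ("A", "C")),
    (("C", "T"), ("G", "A")),
]

summary = Counter(classify_site(a, b) for a, b in sites)
print(summary)
```

Sites in the "different" group are the ones where the two references genuinely disagree, rather than merely exchanging the reference and alternative labels.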
Old 05-08-2017, 06:30 AM   #20
ManoloS7
Member
 
Location: Spain

Join Date: Apr 2017
Posts: 10
Default

Quote:
Originally Posted by evakoe View Post
I think that the question you are asking is not really meaningful. When it comes the reference at a highly variable site, there is not really a right or wrong. Whether or not the reference is A or G does not matter from a biological perspective. An individual might have an A or G which is either the reference or the alternative allele with respect to either hg19 or hg38, which just means it is a variable (i.e. a variant) site. It's just a question of naming.
I mostly but not totally agree with that. Of course it does not matter from a biological point of view and of course it is a question of names. But for my analysis it is very important that I see a variant (0/1 or 1/1) or not (0/0). I know that most changes are in positions in which both options are very frequent in the population, but does this happen in all the cases?

Quote:
Originally Posted by evakoe View Post
Did you check how different hg19 and hg38 are at the sites where you observe these differences? If the references are highly dissimilar, this would explain your vanishing variants. I think that based on the 1000Genomes project, many sites in hg38 were modified to adjust for previous errors.
I did not check that directly but it is obvious to me that there should be notable differences.

Tags
dbsnp, differences, hg19, hg38


All times are GMT -8.

