**r.rosati** · 04-20-2017, 08:01 AM

About rs4940595: you're right, the hg19 build includes the (more frequent) T allele, which creates a stop codon, while GRCh38 includes the less frequent G allele. Anyway, since both alleles are very common in the population (about 65% T and 35% G overall), just make clear which build you used for your analysis and be consistent with it; then there's no problem, in my opinion.

About rs855581: that's another story. There could be several reasons, and I might not be the most experienced person to comment, but one possibility is that one of the two builds includes a pseudogene, so some reads at that position get a low mapping score because they also match the pseudogene (i.e. they are not uniquely aligned), and as a consequence these reads aren't counted. Apart from that, the sequence there is identical in the two builds, so it's not a question of a difficult region.
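
One way to check the non-unique alignment idea is to look at the mapping qualities of the reads covering the position in each build. A minimal sketch, assuming plain-text SAM records (for example from `samtools view`) and an illustrative MAPQ cutoff of 20; the function name and the example reads are made up, and the overlap test is a simplification that ignores indels and clipping in the CIGAR:

```python
def mapq_summary(sam_lines, chrom, pos, min_mapq=20):
    """Count reads covering chrom:pos (1-based), split by a MAPQ cutoff.

    Simplification: a read is assumed to span POS .. POS + len(SEQ) - 1.
    Returns (reads at or above the cutoff, reads below the cutoff).
    """
    kept = dropped = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, start, mapq, seq = fields[2], int(fields[3]), int(fields[4]), fields[9]
        if rname != chrom or not (start <= pos < start + len(seq)):
            continue
        if mapq >= min_mapq:
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Hypothetical records: two confidently placed reads and one ambiguous read (MAPQ 0).
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr1\t105\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t16\tchr1\t98\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(mapq_summary(example, "chr1", 106))  # (2, 1)
```

If the same sample shows many more low-MAPQ reads at the site in one build than in the other, that would be consistent with the extra-copy explanation, though it does not prove it.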

About the other questions, I don't have an answer, just a piece of advice: I've learned a lot by visualizing the VCF and BAM of "strange" calls in IGV; you can really grasp how some parameters affect variant calling.

**ManoloS7** · 04-21-2017, 02:05 AM

Dear r.rosati,

First of all, thank you for your answer.

I'm using IGV to visualize my data around that SNP (rs855581), and I see more SNPs mapped around it in hg19 than in hg38, but I'm not sure whether that means anything at all.

My real problem here is that I don't know whether I can trust hg38 more than hg19 or the other way around.

**Dario1984** · 04-24-2017, 06:00 PM

hg38 is a corrected and improved version of hg19. You should use the newer and better assembly. You should also specify which version of hg38 you use. The latest version is GRCh38.p10 or in other words hg38 patch 10.

**ManoloS7** · 04-25-2017, 12:29 AM

I was thinking of that when I decided to use both versions. Everybody uses hg19 because it's easier to work with (for exome data), but hg38 is an improved version of it.

Do you suggest that I ignore all the results I obtained with hg19 and use only the ones from hg38?

Thanks.

**Dario1984** · 04-25-2017, 07:00 PM

I agree that there are a lot of publicly available exome sequencing datasets that use hg19. I recommend using the results based on hg38 and converting them into some gene-based naming format, such as BRAF C296T. Then, you can compare mutations to existing analyses using a different genome more easily. ANNOVAR is a good way to do that.
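
A minimal sketch of that kind of comparison, assuming each call set has already been annotated (for example with ANNOVAR) and reduced to (gene, change) pairs. The function name and the records other than the BRAF example are invented for illustration:

```python
def compare_by_gene_change(calls_a, calls_b):
    """Compare two call sets keyed by (gene, change) instead of coordinates."""
    a, b = set(calls_a), set(calls_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Hypothetical annotated calls from the two alignments of the same sample.
hg19_calls = [("BRAF", "C296T"), ("GENE1", "A10G")]
hg38_calls = [("BRAF", "C296T"), ("GENE2", "G55A")]

result = compare_by_gene_change(hg19_calls, hg38_calls)
for label, variants in result.items():
    print(label, variants)
```

Keying on gene and change sidesteps the coordinate shift between builds, but it only works when both call sets were annotated against the same transcript for each gene.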

**ManoloS7** · 04-26-2017, 02:38 AM

I am currently using SnpEff to do that, and I find it a very useful tool.
The thing is that I get more variants after filtering with hg38 than with hg19, but many of them appear because, for instance, what was A → G in hg19 is now G → A in hg38. As a result, a lot of positions that were 0/0 are now 1/1. In other words, what used to be nothing is now a mutation. Which should I trust?
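
One way to separate these cases from real differences is to translate each genotype back into the bases the sample actually carries and compare those. A minimal sketch, assuming biallelic sites, called genotypes (no `./.`), both builds reporting the same strand, and calls already matched between builds (for example by rsID); the function names are hypothetical:

```python
def genotype_alleles(ref, alt, gt):
    """Turn a biallelic VCF genotype such as '0/1' into the actual bases."""
    alleles = [ref, alt]
    return sorted(alleles[int(i)] for i in gt.replace("|", "/").split("/"))


def classify(hg19_call, hg38_call):
    """Each call is (ref, alt, gt) for the same site in the same sample."""
    same_bases = genotype_alleles(*hg19_call) == genotype_alleles(*hg38_call)
    ref_changed = hg19_call[0] != hg38_call[0]
    if same_bases and ref_changed:
        return "reference allele swapped, sample unchanged"
    if same_bases:
        return "identical call"
    return "sample genotype differs, inspect the reads"


# The situation described above: A>G in hg19 becomes G>A in hg38,
# and a sample that is homozygous A goes from 0/0 to 1/1.
print(classify(("A", "G", "0/0"), ("G", "A", "1/1")))
```

A site flagged as "reference allele swapped" says nothing new about the sample; only the third category points to a change worth checking in IGV.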

Thanks again.

**r.rosati** · 04-26-2017, 06:45 AM

Well, on this specific issue: on the one hand, I would say that 99.99% of the time (not always) this happens when a very common variant in hg19 becomes the reference allele in GRCh38, or vice versa.
So if you are looking for disease-causing mutations, either the variant is no longer called at all, because it is so common that it entered the reference, or it is called as homozygous but you end up filtering it out for being very common. And if you are comparing genotypes between samples, you can still apply your algorithm as long as all samples were aligned to the same reference version. With this in mind, if you have to choose, it is better to align to the more current version.
On the other hand, you've piqued my curiosity about the 0.01% of cases where this does not hold, and I wanted to check which hg19 positions actually carry a rare allele that you wouldn't normally filter out of an analysis. And indeed, there are some. Amazing.
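
The arithmetic behind both cases can be sketched in a few lines. This assumes a biallelic site, so that swapping REF and ALT turns an alternate-allele frequency `af` into `1 - af`, and an illustrative "rare" cutoff of 1%; the function names and the 0.5% example are hypothetical, while the 65%/35% figures are the rs4940595 numbers quoted earlier in the thread:

```python
def alt_af_after_ref_swap(alt_af):
    """At a biallelic site, the new ALT allele is the old REF allele."""
    return 1.0 - alt_af


def is_rare(alt_af, max_af=0.01):
    """True if the variant would survive a 'keep rare variants only' filter."""
    return alt_af <= max_af


# Common in both builds: T is about 65%, G about 35%, so the ALT allele
# is filtered out as common whichever allele is the reference.
common_in_both = (is_rare(0.35), is_rare(alt_af_after_ref_swap(0.35)))

# Hypothetical site where the old reference carried a rare allele (0.5%):
# the old ALT is nearly fixed, but after the swap the ALT allele is rare.
old_alt_af = 0.995
rare_after_swap = (is_rare(old_alt_af), is_rare(alt_af_after_ref_swap(old_alt_af)))

print(common_in_both)   # (False, False)
print(rare_after_swap)  # (False, True)
```

The second case is the interesting one: carriers of the rare allele show no call against the old reference, and a rare variant that passes the filter against the new one.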

**ManoloS7** · 04-26-2017, 07:49 AM

I didn't know about that paper; it is indeed very interesting.

One explanation for the frequency problem is that I annotate frequencies using ExAC and, as you may know, this database is in hg19. I ran liftOver on the VCF file they provide in order to annotate my hg38 calls, but I lost some positions. The mutations at those positions now look "new", so I keep them when filtering.
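
A minimal sketch of a filter that keeps "no frequency available" separate from "rare", so that sites missing from the lifted annotation are not silently treated as novel. It assumes the lifted file has been loaded into a dictionary keyed by (chrom, pos, ref, alt) in hg38 coordinates; the function name, the 1% cutoff and the example data are invented:

```python
def frequency_status(calls, lifted_af, max_af=0.01):
    """Label each call as 'rare', 'common' or 'no_frequency'.

    calls:     iterable of (chrom, pos, ref, alt) tuples in hg38 coordinates
    lifted_af: dict mapping the same kind of tuple to an allele frequency,
               built from the lifted-over annotation file
    """
    status = {}
    for key in calls:
        af = lifted_af.get(key)
        if af is None:
            # Either lost during liftOver or genuinely absent from the database.
            status[key] = "no_frequency"
        elif af <= max_af:
            status[key] = "rare"
        else:
            status[key] = "common"
    return status


# Invented example data.
lifted = {("chr1", 1000, "A", "G"): 0.30, ("chr2", 2000, "C", "T"): 0.001}
calls = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr3", 3000, "G", "A"),
]
labels = frequency_status(calls, lifted)
for key, label in labels.items():
    print(key, label)
```

Counting the `no_frequency` calls in each build gives a rough idea of how much of the extra hg38 list comes from annotation gaps rather than from the alignment itself.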

Even so, I still have no real clue why I'm obtaining more mutations with hg38. And, as I said in my first message, "even though with hg19 my lists are smaller, there are some variants that I don’t see with hg38. They are just gone."

Thank you again for your help.

**r.rosati** · 04-26-2017, 09:46 AM

Originally posted by ManoloS7

And, as I said in my first message, "even though with hg19 my lists are smaller, there are some variants that I don’t see with hg38. They are just gone."

I'm definitely not the most knowledgeable around here, but if you would like to provide an example, we could work it out - i.e. an IGV screenshot, using the GRCh38 reference, of a BAM file showing the location corresponding to a variant that was called with hg19 and isn't shown anymore on GRCh38.

**ManoloS7** · 04-27-2017, 02:16 AM

This is the IGV screenshot from a bam file of a sample with hg19:

And this one, the same sample with hg38 in the same region (coordinates obtained from UCSC liftOver).

**r.rosati** · 04-27-2017, 11:18 AM

...on the bright side, I should mention that we own an Ion Proton, so over time I have actually become very good at finding out why a variant call went wrong.

**ManoloS7** · 04-28-2017, 04:14 AM

OK, I tried to post a screenshot of IGV but I think the message is being reviewed.

I saw that some reads that map in hg19 to a (let's call it) regular chromosome, now in hg38 map to a random chromosome, and that is why I'm loosing them. I don't know why they don't map to the regular one also.

**Dario1984** · 04-30-2017, 05:00 PM

Originally posted by ManoloS7

I saw that some reads that map in hg19 to a (let's call it) regular chromosome, now in hg38 map to a random chromosome, and that is why I'm loosing (sic) them.

What do you mean by random chromosome? The reference assembly doesn't contain random sequences.

**ManoloS7** · 05-02-2017, 12:57 AM

For example, I have a read that maps to chr9 in hg19, but the same read maps to chr19_GL949750v2_alt in hg38.
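
To see how many reads end up on such contigs, the alignments can be tallied by contig type. A minimal sketch, assuming UCSC-style hg38 contig names (suffixes `_alt`, `_random`, `_fix` and the prefix `chrUn_`) and plain-text SAM records; the function names and the example reads are made up, and anything that matches none of the patterns is simply counted as "primary":

```python
from collections import Counter


def contig_class(name):
    """Classify a UCSC-style hg38 contig name by its prefix or suffix."""
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_fix"):
        return "fix_patch"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "primary"


def count_reads_by_class(sam_lines):
    """Tally aligned reads per contig class using the RNAME column."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        rname = line.split("\t")[2]
        if rname != "*":  # '*' marks an unaligned read
            counts[contig_class(rname)] += 1
    return counts


# Hypothetical reads, including one on the alt contig mentioned above.
example = [
    "r1\t0\tchr9\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr19_GL949750v2_alt\t120\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_reads_by_class(example))
```

Running the same tally on the hg19 and hg38 BAMs of one sample would show whether the "missing" reads have moved to alternate contigs rather than disappeared.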

hg19 vs hg38. Notable differences.
