  • hg19 vs hg38. Notable differences.

    Hi,

    I’m working with some clinical exome sequencing data and I have a few questions.

    I followed the GATK best practices pipeline to obtain my VCF files. I used both hg19 and hg38 as reference genomes (in parallel) in order to compare the results. From what I have read, most of the differences between the two versions are in non-coding regions of the genome, but I’m seeing a lot of differences when comparing lists of candidate genes. When I apply the same filtering conditions to both call sets, the resulting lists are quite different depending on the reference used. Besides, I’m facing some issues with the dbSNP IDs (the rsXXXXX).

    Some examples:

    The ID rs4940595 appears in both call sets, but differently. With hg19, this position has T as the reference allele and G as the alternate. When I annotate it (VEP, SnpEff), the consequence of this SNP is stop lost. With hg38, on the other hand, the position has G as the reference allele and T as the alternate and, when I annotate it, the predicted consequence is stop gained. This really surprises me.

    The ID rs855581 also appears in both call sets. This time the annotation is not the problem; the genotypes are. With hg19, some individuals are homozygous and others heterozygous. With hg38, all of them are called as homozygous.

    I ran the UCSC liftOver tool to make sure that the IDs are correctly annotated at those positions across versions, and they are.

    These are just two examples. In general, I also find more variants with hg38 than with hg19. Besides, even though with hg19 my lists are smaller, there are some variants that I don’t see with hg38. They are just gone.

    Any thoughts on that?

    Thank you so much.

  • #2
    About rs4940595: you're right, the hg19 build carries the (more frequent) T allele, which creates a stop codon, while GRCh38 carries the less frequent G allele. In any case, since both alleles are very frequent in the population (about 65% T and 35% G overall), just state clearly which build you used for your analysis and be consistent with it; then there's no problem, in my opinion.

    About rs855581: that's another story. It could be due to several reasons, and I might not be the most experienced person to comment, but it could be that one of the two builds includes a pseudogene, so that some reads at that position also match the pseudogene and get a low mapping score (i.e. they are not uniquely aligned); as a consequence, these reads aren't counted. Apart from that, the sequence there is identical in the two builds, so it's not a question of a difficult region.

    About the other questions, I don't have an answer, only a piece of advice: I've learned a lot by visualizing the VCF and BAM of "strange" calls in IGV; you can really grasp how some parameters affect variant calling.
    Last edited by r.rosati; 04-20-2017, 08:03 AM.



    • #3
      Dear r.rosati,

      First of all, thank you for your answer.

      I'm using IGV to visualize my data around that SNP (rs855581), and I see more SNPs mapped around it in hg19 than in hg38, but I'm not sure whether that means anything at all.

      My real problem here is that I don't know whether I can trust hg38 more than hg19 or vice versa.



      • #4
        hg38 is a corrected and improved version of hg19. You should use the newer and better assembly. You should also specify which version of hg38 you use. The latest version is GRCh38.p10 or in other words hg38 patch 10.



        • #5
          I was thinking about that when I decided to use both versions. Everybody uses hg19 because it's easier to work with (for exome data), but hg38 is an improved version of it.

          Do you suggest that I ignore all the results I obtained with hg19 and use only the ones from hg38?

          Thanks.



          • #6
            I agree that there are a lot of publicly available exome sequencing datasets that use hg19. I recommend using the results based on hg38 and converting them into some gene-based naming format, such as BRAF C296T. Then, you can compare mutations to existing analyses using a different genome more easily. ANNOVAR is a good way to do that.
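The comparison suggested above can be pictured as plain set operations once each call set is reduced to gene-level labels. This is only a minimal sketch in Python: `compare_call_sets` is a hypothetical helper (not part of ANNOVAR or SnpEff), and every label except BRAF C296T is a made-up placeholder.

```python
def compare_call_sets(hg19_labels, hg38_labels):
    """Split gene-level variant labels into shared and build-specific sets."""
    a, b = set(hg19_labels), set(hg38_labels)
    return {
        "shared": sorted(a & b),
        "hg19_only": sorted(a - b),
        "hg38_only": sorted(b - a),
    }


# Hypothetical labels; in practice they would come from the annotation step.
hg19_calls = ["BRAF C296T", "GENE1 A10G"]
hg38_calls = ["BRAF C296T", "GENE2 G55A"]

result = compare_call_sets(hg19_calls, hg38_calls)
for group, labels in result.items():
    print(group, labels)
```

Because the labels no longer contain genomic coordinates, no liftOver step is needed for this comparison; it does assume that both annotations use the same transcript for each gene.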



            • #7
              I am currently using SnpEff to do that, and I find it a very useful tool.
              The thing is that I'm getting more variants after filtering with hg38 than with hg19, but a lot of them appear because, for instance, what was A → G in hg19 is now G → A in hg38. This way, a lot of positions that were 0/0 are now 1/1. I mean, what used to be nothing is now a mutation. What should I trust?

              Thanks again.



              • #8
                Well, specifically on this issue, on the one hand I would say that 99.99% of the time (not always) this happens when a very common variant in hg19 becomes the reference allele in GRCh38, or vice versa.
                So when looking for disease-causing mutations, either the variant is no longer called at all, because it is so common that it entered the reference, or it is called as homozygous but you end up filtering it out for being very common. And when comparing genotypes between samples, you can still apply your algorithm as long as all samples were aligned to the same reference version. With this in mind, if you have to choose, it is better to align to the more current version.
                On the other hand, now you've tickled my curiosity about the 0.01% of cases where this is not so, and I'd like to check which positions in hg19 actually carried a rare allele that you wouldn't normally filter out of an analysis. And indeed, there are some. Amazing.
                Last edited by r.rosati; 04-26-2017, 06:48 AM.
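The 0/0 versus 1/1 situation discussed above can be checked by comparing the bases a sample carries instead of the genotype indices. The sketch below assumes biallelic sites on the same strand; the record layout and the helper names `is_ref_alt_swap` and `alleles_carried` are hypothetical, and the genotypes are illustrative values, not the poster's actual calls.

```python
def is_ref_alt_swap(rec_a, rec_b):
    """True when two biallelic records have REF and ALT exchanged."""
    return rec_a["ref"] == rec_b["alt"] and rec_a["alt"] == rec_b["ref"]


def alleles_carried(gt, ref, alt):
    """Translate a biallelic genotype such as '0/1' into the sorted bases it carries."""
    bases = {"0": ref, "1": alt}
    return sorted(bases.get(index, ".") for index in gt.replace("|", "/").split("/"))


# REF/ALT taken from the thread; the genotypes are made up for illustration.
hg19 = {"id": "rs4940595", "ref": "T", "alt": "G", "gt": "0/0"}
hg38 = {"id": "rs4940595", "ref": "G", "alt": "T", "gt": "1/1"}

swapped = is_ref_alt_swap(hg19, hg38)
same_bases = (
    alleles_carried(hg19["gt"], hg19["ref"], hg19["alt"])
    == alleles_carried(hg38["gt"], hg38["ref"], hg38["alt"])
)
print(hg19["id"], "reference swapped:", swapped, "| same bases in sample:", same_bases)
```

When both flags are true, the sample itself has not changed between builds; only the allele chosen as reference has.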



                • #9
                  I didn't know about that paper and it's, indeed, very interesting.

                  One explanation for the frequency problem is that I annotate frequencies using ExAC and, as you may know, this database is in hg19. I ran liftOver on the VCF file they provide in order to annotate my hg38 call set, but I lost some positions. The mutations at those positions now seem to be "new", so I keep them when filtering.

                  Even so, I still don't have any real clue why I'm obtaining more mutations with hg38. And, as I said in my first message, "even though with hg19 my lists are smaller, there are some variants that I don’t see with hg38. They are just gone."

                  Thank you again for your help.
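The liftOver side effect described in this post (positions without a lifted frequency record look "new" and survive filtering) can be made visible by keeping "no frequency data" separate from "rare". The following is a rough sketch only: the variant keys, the frequencies, the 1% cutoff and the helper name `frequency_status` are all assumptions chosen for illustration.

```python
def frequency_status(variant, population_af, max_af=0.01):
    """Label a variant as common, rare, or lacking any frequency record."""
    af = population_af.get(variant)
    if af is None:
        # Absence of a record is not evidence that the variant is rare.
        return "no_frequency_data"
    return "common" if af > max_af else "rare"


# Hypothetical lifted-over frequency table keyed by (chrom, pos, ref, alt).
population_af = {
    ("chr1", 1000, "A", "G"): 0.35,
    ("chr2", 2000, "C", "T"): 0.0002,
}

calls = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr3", 3000, "G", "A"),  # position lost during liftOver in this example
]

statuses = {variant: frequency_status(variant, population_af) for variant in calls}
for variant, status in statuses.items():
    print(variant, status)
```

Variants labelled `no_frequency_data` can then be reviewed separately, which may account for part of the surplus seen with hg38.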



                  • #10
                    Originally posted by ManoloS7
                    And, as I said in my first message, "even though with hg19 my lists are smaller, there are some variants that I don’t see with hg38. They are just gone."
                    I'm definitely not the most knowledgeable around here, but if you would like to provide an example, we could work it out: for instance, an IGV screenshot, using the GRCh38 reference, of a BAM file showing the location of a variant that was called with hg19 and is no longer called with GRCh38.
                    Last edited by r.rosati; 04-26-2017, 09:59 AM.



                    • #11
                      This is the IGV screenshot from a BAM file of a sample aligned to hg19:

                      [IGV screenshot not preserved]

                      And this one is the same sample aligned to hg38, in the same region (coordinates obtained from UCSC liftOver):

                      [IGV screenshot not preserved]



                      • #12
                        ...on the bright side, I should mention that we own an Ion Proton, so over time I have actually become very good at finding out why a variant call went wrong.
                        Last edited by r.rosati; 04-27-2017, 11:38 AM.



                        • #13
                          OK, I tried to post an IGV screenshot, but I think the message is being reviewed.

                          I saw that some reads that map to a (let's call it) regular chromosome in hg19 now map to a random chromosome in hg38, and that is why I'm losing them. I don't know why they don't also map to the regular one.



                          • #14
                            Originally posted by ManoloS7
                            I saw that some reads that map in hg19 to a (let's call it) regular chromosome, now in hg38 map to a random chromosome, and that is why I'm loosing (sic) them.
                            What do you mean by random chromosome? The reference assembly doesn't contain random sequences.



                            • #15
                              For example: On the one hand, I have a read that maps to chr9 in hg19 but, on the other hand, this same read maps to chr19_GL949750v2_alt in hg38.
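In UCSC-style hg38 naming, a contig ending in `_alt` (such as the one above) is an alternate locus scaffold, not a primary chromosome. A quick way to see how many reads are affected is to tally alignments by contig class. This is a minimal sketch: `contig_class` is a hypothetical helper, the class labels are my own, and the contig names other than chr9 and chr19_GL949750v2_alt are only examples of the naming pattern.

```python
import re
from collections import Counter

# Primary chromosomes in UCSC-style naming: chr1..chr22, chrX, chrY, chrM.
PRIMARY = re.compile(r"chr(\d+|X|Y|M)")


def contig_class(name):
    """Classify a UCSC-style contig name by its prefix or suffix."""
    if PRIMARY.fullmatch(name):
        return "primary"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"


# Contig of each aligned read (illustrative list, e.g. column 3 of a SAM file).
read_contigs = ["chr9", "chr19_GL949750v2_alt", "chr9", "chr1_KI270706v1_random"]

counts = Counter(contig_class(contig) for contig in read_contigs)
for label, n in sorted(counts.items()):
    print(label, n)
```

A large share of reads in the `alt` class around a missing variant would suggest checking whether the aligner and the reference FASTA were set up to handle alternate contigs.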
