
  • Question about exon limits and annotation..

    Let's say I have processed raw reads from a tumor-normal paired exome experiment and made them fit for mutation calling. I have two BAM files that I feed into a mutation caller, and since it's an exome experiment, I restrict the variant calls to exons + 10 bases by generating a .bed file of refGene exons from the UCSC Table Browser.

    Now, theoretically all the mutation calls made by the caller are exonic or splicing.

    But when I run these calls through annotation software and annotate them against a refgene set (tried both snpEff and Annovar (with Annovar I used the default hg19 set)), only approximately 65%-80% of the calls are exonic or splicing. The rest are annotated as intronic, upstream, downstream and a zillion other things..

    I have been trying to think of an explanation as to why, but I just can't.

    Has anybody here noticed this before? Is there an explanation as to why this is happening?

    Thank you.

    Shyam.

    PS: It's not a problem with the mutation caller either; I have tried 2 of them..
    Last edited by shyam_la; 07-25-2012, 03:03 PM.
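Editor's sketch of the "exons + 10 bases" step described above, assuming the input is a tab-separated line in the UCSC refGene table layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...). The function name and the example row are hypothetical, and this is not necessarily how the original .bed file was produced. Note that every exon is padded, including exons that are entirely UTR.

```python
def refgene_row_to_padded_bed(row, pad=10):
    """Turn one refGene-style line into BED6 tuples, padding each exon by `pad` bases."""
    fields = row.rstrip("\n").split("\t")
    name, chrom, strand = fields[1], fields[2], fields[3]
    # exonStarts/exonEnds are comma-separated lists with a trailing comma
    starts = [int(x) for x in fields[9].strip(",").split(",")]
    ends = [int(x) for x in fields[10].strip(",").split(",")]
    # refGene coordinates are 0-based with exclusive ends, the same convention as BED
    return [(chrom, max(0, s - pad), e + pad, name, 0, strand)
            for s, e in zip(starts, ends)]


# Hypothetical two-exon transcript, for illustration only
row = ("0\tNM_HYPOTHETICAL\tchr1\t+\t1000\t5000\t1200\t4800\t2\t"
       "1000,4000,\t2000,5000,\t0\tGENE1\tcmpl\tcmpl\t0,0,")

for interval in refgene_row_to_padded_bed(row):
    print("\t".join(str(v) for v in interval))
```

Because BED starts are 0-based while the positions reported by most callers are 1-based, an off-by-one at the interval edges is easy to introduce when comparing the two.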

  • #2
    I'm wondering if this is the product of alternative splicing in your annotation/sequencing set. For example, maybe an alternative 3' terminal exon could lead to something being called as upstream? Or a skipped exon could lead to an intronic call?

    I'd overlay your SNP calls with the annotation in something like IGV and see if you can visualize what might be the reason.



    • #3
      Thanks for responding.
      That doesn't make sense to me really..
      Everything is already aligned to the reference genome. Base 12345678 is going to be intronic or exonic, and hence mutation at base 12345678 is going to be either intronic or exonic respectively, irrespective of how different isoforms are spliced. Isn't it??
      Alternative splicing will affect which exons are there in the protein, but can't affect where exactly a particular aligned base position falls in the genome structure, right?
      Could I be doing something wrong post-mutation calling that is leading to this effect?



      • #4
        You're right that whether a mutation at a specific base is exonic or intronic should not depend on the isoform, assuming everything is working as you think it is and being treated consistently. However, what I'm wondering is whether the annotation files and the sequencing all had the same isoforms annotated, or even whether the programs are handling these annotation files equivalently. That's why I'd say just go look at it in IGV. If you can visually see nothing but exon/splicing SNPs, you'll know it's a problem with how these programs are calling SNPs relative to the annotation files. If instead you see SNPs outside the exon/splicing regions, then you know something is wrong with your initial screen.
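Editor's sketch: the "initial screen" can also be checked without a viewer by testing each call against the target BED intervals directly. This assumes BED coordinates are 0-based and half-open while call positions are 1-based; the function names and the example data are hypothetical.

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_target(regions, chrom, pos):
    """True if the 1-based position `pos` lies inside any interval on `chrom`."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))


# Hypothetical target intervals and calls, for illustration only
bed = ["chr1\t990\t2010", "chr1\t3990\t5010"]
calls = [("chr1", 1500), ("chr1", 3000), ("chr1", 991), ("chr1", 990)]

regions = load_bed(bed)
outside = [call for call in calls if not in_target(regions, *call)]
print(outside)
```

If this list is empty for the real data, the calls respect the BED file and the discrepancy is on the annotation side; if it is not, the filtering step itself needs a second look. The linear scan is fine for spot checks but slow for whole exomes.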



        • #5
          Sounds like an idea! Will update asap..
          Thanks.
          One question: Can IGV visualise just a list of chr and base positions? I have one column with chromosomes and one column with base positions on the chromosome (I have 2 more columns with reference allele and observed allele, but those are irrelevant for the purposes of our discussion)..



          • #6
            I'm not sure if IGV could load what you want. Here's a list of the supported file formats: http://www.broadinstitute.org/software/igv/FileFormats

            If you can convert your data to VCF, that should work. You might have to do some file format manipulation to get it working. Or maybe find a more flexible viewer.
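Editor's sketch of that conversion, assuming the four columns are chromosome, 1-based position, reference allele and observed allele, and that all calls are single-base substitutions (indels written with "-" would need extra handling to be valid VCF). The function name is hypothetical; the header and the eight fixed columns follow the VCF layout.

```python
def to_minimal_vcf(rows):
    """Build a minimal VCF text from (chrom, pos, ref, alt) tuples."""
    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome name and position so records are grouped and ordered
    for chrom, pos, ref, alt in sorted(rows, key=lambda r: (r[0], int(r[1]))):
        # ID, QUAL, FILTER and INFO are unknown here, so use the missing value "."
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# Hypothetical calls, for illustration only
rows = [("chr2", 200, "A", "G"), ("chr1", 12345678, "C", "T")]
print(to_minimal_vcf(rows), end="")
```

Writing the returned text to a file with a .vcf extension gives something a viewer can try to load; the chromosome names still have to match the genome selected in the viewer.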



            • #7
              Well, I just picked 10 random spots from the list and tested them individually.. Half the time the annotation is correct (comparing to IGV), half the time it's not..

              I can't draw any conclusions yet..

              What I am thinking is the UCSC refgene bed file, the refgene set of annovar and the refgene set used by IGV are all different. Is that possible?



              • #8
                It is certainly possible, especially when it comes to chromosome naming schemes. You should try to standardize on one set. Which can sound easy, but often isn't.
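Editor's sketch of one small piece of that standardization: mapping chromosome names to a single scheme before comparing files. It assumes the only differences are a missing or differently cased "chr" prefix and "MT" versus "M" for the mitochondrion; the function name is hypothetical, and renaming does not reconcile differences in the underlying sequences or gene models.

```python
def to_ucsc_name(chrom):
    """Map names like '1', 'Chr1', 'X' or 'MT' to UCSC-style 'chr1', 'chrX', 'chrM'."""
    name = chrom.strip()
    if name.lower().startswith("chr"):
        name = name[3:]  # drop an existing prefix, whatever its case
    if name in ("M", "MT"):
        return "chrM"
    return "chr" + name


for raw in ["1", "chr1", "Chr2", "X", "MT"]:
    print(raw, "->", to_ucsc_name(raw))
```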



                • #9
                  I tested a few more loci. Annovar and IGV compare well on intron vs exon, but not so well when it's UTR5/UTR3 vs exon.. Some UTR-annotated sites fall within IGV exons..

                  I don't think standardization of the kind you are talking about is even possible. Only way to do it is if I somehow get a bed file that is exactly the same as the annovar annotation set or conversely get annovar to somehow make use of the bed file for its annotation set.. Any experience doing that?



                  • #10
                    Hmm, sounds like it's just the additional layer of information that is causing the confusion (exons can be coding or UTR, but IGV stops at the exon level). Without knowing more about the file formats you're using, it's hard for me to say what is best. Are the annotation files just GTF/GFF3s that need converting to BED? If so, that's pretty straightforward using a number of tools you could google (genome annotation programs often have these converters as part of their source code).

                    If you can give me more information and maybe the first 10 lines of the files you're using I could try to make some sense of it.
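Editor's sketch of the exon versus coding/UTR distinction, assuming refGene-style transcript records (0-based, half-open coordinates, with cdsStart equal to cdsEnd for non-coding transcripts). The function, the labels and both transcripts are hypothetical, and this does not reproduce the exact rules of any particular annotation tool.

```python
def classify(pos, tx):
    """Classify a 1-based position against a single transcript record."""
    p = pos - 1  # convert to the 0-based coordinates used by the record
    if not (tx["txStart"] <= p < tx["txEnd"]):
        return "outside"
    for start, end in tx["exons"]:
        if start <= p < end:
            if tx["cdsStart"] == tx["cdsEnd"]:
                return "noncoding exon"
            if tx["cdsStart"] <= p < tx["cdsEnd"]:
                return "coding exon"
            return "UTR exon"  # inside an exon but outside the coding region
    return "intron"


# Two hypothetical isoforms of one gene; tx_b has a shorter first exon
tx_a = {"txStart": 1000, "txEnd": 5000, "cdsStart": 1200, "cdsEnd": 4800,
        "exons": [(1000, 2000), (4000, 5000)]}
tx_b = {"txStart": 1000, "txEnd": 5000, "cdsStart": 1200, "cdsEnd": 4800,
        "exons": [(1000, 1500), (4000, 5000)]}

for pos in (1101, 1801, 3001):
    print(pos, classify(pos, tx_a), "|", classify(pos, tx_b))
```

Position 1101 sits inside an exon in both isoforms yet is UTR rather than coding, and position 1801 is coding in one isoform and intronic in the other. A BED file built from all exons covers both positions, so a tool that reports UTR separately from exonic, or that picks a different transcript, can label them differently.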



                    • #11
                      They are actually in standard UCSC refgene txt format. If you go to the table browser and export all fields in the selected table as plain text, that's the kind of file. But the file that annovar uses is different (that's the working explanation now) from the file on the UCSC browser currently..
                      I think I have figured a way out already. Involves getting the file from the browser right now, replacing the original annovar file and using "retrieve_seq_from_fasta.pl" that comes with annovar. Will update if that solves issue..



                      • #12
                        Script doesn't work really. Still stuck with no solution in sight..



                        • #13
                          Is it a strand issue? I've made that mistake before......



                          • #14
                            Can you please explain?



                            • #15
                              Sorry, first let me start off by saying it's possible I've totally misunderstood your issue.
                              Second, if you fail to deal with the strand of your features (i.e. is the gene on the positive or negative strand; or, in other words, Watson or Crick), you can screw up how you map the coordinates back to your data.
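Editor's sketch of how strand changes the labels, assuming refGene-style coordinates that are always given on the forward strand regardless of the gene's orientation. The function names are hypothetical, and the caller is assumed to have already established that the position is exonic and non-coding (for `utr_label`) or outside the transcript (for `flank_label`).

```python
def utr_label(pos0, cds_start, cds_end, strand):
    """Label an exonic, non-coding 0-based position as UTR5 or UTR3."""
    before_cds = pos0 < cds_start  # lower genomic coordinate than the coding region
    if strand == "+":
        return "UTR5" if before_cds else "UTR3"
    # On the minus strand the transcript runs from high to low coordinates
    return "UTR3" if before_cds else "UTR5"


def flank_label(pos0, tx_start, tx_end, strand):
    """Label a 0-based position outside the transcript as upstream or downstream."""
    left_of_gene = pos0 < tx_start
    if strand == "+":
        return "upstream" if left_of_gene else "downstream"
    return "downstream" if left_of_gene else "upstream"


# Same hypothetical coordinates, opposite strands
print(utr_label(1100, 1200, 4800, "+"), utr_label(1100, 1200, 4800, "-"))
print(flank_label(500, 1000, 5000, "+"), flank_label(500, 1000, 5000, "-"))
```

Ignoring strand swaps UTR5 with UTR3 and upstream with downstream for minus-strand genes, but it does not by itself turn an exonic position into an intronic one.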

