SEQanswers

Old 07-25-2012, 02:38 PM   #1
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Exomes: Question about exon limits and annotation

Let's say I have processed raw reads from a tumor-normal paired exome experiment and made them fit for mutation calling. I have two BAM files that I feed into a mutation caller, and since it's an exome experiment, I restrict the variant calls to exons plus 10 flanking bases by generating a .bed file from the refGene table in the UCSC Table Browser.
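A minimal sketch of the padding step described above, assuming UCSC refGene-style exon coordinates (0-based starts, exclusive ends, comma-separated with a trailing comma). The function names `padded_exon_bed` and `in_bed` are hypothetical, and the check treats variant positions as 1-based, as in a VCF; mixing the two conventions shifts calls by one base at interval edges.

```python
def padded_exon_bed(chrom, exon_starts, exon_ends, pad=10):
    """Return BED-style (chrom, start, end) tuples for exons padded by `pad` bases.

    exon_starts / exon_ends are comma-separated strings as in a refGene row
    (0-based starts, exclusive ends, possibly with a trailing comma).
    """
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    return [(chrom, max(0, s - pad), e + pad) for s, e in zip(starts, ends)]


def in_bed(pos_1based, intervals):
    """True if a 1-based position falls inside any 0-based half-open interval."""
    return any(start < pos_1based <= end for _, start, end in intervals)


# Hypothetical two-exon transcript, for illustration only.
bed = padded_exon_bed("chr1", "100,300,", "200,400,")
```

With these made-up exons, 1-based position 91 is the first base inside the first padded interval and 210 is the last.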

Now, theoretically all the mutation calls made by the caller are exonic or splicing.

But when I run these calls through annotation software and annotate them against a refGene set (I tried both snpEff and Annovar; with Annovar I used the default hg19 set), only approximately 65%-80% of the calls are exonic or splicing. The rest are annotated as intronic, upstream, downstream and a zillion other things..

I have been trying to think of an explanation as to why, but I just can't.

Has anybody here noticed this before? Is there an explanation as to why this is happening?

Thank you.

Shyam.

PS: It's not a problem with the mutation caller either; I have tried 2 of them..

Last edited by shyam_la; 07-25-2012 at 03:03 PM.
Old 07-25-2012, 03:46 PM   #2
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

I'm wondering if this is the product of alternative splicing in your annotation/sequencing set. For example, maybe an alternative 3' terminal exon could lead to something being called as upstream? Or a skipped exon could lead to an intronic call?

I'd overlay your SNP calls with the annotation in something like IGV and see if you can visualize what might be the reason.
Old 07-25-2012, 03:53 PM   #3
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

Thanks for responding.
That doesn't make sense to me really..
Everything is already aligned to the reference genome. Base 12345678 is going to be intronic or exonic, and hence mutation at base 12345678 is going to be either intronic or exonic respectively, irrespective of how different isoforms are spliced. Isn't it??
Alternative splicing will affect which exons are there in the protein, but can't affect where exactly a particular aligned base position falls in the genome structure, right?
Could I be doing something wrong post-mutation calling that is leading to this effect?
Old 07-25-2012, 04:33 PM   #4
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

You're right that whether a mutation at a specific base falls in an exon or an intron should not depend on the isoform, assuming everything is working as you think it is and being treated consistently. However, what I'm wondering is whether the annotation files and the sequencing all had the same isoforms annotated, or even whether the programs are handling these annotation files equivalently. That's why I'd say just go look at it in IGV. If you see nothing but exon/splicing SNPs, you'll know it's a problem with how these programs are calling SNPs relative to the annotation files. If you do see SNPs outside the exon/splicing regions, then you know something is wrong with your initial screen.
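To illustrate why the set of annotated isoforms matters, here is a small sketch that labels one genomic position against each transcript separately. The transcripts and coordinates are made up, and `classify` is a hypothetical helper; it does not reproduce how snpEff or Annovar actually pick a label when several transcripts overlap.

```python
def classify(pos, exons):
    """Label a 0-based position against one transcript's exon list.

    exons: sorted list of (start, end) half-open intervals for one transcript.
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]
    if pos < tx_start or pos >= tx_end:
        return "outside"
    if any(s <= pos < e for s, e in exons):
        return "exonic"
    return "intronic"


# Two hypothetical isoforms of the same gene.
transcripts = {
    "isoform_A": [(100, 200), (300, 400), (500, 600)],
    "isoform_B": [(100, 200), (500, 600)],  # skips the middle exon
}
labels = {name: classify(350, exons) for name, exons in transcripts.items()}
```

The same base is exonic in one isoform and intronic in the other, so a .bed file built from one transcript set and an annotator using a different set can disagree about it.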
Old 07-25-2012, 04:41 PM   #5
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

Sounds like an idea! Will update asap..
Thanks.
One question: Can IGV visualise just a list of chr and base positions? I have one column with chromosomes and one column with base positions on the chromosome (I have 2 more columns with reference allele and observed allele, but those are irrelevant for the purposes of our discussion)..
Old 07-25-2012, 04:49 PM   #6
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

I'm not sure if IGV could load what you want. Here's a list of the supported file formats: http://www.broadinstitute.org/software/igv/FileFormats

If you can convert your data to VCF, that might work. You might have to do some file format manipulation to get it working. Or maybe find a more flexible viewer.
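A rough sketch of that conversion, assuming the four columns mentioned earlier in the thread (chromosome, position, reference allele, observed allele) and assuming the positions are already 1-based. The function name `to_minimal_vcf` is hypothetical, and the output carries only the eight fixed VCF columns with missing values as ".".

```python
def to_minimal_vcf(rows):
    """Build minimal VCF text from (chrom, pos, ref, alt) tuples.

    pos is assumed to be 1-based, as VCF requires.
    """
    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for chrom, pos, ref, alt in rows:
        # ID, QUAL, FILTER and INFO are unknown here, so they are left as "."
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# Made-up example row.
vcf_text = to_minimal_vcf([("chr1", 12345678, "A", "G")])
```

Viewers generally expect records sorted by chromosome and position, so the rows may need sorting before they are written to a file.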
Old 07-25-2012, 04:59 PM   #7
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

Well, I just picked 10 random spots from the list and tested them individually.. Half the time the annotation is correct (comparing to IGV), half the time it's not..

I can't draw any conclusions yet..

What I am thinking is that the UCSC refGene bed file, the refGene set of Annovar and the refGene set used by IGV are all different. Is that possible?
Old 07-25-2012, 05:03 PM   #8
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

It is certainly possible, especially when it comes to chromosome naming schemes. You should try to standardize on one set, which can sound easy but often isn't.
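One small piece of that standardization is the chromosome naming scheme. The sketch below maps names onto a "chr"-prefixed form before files are compared; `normalize_chrom` is a hypothetical helper, and the handling of the mitochondrial name ("MT" or "M" mapped to "chrM") is an assumption about the files involved.

```python
def normalize_chrom(name):
    """Map '1', 'chr1', 'MT', 'chrM' style names onto a 'chr'-prefixed form."""
    name = name.strip()
    if name.lower().startswith("chr"):
        name = name[3:]          # drop an existing prefix, whatever its case
    if name in ("MT", "M"):
        return "chrM"            # assumed mitochondrial naming
    return "chr" + name
```

Applying the same function to the .bed file, the variant list and the annotation table at least rules out silent mismatches between "1" and "chr1".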
Old 07-25-2012, 05:17 PM   #9
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

I tested a few more loci. Annovar and IGV compare well on intron vs exon, but not so well when it's UTR5/UTR3 vs exon.. Some UTR-annotated sites fall within IGV exons..

I don't think standardization of the kind you are talking about is even possible. The only way to do it is if I somehow get a bed file that is exactly the same as the Annovar annotation set, or conversely get Annovar to somehow make use of the bed file for its annotation set.. Any experience doing that?
Old 07-25-2012, 05:24 PM   #10
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Hmm, sounds like it's just the additional layer of information that is causing the confusion (exons can be coding or UTR, but IGV stops at the exon level). Without knowing more about the file formats you're using, it's hard for me to say what is best. Are the annotation files just gtf/gff3s that need converting to bed? If so, that's pretty straightforward using a number of tools you could google (genome annotation programs often have these converters as part of their source code).
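The exon-versus-UTR layer can be sketched as follows, assuming a protein-coding transcript with refGene-style half-open `cdsStart`/`cdsEnd` coordinates and a position already known to lie in an exon. `exon_region` is a hypothetical helper; note that the 5' and 3' labels swap on the minus strand.

```python
def exon_region(pos, strand, cds_start, cds_end):
    """Label a 0-based exonic position as 'UTR5', 'coding', or 'UTR3'.

    cds_start / cds_end are half-open genomic coordinates of the coding
    region; strand is '+' or '-'. Assumes a protein-coding transcript.
    """
    if cds_start <= pos < cds_end:
        return "coding"
    before_cds = pos < cds_start
    if strand == "+":
        return "UTR5" if before_cds else "UTR3"
    # On the minus strand, lower genomic coordinates are the 3' end.
    return "UTR3" if before_cds else "UTR5"
```

A site that a viewer shows inside an exon box can therefore still be reported as UTR5 or UTR3 by an annotator that distinguishes coding from untranslated exon sequence.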

If you can give me more information and maybe the first 10 lines of the files you're using I could try to make some sense of it.
Old 07-25-2012, 05:37 PM   #11
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

They are actually in standard UCSC refGene txt format. If you go to the Table Browser and export all fields in the selected table as plain text, that's the kind of file you get. But the file that Annovar uses is different (that's the working explanation now) from the file currently on the UCSC browser..
I think I have figured a way out already. It involves getting the current file from the browser, replacing the original Annovar file and using "retrieve_seq_from_fasta.pl" that comes with Annovar. Will update if that solves the issue..
Old 07-25-2012, 06:37 PM   #12
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

The script doesn't really work. Still stuck with no solution in sight..
Old 07-26-2012, 01:12 PM   #13
thedavid
Junior Member
 
Location: MD

Join Date: Jul 2011
Posts: 9
Default

Is it a strand issue? I've made that mistake before......
Old 07-26-2012, 01:16 PM   #14
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

Can you please explain?
Old 07-26-2012, 01:25 PM   #15
thedavid
Junior Member
 
Location: MD

Join Date: Jul 2011
Posts: 9
Default

Sorry, first let me start off by saying it's possible I've totally misunderstood your issue.
Second, if you fail to deal with the strand of your features (i.e. is the gene on the positive or negative strand; or, in other words, Watson or Crick) you can screw up how you map the coordinates back to your data.
Old 07-26-2012, 03:04 PM   #16
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

But is that relevant in this context??
Assuming you have now understood my question..
Old 07-26-2012, 03:09 PM   #17
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

It certainly could be. But your numbers don't seem to make sense if you're assigning random strandedness to the annotation. The SNP obviously is not strand-specific.

Unfortunately, this just looks like a file format issue that will require some troubleshooting.