SEQanswers

Old 05-11-2013, 01:12 AM   #1
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39
Tophat/Bowtie not using gene symbols from GTF file

I have supplied Tophat/Bowtie with a GTF file from Ensembl. However, instead of using the gene symbols (the "gene_name" attribute in the GTF file, for example "DDX11L1"), it seems to use the Ensembl gene IDs (the "gene_id" attribute, for example "ENSG00000223972").

How do I get Tophat/Bowtie to use the "gene_name" attribute instead? Is this possible?
Old 05-11-2013, 02:53 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

The output BAM file from tophat/bowtie uses purely genomic coordinates, so I'm not sure in what step you're seeing anything from the GTF file (unless you're looking at the transcriptome index files). Do you mean htseq-count?

BTW, the gene_name column isn't always unique, so you're often better off with gene ids (trivially convertible in R, which you're presumably using for downstream analysis).
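As a rough illustration of that conversion, here is a minimal base R sketch. It assumes a lookup table with one row per gene ID. Only the ENSG00000223972/DDX11L1 pair comes from this thread; every other ID and name is made up to show what a duplicated gene name looks like.

```r
# Toy lookup table: gene_id -> gene_name. Only the first pair is from the
# thread; the "TOY" entries are hypothetical and share a name on purpose.
id2name <- data.frame(
  gene_id   = c("ENSG00000223972", "ENSG_TOY_A", "ENSG_TOY_B"),
  gene_name = c("DDX11L1", "GENE1", "GENE1"),
  stringsAsFactors = FALSE
)

# Results keyed by gene ID, e.g. the output of a differential expression test.
results <- data.frame(
  gene_id = c("ENSG_TOY_B", "ENSG00000223972", "ENSG_TOY_MISSING"),
  stringsAsFactors = FALSE
)

# match() gives the row of id2name for each ID, or NA when there is none.
results$gene_name <- id2name$gene_name[match(results$gene_id, id2name$gene_id)]

# Gene names that occur under more than one gene ID.
dup_names <- unique(id2name$gene_name[duplicated(id2name$gene_name)])
```

Because the table stays keyed by gene ID, a duplicated name such as the hypothetical GENE1 simply appears on two rows instead of forcing a decision about merging.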
Old 05-11-2013, 10:48 AM   #3
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39

Thank you. It is indeed for downstream analysis that I need it (I just assumed it had to be tophat/bowtie that extracted information from the gtf, since I don't provide it to cuffdiff). However, my issue with the trivial conversion is precisely what you say about the gene_name column not being unique: if hypothetical features A and B have different gene ids but the same gene name, then how do I find the total expression value for that one gene? Do I sum all the entries within a sample with that gene name?
Old 05-11-2013, 11:19 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

In my experience, it's generally better to perform all analyses with gene IDs and then just add a gene name annotation at the end. If you're going to do pathway or GO analysis, you're going to need a gene ID (of some sort) rather than a gene name anyway, so you might as well stick to those.

Regarding simply summing counts: that can certainly work, depending on the exact nature of the question you're asking. It can also hide interesting changes, though I expect that's pretty unusual.
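For what summing by gene name could look like, here is a small sketch using base R's rowsum(). The matrix, the row IDs and the gene names are all invented for illustration.

```r
# Toy expression matrix: rows are gene IDs, columns are samples.
# matrix() fills by column, so sample1 = 10, 5, 1 and sample2 = 20, 0, 3.
expr <- matrix(
  c(10, 5, 1,
    20, 0, 3),
  nrow = 3,
  dimnames = list(c("id_A", "id_B", "id_C"), c("sample1", "sample2"))
)

# Hypothetical gene name for each row; id_A and id_B share a name.
gene_name <- c("GENE1", "GENE1", "GENE2")

# rowsum() adds up the rows that share a group label, column by column.
summed <- rowsum(expr, group = gene_name)
```

In this toy case the summed GENE1 row goes from 15 to 20, which hides that id_B dropped from 5 to 0 while id_A doubled. That is the kind of change summing can mask.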
Old 05-13-2013, 06:09 AM   #5
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39

What we are looking for is a list of genes that are differentially expressed between different organ metastases. The plan is to go on with biological validation of this gene list using knockout constructs in xenografts. I am unsure whether it would be more appropriate to stick to gene IDs or to use the summed gene symbols in this case.

I attempted the task in R, and found that 55000 of the 77000 UCSC gene IDs do not have a corresponding gene symbol. This seems very strange, doesn't it?
Old 05-13-2013, 06:15 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

That seems rather odd. You might post the commands you used for the conversion and a couple of examples of non-converting IDs.
Old 05-13-2013, 06:20 AM   #7
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39

# Load libraries and files
library(cummeRbund)
cuff <- readCufflinks()
gene.individual <- fpkmMatrix(genes(cuff))
annotation <- read.table("/data/reference/annotation_ucsc-id_gene-symbol2.txt",header=TRUE)
names(annotation) <- c("kgID","geneSymbol")

# Creating error dumps
error.morethanone <- NULL
error.fewerthanone <- NULL

# Add column to gene.individual with new annotation
gene.individual$geneSymbol <- NA
for (id in row.names(gene.individual)) {
  x <- length(annotation$geneSymbol[annotation$kgID==id])

  if(x<1) error.fewerthanone <- c(error.fewerthanone,id)
  if(x>1) error.morethanone <- c(error.morethanone,id)
  if(x==1) gene.individual$geneSymbol[row.names(gene.individual)==id] <- as.character(annotation$geneSymbol[annotation$kgID==id])
}
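
The same lookup can probably be written without the loop, which tends to be much faster on tens of thousands of IDs. The sketch below uses toy stand-ins for gene.individual and annotation so that it runs on its own. Only uc001aab.3 is an ID from this thread; the other IDs, symbols and values are made up.

```r
# Toy stand-ins for the objects above (hypothetical values).
gene.individual <- data.frame(
  s1 = c(1.5, 0, 7.2),
  row.names = c("uc_toy_1", "uc001aab.3", "uc_toy_2")
)
annotation <- data.frame(
  kgID       = c("uc_toy_1", "uc_toy_2", "uc_toy_2"),
  geneSymbol = c("SYM1", "SYM2", "SYM2b"),
  stringsAsFactors = FALSE
)

ids <- row.names(gene.individual)

# Number of annotation rows that each ID hits (0, 1, or more), in row order.
hits <- as.vector(table(factor(annotation$kgID, levels = ids)))

error.fewerthanone <- ids[hits < 1]
error.morethanone  <- ids[hits > 1]

# Fill in the symbol only where there is exactly one match, as in the loop.
gene.individual$geneSymbol <- NA_character_
one <- hits == 1
gene.individual$geneSymbol[one] <-
  annotation$geneSymbol[match(ids[one], annotation$kgID)]
```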
Old 05-13-2013, 06:20 AM   #8
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39

> head(error.fewerthanone)
[1] "uc001aab.3" "uc001aac.3" "uc001aae.3" "uc001aah.3" "uc001aak.2"
[6] "uc001aam.3"
Old 05-13-2013, 08:58 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Ah, UCSC gene IDs, those will always give you headaches. It looks like you are mixing multiple versions of the knownGene database. In the most recent one, uc001aab.3 and uc001aah.3 (as an example) are merged together into uc001aah.4, which probably exists in your UCSC ID to gene name table. You might just download kg5ToKg6.txt.gz from UCSC and use it to update one of your annotation files (or just switch to Ensembl; their annotations have given me fewer headaches).
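A sketch of what such an update might look like in R, assuming the conversion table has been reduced to two columns. The names oldId and newId are hypothetical here, so check the actual layout of the downloaded file. Only the uc001aab.3/uc001aah.3 to uc001aah.4 merge is taken from the example above; the third ID is made up.

```r
# Toy old -> new ID table (column names are assumptions).
kg_map <- data.frame(
  oldId = c("uc001aab.3", "uc001aah.3"),
  newId = c("uc001aah.4", "uc001aah.4"),
  stringsAsFactors = FALSE
)

# IDs as they appear in the expression table; the last one is hypothetical
# and has no entry in the mapping.
old_ids <- c("uc001aab.3", "uc001aah.3", "uc_toy_1")

# Look up the new ID and keep the old one when the table has no entry.
idx <- match(old_ids, kg_map$oldId)
new_ids <- ifelse(is.na(idx), old_ids, kg_map$newId[idx])

# Merged entries now share an ID, so they need to be combined or flagged
# before the IDs are used as row names again.
merged <- unique(new_ids[duplicated(new_ids)])
```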
Old 05-13-2013, 11:07 AM   #10
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39

Ah! I just recently ran the pipeline with the Ensembl files, so I will try that approach and see whether it goes better. Thanks for the tip.

Would you recommend using gene IDs or switching to summed gene symbols with this kind of research question? My assumption was that since we are not interested in any information at the level of individual isoforms, it would be better to sum them together.
Old 05-13-2013, 11:17 AM   #11
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

A gene ID and a transcript ID are normally different things: each gene has one gene ID but can have several transcript IDs, and two genes that share a name still have different gene IDs. You will see this if you download the human GTF annotation from Ensembl. Since you don't care about particular transcripts, just use the gene ID.
Old 05-13-2013, 11:39 AM   #12
cnyh
Member
 
Location: Norway

Join Date: Feb 2013
Posts: 39

I am using the Ensembl GTF annotation, yes. I'm seeing four things in my GTF:
- gene_id
- gene_name
- transcript_id
- transcript_name

My question was primarily about how to end up with the gene_name instead of the gene_id.
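
One way to get a gene_id to gene_name table out of those attributes is sketched below in base R. The two attribute strings are made up in the Ensembl style: the gene ID and name are the ones from the first post, the transcript values are hypothetical. In a real GTF the attribute string is the ninth tab-separated column, and get_attr is just a helper name invented for this example.

```r
# Two toy GTF attribute strings (transcript values are hypothetical).
attrs <- c(
  'gene_id "ENSG00000223972"; transcript_id "ENST_TOY_1"; gene_name "DDX11L1"; transcript_name "DDX11L1-TOY1";',
  'gene_id "ENSG00000223972"; transcript_id "ENST_TOY_2"; gene_name "DDX11L1"; transcript_name "DDX11L1-TOY2";'
)

# Pull the quoted value that follows a given key; NA if the key is absent.
get_attr <- function(x, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

# One row per distinct gene_id / gene_name pair.
id2name <- unique(data.frame(
  gene_id   = get_attr(attrs, "gene_id"),
  gene_name = get_attr(attrs, "gene_name"),
  stringsAsFactors = FALSE
))
```

The resulting table can then be joined onto results that are keyed by gene_id, so the names are only attached at the very end.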