**dpryan** · 05-11-2013, 01:53 AM

The output BAM file from tophat/bowtie uses purely genomic coordinates, so I'm not sure in what step you're seeing anything from the GTF file (unless you're looking at the transcriptome index files). Do you mean htseq-count?

BTW, the gene_name column isn't always unique, so you're often better off with gene ids (trivially convertible in R, which you're presumably using for downstream analysis).

**cnyh** · 05-11-2013, 09:48 AM

Thank you. It is indeed for downstream analysis that I need it (I just assumed it had to be tophat/bowtie that extracted information from the gtf, since I don't provide it to cuffdiff). However, my issue with the trivial conversion is precisely what you say about the gene_name column not being unique: if hypothetical features A and B have different gene ids but the same gene name, then how do I find the total expression value for that one gene? Do I sum all the entries within a sample with that gene name?

**dpryan** · 05-11-2013, 10:19 AM

In my experience, it's generally better to perform all analyses with gene IDs and then just add a gene name annotation at the end. If you're going to do pathway or GO analysis, you're going to need a gene ID (of some sort) rather than a gene name anyway, so you might as well stick to those.
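As a rough illustration of "analyze by gene ID, annotate at the end", here is a minimal sketch in base R. The data frames, IDs, names, and values are all made up for the example; only `match()` and `data.frame()` are real functions.

```r
# Hypothetical per-gene results keyed by gene ID (values are invented)
results <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  log2fc  = c(1.5, -0.7, 2.1),
  stringsAsFactors = FALSE
)

# Hypothetical ID-to-name table; two different IDs share the name "GENE1"
annotation <- data.frame(
  gene_id   = c("ENSG_A", "ENSG_B", "ENSG_C"),
  gene_name = c("GENE1", "GENE1", "GENE2"),
  stringsAsFactors = FALSE
)

# match() gives, for each result row, the position of its ID in the
# annotation (NA if absent), so row order and row count are preserved
results$gene_name <- annotation$gene_name[match(results$gene_id, annotation$gene_id)]

print(results)
```

Because the name is only attached as a label, the two IDs that share a name stay as separate rows and nothing is merged by accident.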

Regarding simply summing counts: that can certainly work, depending on the exact nature of the question you're asking. It can also hide interesting changes, though I expect that's pretty unusual.
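If you do want to sum entries that share a gene name, one way to sketch it in base R is with `rowsum()`. The matrix, the IDs, and the name mapping below are invented for illustration and are not taken from the actual data.

```r
# Hypothetical expression matrix: rows are gene IDs, columns are samples
expr <- matrix(
  c(10, 5, 3,    # sample1 values for id_A, id_B, id_C
    20, 0, 7),   # sample2 values for id_A, id_B, id_C
  nrow = 3,
  dimnames = list(c("id_A", "id_B", "id_C"), c("sample1", "sample2"))
)

# Hypothetical mapping from ID to gene name; id_A and id_B share a name
gene_name <- c(id_A = "GENE1", id_B = "GENE1", id_C = "GENE2")

# rowsum() adds up, within each sample, all rows that fall in the same group
summed <- rowsum(expr, group = unname(gene_name[rownames(expr)]))

print(summed)
```

In this toy input, id_B drops to zero in sample2 while the summed GENE1 value still rises, which is the sort of change that summing can mask.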

**cnyh** · 05-13-2013, 05:09 AM

What we are looking for is a list of genes that are differentially expressed between different organ metastases. The plan is to go on with biological validation of this gene list using knockout constructs in xenografts. I am unsure as to whether it would be more applicable to stick to gene IDs or the summed gene symbols in this case?

I attempted the task in R, and found that 55000 of the 77000 UCSC gene IDs do not have a corresponding gene symbol. This seems very strange, doesn't it?

**dpryan** · 05-13-2013, 05:15 AM

That seems rather odd. You might post the commands you used for the conversion and a couple examples of non-converting IDs.

**cnyh** · 05-13-2013, 05:20 AM

```r
# Load libraries and files
library(cummeRbund)
cuff <- readCufflinks()
gene.individual <- fpkmMatrix(genes(cuff))
annotation <- read.table("/data/reference/annotation_ucsc-id_gene-symbol2.txt", header = TRUE)
names(annotation) <- c("kgID", "geneSymbol")

# Creating error dumps
error.morethanone <- NULL
error.fewerthanone <- NULL

# Add column to gene.individual with new annotation
gene.individual$geneSymbol <- NA
for (id in row.names(gene.individual)) {
  x <- length(annotation$geneSymbol[annotation$kgID == id])

  if (x < 1) error.fewerthanone <- c(error.fewerthanone, id)
  if (x > 1) error.morethanone <- c(error.morethanone, id)
  if (x == 1) gene.individual$geneSymbol[row.names(gene.individual) == id] <- as.character(annotation$geneSymbol[annotation$kgID == id])
}
```

**cnyh** · 05-13-2013, 05:20 AM

```
> head(error.fewerthanone)
[1] "uc001aab.3" "uc001aac.3" "uc001aae.3" "uc001aah.3" "uc001aak.2"
[6] "uc001aam.3"
```

**dpryan** · 05-13-2013, 07:58 AM

Ah, UCSC gene IDs, those will always give you headaches. It looks like you are mixing multiple versions of the knownGene database. In the most recent one, uc001aab.3 and uc001aah.3 (as an example) are merged together into uc001aah.4, which probably exists in your ucsc id to gene name table. You might just download kg5ToKg6.txt.gz from UCSC and use it to update one of your annotation files (or just switch to Ensembl, their annotations have given me fewer headaches).
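One quick way to check whether a version mismatch is the problem is to compare the IDs with and without their trailing version number. This is only a diagnostic sketch: the first vector reuses IDs from the error dump above, while the contents of `annot_ids` are an assumption about what a newer annotation table might hold, and `strip_version` is a helper made up for this example.

```r
# IDs reported by the quantification step (from the error dump above)
expr_ids <- c("uc001aab.3", "uc001aah.3", "uc001aak.2")

# Hypothetical IDs in the annotation table (assumed to be a newer release)
annot_ids <- c("uc001aah.4", "uc001aak.2")

# Drop the trailing ".<version>" to get the base accession
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

exact_match <- expr_ids %in% annot_ids
base_match  <- strip_version(expr_ids) %in% strip_version(annot_ids)

# IDs that match only after stripping the version point to mixed releases
report <- data.frame(id = expr_ids, exact_match, base_match)
print(report)
```

Matching on the base accession is not a conversion in itself. An ID that was merged into another one (like uc001aab.3 here) still fails both checks, so it would still need the UCSC conversion table.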

**cnyh** · 05-13-2013, 10:07 AM

Ah! I just recently ran the pipeline with the Ensembl files, so I will try that approach and see whether it goes better. Thanks for the tip.

Would you recommend using gene IDs or switching to summed gene symbols with this kind of research question? My assumption was that since we are not interested in any information at the level of individual isoforms, it would be better to sum them together.

**dpryan** · 05-13-2013, 10:17 AM

There's normally a difference between gene id and transcript id, with genes (sometimes with the same name, but then different gene ids) having multiple transcript IDs. If you download the human GTF annotation from Ensembl, you will find this to be the case. In that case, just use the gene id, since you don't care about particular transcripts.

**cnyh** · 05-13-2013, 10:39 AM

I am using the Ensembl GTF annotation, yes. I'm seeing four things in my GTF:
- gene_id
- gene_name
- transcript_id
- transcript_name

My question was primarily about ending up with the gene_name instead of the gene_id?
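For getting from gene_id to gene_name, one possible approach is to build a lookup table from the attribute column of the GTF itself. The sketch below uses three made-up attribute strings in the Ensembl style rather than a real file, and `get_attr` is a hypothetical helper written for this example. For a full GTF, a dedicated parser would likely be more robust than a regular expression.

```r
# Made-up attribute strings in the Ensembl GTF style (column 9 of the file)
attrs <- c(
  'gene_id "ENSG_A"; transcript_id "ENST_A1"; gene_name "GENE1"; transcript_name "GENE1-001";',
  'gene_id "ENSG_A"; transcript_id "ENST_A2"; gene_name "GENE1"; transcript_name "GENE1-002";',
  'gene_id "ENSG_B"; transcript_id "ENST_B1"; gene_name "GENE2"; transcript_name "GENE2-001";'
)

# Pull out the quoted value that follows a given key (NA if the key is absent)
get_attr <- function(x, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

# One row per distinct gene_id / gene_name pair
lookup <- unique(data.frame(
  gene_id   = get_attr(attrs, "gene_id"),
  gene_name = get_attr(attrs, "gene_name"),
  stringsAsFactors = FALSE
))

# Named vector for direct lookup: names are gene IDs, values are gene names
id2name <- setNames(lookup$gene_name, lookup$gene_id)
print(id2name)
```

The two transcripts of ENSG_A collapse to a single row here, so the table is keyed by gene rather than by transcript, and `id2name[gene_ids]` can label results once the analysis is done.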

Tophat/Bowtie not using gene symbols from gtf file
