  • Tophat/Bowtie not using gene symbols from gtf file

    I have supplied Tophat/Bowtie with a gtf file from Ensembl. However, instead of making use of the gene symbols (the "gene_name" attribute in the gtf file, for example "DDX11L1"), it seems to use the Ensembl gene IDs (the "gene_id" attribute in the gtf file, for example "ENSG00000223972").

    How do I get Tophat/Bowtie to use the "gene_name" column instead? Is this possible?

  • #2
    The output BAM file from tophat/bowtie uses purely genomic coordinates, so I'm not sure in what step you're seeing anything from the GTF file (unless you're looking at the transcriptome index files). Do you mean htseq-count?

    BTW, the gene_name column isn't always unique, so you're often better off with gene ids (trivially convertible in R, which you're presumably using for downstream analysis).
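A minimal sketch of the conversion mentioned above, in base R. The table `id2name` and every ID other than ENSG00000223972 are hypothetical placeholders, not values from a real annotation; the point is only that IDs act as unique keys while names may repeat.

```r
# Hypothetical id-to-name table; the two DDX11L1 rows stand in for distinct
# gene IDs that share one symbol (ENSG_EXAMPLE_* are made-up IDs).
id2name <- data.frame(
  gene_id   = c("ENSG00000223972", "ENSG_EXAMPLE_B", "ENSG_EXAMPLE_C"),
  gene_name = c("DDX11L1", "DDX11L1", "GENE_C"),
  stringsAsFactors = FALSE
)

# Gene IDs should be unique keys; gene names need not be.
ids_unique   <- anyDuplicated(id2name$gene_id) == 0
shared_names <- unique(id2name$gene_name[duplicated(id2name$gene_name)])

# Convert a vector of IDs to names by lookup; unknown IDs become NA.
ids <- c("ENSG_EXAMPLE_C", "ENSG00000223972", "ENSG_MISSING")
names_for_ids <- id2name$gene_name[match(ids, id2name$gene_id)]
```

Here `shared_names` lists the symbols that would collide if names were used as row keys.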



    • #3
      Thank you. It is indeed for downstream analysis that I need it (I just assumed it had to be tophat/bowtie that extracted information from the gtf, since I don't provide it to cuffdiff). However, my issue with the trivial conversion is precisely what you say about the gene_name column not being unique: if hypothetical features A and B have different gene ids but the same gene name, then how do I find the total expression value for that one gene? Do I sum all the entries within a sample with that gene name?



      • #4
        In my experience, it's generally better to perform all analyses with gene IDs and then just add a gene name annotation at the end. If you're going to do pathway or GO analysis, you're going to need a gene ID (of some sort) rather than a gene name anyway, so you might as well stick to those.

        Regarding simply summing counts: that can certainly work, depending on the exact nature of the question you're asking. It can also hide interesting changes, though I expect that's pretty unusual.
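If summing by gene name is what you want, one way to sketch it in base R is `rowsum()`. The matrix values, the row IDs and the `gene_name` mapping below are all made up for illustration; whether summing is appropriate for your expression measure is a separate question.

```r
# Made-up expression matrix: rows are gene IDs, columns are samples.
expr <- matrix(
  c(1, 2,
    3, 4,
    5, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("id_A", "id_B", "id_C"), c("sample1", "sample2"))
)

# Hypothetical mapping: id_A and id_B share the gene name "GENE_X".
gene_name <- c(id_A = "GENE_X", id_B = "GENE_X", id_C = "GENE_Y")

# rowsum() adds up rows that share a group label, column by column.
summed <- rowsum(expr, group = unname(gene_name[rownames(expr)]))
```

The result has one row per gene name, so any difference between id_A and id_B is no longer visible afterwards.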



        • #5
          What we are looking for is a list of genes that are differentially expressed between different organ metastases. The plan is to go on with biological validation of this gene list using knockout constructs in xenografts. I am unsure as to whether it would be more applicable to stick to gene IDs or the summed gene symbols in this case?

          I attempted the task in R, and found that 55000 of the 77000 UCSC gene IDs do not have a corresponding gene symbol. This seems very strange, doesn't it?



          • #6
            That seems rather odd. You might post the commands you used for the conversion and a couple of examples of non-converting IDs.



            • #7
              # Load libraries and files
              library(cummeRbund)
              cuff <- readCufflinks()
              gene.individual <- fpkmMatrix(genes(cuff))
              annotation <- read.table("/data/reference/annotation_ucsc-id_gene-symbol2.txt",header=TRUE)
              names(annotation) <- c("kgID","geneSymbol")

              # Creating error dumps
              error.morethanone <- NULL
              error.fewerthanone <- NULL

              # Add column to gene.individual with new annotation
              gene.individual$geneSymbol <- NA
              for (id in row.names(gene.individual)) {
                x <- length(annotation$geneSymbol[annotation$kgID==id])

                if(x<1) error.fewerthanone <- c(error.fewerthanone,id)
                if(x>1) error.morethanone <- c(error.morethanone,id)
                if(x==1) gene.individual$geneSymbol[row.names(gene.individual)==id] <- as.character(annotation$geneSymbol[annotation$kgID==id])
              }
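The loop above scans the whole annotation table once per ID, which can be slow for tens of thousands of rows. A possible vectorised equivalent in base R is sketched below. The toy `gene.individual` and `annotation` objects are made-up stand-ins for the real ones (no cummeRbund needed to run it), and the IDs and symbols are placeholders.

```r
# Toy stand-ins for the objects in the post (values are made up).
gene.individual <- data.frame(s1 = c(1, 2, 3),
                              row.names = c("uc_a.1", "uc_b.1", "uc_c.1"))
annotation <- data.frame(kgID = c("uc_a.1", "uc_b.1", "uc_b.1"),
                         geneSymbol = c("SYM_A", "SYM_B1", "SYM_B2"),
                         stringsAsFactors = FALSE)

ids <- row.names(gene.individual)

# Number of annotation rows per expression ID (0 for IDs that are absent).
n.hits <- as.vector(table(factor(annotation$kgID, levels = ids)))

error.fewerthanone <- ids[n.hits < 1]
error.morethanone  <- ids[n.hits > 1]

# Assign a symbol only where exactly one annotation row matches.
gene.individual$geneSymbol <- NA_character_
ok <- n.hits == 1
gene.individual$geneSymbol[ok] <-
  as.character(annotation$geneSymbol[match(ids[ok], annotation$kgID)])
```

One small difference from the loop: when nothing falls into an error category, the corresponding vector is `character(0)` rather than `NULL`.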



              • #8
                > head(error.fewerthanone)
                [1] "uc001aab.3" "uc001aac.3" "uc001aae.3" "uc001aah.3" "uc001aak.2"
                [6] "uc001aam.3"



                • #9
                  Ah, UCSC gene IDs, those will always give you headaches. It looks like you are mixing multiple versions of the knownGene database. In the most recent one, uc001aab.3 and uc001aah.3 (as an example) are merged together into uc001aah.4, which probably exists in your ucsc id to gene name table. You might just download kg5ToKg6.txt.gz from UCSC and use it to update one of your annotation files (or just switch to Ensembl, their annotations have given me fewer headaches).
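Before converting anything, a quick way to check whether the mismatches are only version differences is to compare the IDs with the trailing version number stripped. This is a rough diagnostic sketch: `expr.ids` reuses IDs from the error list above, while `annot.ids` is a made-up annotation column, and `strip.version` is a hypothetical helper. It will not detect IDs that were merged under a different accession.

```r
# IDs from the error list, plus a made-up annotation column that carries
# a newer version of one of them.
expr.ids  <- c("uc001aab.3", "uc001aah.3", "uc001aak.2")
annot.ids <- c("uc001aah.4", "uc001aak.2")

# Hypothetical helper: drop the trailing ".<version>" to compare accessions.
strip.version <- function(x) sub("\\.[0-9]+$", "", x)

exact.match  <- expr.ids %in% annot.ids
base.match   <- strip.version(expr.ids) %in% strip.version(annot.ids)
version.only <- base.match & !exact.match  # same accession, other version
absent       <- !base.match                # accession missing entirely
```

A large `version.only` group would point to two different knownGene releases being mixed.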



                  • #10
                    Ah! I just recently ran the pipeline with the Ensembl files, so I will try that approach and see whether it goes better. Thanks for the tip.

                    Would you recommend using gene IDs or switching to summed gene symbols with this kind of research question? My assumption was that since we are not interested in any information at the level of individual isoforms, it would be better to sum them together.



                    • #11
                      There's normally a difference between gene id and transcript id, with genes (sometimes with the same name, but then different gene ids) having multiple transcript IDs. If you download the human GTF annotation from Ensembl, you will find this to be the case. In that case, just use the gene id, since you don't care about particular transcripts.



                      • #12
                        I am using the Ensembl GTF annotation, yes. I'm seeing four things in my GTF:
                        - gene_id
                        - gene_name
                        - transcript_id
                        - transcript_name

                        My question was primarily about how to end up with the gene_name instead of the gene_id.
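One way to get there, following the advice earlier in the thread, is to keep gene_id through the analysis and attach gene_name at the end from a lookup table built out of the GTF's attribute column. The sketch below uses base R only. The two GTF lines, the transcript ID and the `results` table are made up for illustration, and `get.attr` is a hypothetical helper; for a real file you would pass the file path to `read.delim()` instead of `text =`.

```r
# Two illustrative GTF lines in Ensembl style (tab-separated, 9 columns).
gtf.lines <- c(
  paste("1", "havana", "gene", "11869", "14409", ".", "+", ".",
        'gene_id "ENSG00000223972"; gene_name "DDX11L1";', sep = "\t"),
  paste("1", "havana", "transcript", "11869", "14409", ".", "+", ".",
        'gene_id "ENSG00000223972"; transcript_id "ENST_EXAMPLE"; gene_name "DDX11L1";',
        sep = "\t")
)

# quote = "" keeps the double quotes inside the attribute column intact.
gtf <- read.delim(text = gtf.lines, header = FALSE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE)
attrs <- gtf[[9]]

# Hypothetical helper: return the quoted value after `key`, or NA if absent.
get.attr <- function(x, key) {
  hits <- regmatches(x, regexec(paste0('(^|[; ])', key, ' "([^"]*)"'), x))
  vapply(hits, function(h) if (length(h) == 3) h[3] else NA_character_,
         character(1))
}

# One row per distinct gene_id / gene_name pair.
id2name <- unique(data.frame(gene_id   = get.attr(attrs, "gene_id"),
                             gene_name = get.attr(attrs, "gene_name"),
                             stringsAsFactors = FALSE))

# Made-up results keyed by gene_id; the name is added as the last step.
results <- data.frame(gene_id = "ENSG00000223972", fpkm = 1.5,
                      stringsAsFactors = FALSE)
results$gene_name <- id2name$gene_name[match(results$gene_id, id2name$gene_id)]
```

Because the lookup goes through gene_id, two genes that share a name stay on separate rows and simply show the same gene_name.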
