Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate transcript length?

    Dear all,

    Do you know how to calculate transcript length for calculating RPKM?
    For each UCSC gene (transcript), is its length equal to the sum of the length of its extrons?

    Thanks,

    Feng

  • #2
    That is correct, based on my understanding. Someone please correct me if I'm wrong.

    Tophat has the gffread utility which can take a .gtf and a genome reference and extract all transcripts into a new fasta, which you could then use to calculate the length for each. Using samtools faidx you could quickly get the lengths for each (columns 1 and 2).

    Comment


    • #3
      andylemire is correct. Another method for calculating lengths would be to just feed your annotation file to R:

      Code:
      #!/usr/bin/env Rscript
      library(GenomicRanges)
      library(rtracklayer)
      GTFfile = "some_annotation_file.GTF"
      GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
      grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
      reducedGTF <- unlist(grl, use.names=T)
      elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
      elementMetadata(reducedGTF)$widths <- width(reducedGTF)
      calc_length <- function(x) {
          sum(elementMetadata(x)$widths)
      }
      output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
      colnames(output) <- c("Length")
      #You can now do whatever you want with "output"
      The original version of the script above also calculated GC content (sometimes useful for RNAseq normalization).

      Comment


      • #4
        Dear all,
        Than you for your reply very much.

        By the way, how do you map UCSC gene ID to entrez ID? Which UCSC table should I use?

        Can all the UCSC genes (transcripts) be mapped to entrez gene?

        Comment


        • #5
          FYI, your life will be easier if you stick to Ensembl annotations, they're better in pretty much all respects.

          Regarding your actual question, have a look at the knownToLocusLink table. LocusLink is (at least normally) the same as the entrez ID, since locus link was the predecessor to entrez. You might also find the various kgXX (known gene, aka UCSC ID based) tables and the knownToXXX (known gene to something else conversion) useful.

          Comment


          • #6
            Hi dpryan,

            Thanks for your reply. The knownToLocusLink has the mapping from UCSC genes to Entrez ID. However, about 20% UCSC genes have no related Entrez ID. For these un-mapped UCSC, should I just ignore them when estimating gene expression from isoform expression?

            Comment


            • #7
              Why are you trying to estimate gene expression from isoform expression instead of just counting over genes? That latter would seem much easier. For the 20% with no Entrez ID, have a look at what a few of them are. I recall most of them being Riken genes or pseudogenes (i.e., not interesting for you), but it's been quite a while...

              Comment


              • #8
                Do you mean sum the read counts which are aligned to gene?
                For my data, I only have the count number of isoforms.

                Comment


                • #9
                  Yes, summing the read count per-gene is the normal method. You should have the raw alignments, so just use featureCounts or htseq-count to get the per-gene metrics.

                  Comment


                  • #10
                    Get it. Thanks!

                    Comment


                    • #11
                      Dear dpryan,

                      I wrote a script similar to your script, to calculate the size of ensembl genes. However, I have a question for you. How you handle transcripts that overlap ?
                      Indeed, by not taking into account 'transcripts that overlap' in my script, I wondered if I did not overestimated the length of genes.
                      In your opinion, is it more just to find these transcripts and count the region once rather than twice (or more)?

                      Thank you in advance for your answers.

                      (PS: Sorry if my english is bad, it is not my native language)

                      Comment


                      • #12
                        Hi velt,

                        The relevant portion of my script is the "reduce()" function. In GenomicRanges in R, this merges together any overlapping exons. So, if you have two overlapping transcripts, the bases in the overlap region are only counted once (as opposed to being double counted). As you mentioned, not doing this will lead to heavily spliced transcripts appearing much much longer than they actually are.

                        Hope that helps (and no worries about your English, it's quite good and vastly better than my French)!

                        Comment


                        • #13
                          Thank you very much for your reply ! I will proceed in this manner to treat transcripts that overlap.

                          Have a nice day.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin




                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                            04-22-2024, 07:01 AM
                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          59 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          57 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          56 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X