Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • The easiest/fastest way to get from BAM to TPM or RPKM

    I have already aligned all my RNA-seq samples using tophat2. I've done the differential analysis using htseq-count and edgeR. I would like to get rpkm and tpm metrics. What is the easiest way to do that using the output of tophat2, htseq-count and edgeR? I also have the reference GTF. So, please, recommend a tool or a workflow that would take me to TPKM or RPKM in less than a day - I have a machine with 32GB RAM and 8 cores and my study comprises 24 samples.

  • #2
    There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.

    Comment


    • #3
      Hi feralBiologist,

      There's a qucik way to do it:

      1. convert your SAM/BAM to wiggle file (you can use bedtools)

      2. multiply every value in wiggle by (1,000,000/no. of reads)

      for eg: if you have 150 million reads, (1,000,000/150,000,000) would be 0.006666

      So multiply every wiggle by 0.00666 !

      Comment


      • #4
        Originally posted by dpryan View Post
        There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.
        I could go without the multimappers and I'm happy using the union model. So in this case I could just take the htseq-count results and just divide by the gene length as contained in the GTF file?

        Comment


        • #5
          Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.
          Code:
          #!/usr/bin/Rscript
          library(GenomicRanges)
          library(rtracklayer)
          library(Rsamtools)
          
          GTFfile = "something.GTF"
          FASTAfile = "something.fa"
          
          #Load the annotation and reduce it
          GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
          grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
          reducedGTF <- unlist(grl, use.names=T)
          elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
          
          #Open the fasta file
          FASTA <- FaFile(FASTAfile)
          open(FASTA)
          
          #Add the GC numbers
          elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
          elementMetadata(reducedGTF)$widths <- width(reducedGTF)
          
          #Create a list of the ensembl_id/GC/length
          calc_GC_length <- function(x) {
              nGCs = sum(elementMetadata(x)$nGCs)
              width = sum(elementMetadata(x)$widths)
              c(width, nGCs/width)
          }
          output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
          colnames(output) <- c("Length", "GC")
          
          write.table(output, file="GC_lengths.tsv", sep="\t")
          Change "something.GTF", obviously. You then have the lengths for rpkm() in edgeR.

          Comment


          • #6
            @dpryan, @a_mt: Thanks for your fast replies. I ran the code by dpryan and it works neatly!

            Comment


            • #7
              Just for the record, I'll link the reply of Madelaine, too: http://www.biostars.org/p/85148

              Comment


              • #8
                Glad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.

                Comment


                • #9
                  Originally posted by dpryan View Post
                  Glad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.
                  Aren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.

                  Comment


                  • #10
                    Originally posted by westerman View Post
                    Aren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.
                    There's a top 10 list of things not to do on forums like these floating around, and one of those suggestions is to not to post the question on two different sites.

                    Even if the readership is different, it's kind of a waste for someone there to spend the time answering the question when someone here has already given the asker what s/he wants. I think that's the reasoning.

                    Comment


                    • #11
                      Hi everyone,

                      I've been trying to get RPKM's too and I get the following error:

                      rpkm2 <- rpkm(d, gene.length=length, normalized.lib.size=TRUE, log=FALSE)
                      Warning message:
                      In y/gene.length.kb :
                      longer object length is not a multiple of shorter object length

                      I do get an output of RPKM's, so I am wondering what the error is about. Does anyone know what the problem is? Are there overlaps or something like that?

                      Comment


                      • #12
                        That's a warning, not an error. This is due to you giving a different number of gene lengths than there are genes in "d".

                        Comment


                        • #13
                          Yes, that was it. I didn't pay attention when I filtered out some genes in a previous step.

                          Comment


                          • #14
                            Originally posted by dpryan View Post
                            Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.
                            Code:
                            #!/usr/bin/Rscript
                            library(GenomicRanges)
                            library(rtracklayer)
                            library(Rsamtools)
                            
                            GTFfile = "something.GTF"
                            FASTAfile = "something.fa"
                            
                            #Load the annotation and reduce it
                            GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
                            grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
                            reducedGTF <- unlist(grl, use.names=T)
                            elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
                            
                            #Open the fasta file
                            FASTA <- FaFile(FASTAfile)
                            open(FASTA)
                            
                            #Add the GC numbers
                            elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
                            elementMetadata(reducedGTF)$widths <- width(reducedGTF)
                            
                            #Create a list of the ensembl_id/GC/length
                            calc_GC_length <- function(x) {
                                nGCs = sum(elementMetadata(x)$nGCs)
                                width = sum(elementMetadata(x)$widths)
                                c(width, nGCs/width)
                            }
                            output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
                            colnames(output) <- c("Length", "GC")
                            
                            write.table(output, file="GC_lengths.tsv", sep="\t")
                            Change "something.GTF", obviously. You then have the lengths for rpkm() in edgeR.

                            Hi D

                            I found your post to calculate the gene length.
                            I have one BAM file and want to get the RPKM for each genes. I have the genes.gtf and genome.fa in the same directory.
                            For your script, I only change your command to 'GTF <- import.gff(GTFfile, format="gtf", genome="hg19", asRangedData=F, feature.type="exon")'
                            Other are same.
                            after I ran this script I got the error
                            "Error in letterFrequency(getSeq(FASTA, reducedGTF), "GC") :
                            error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': Error in value[[3L]](cond) :
                            record 1666 (chr6_ssto_hap7:1871448-1871615) failed
                            file: genome.fa
                            Calls: getSeq ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
                            Execution halted
                            "

                            Why?
                            Thank you!
                            Last edited by super0925; 09-10-2015, 08:45 AM.

                            Comment


                            • #15
                              Originally posted by a_mt View Post
                              Hi feralBiologist,

                              There's a qucik way to do it:

                              1. convert your SAM/BAM to wiggle file (you can use bedtools)

                              2. multiply every value in wiggle by (1,000,000/no. of reads)

                              for eg: if you have 150 million reads, (1,000,000/150,000,000) would be 0.006666

                              So multiply every wiggle by 0.00666 !
                              Hi a_mt Could you give me the commands? I don't know how to transfer from SAM to wiggle by bedtools.
                              I now have sample.BAM , genome.fa and genes.gtf in the same directory.
                              Thank you very much!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X