
**dpryan** · 10-31-2013, 04:53 AM

There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.

**a_mt** · 10-31-2013, 04:54 AM

Hi feralBiologist,

There's a quick way to do it:

1. convert your SAM/BAM to a wiggle file (you can use bedtools)

2. multiply every value in the wiggle file by (1,000,000/no. of reads)

For example, if you have 150 million reads, (1,000,000/150,000,000) would be about 0.00667.

So multiply every wiggle value by 0.00667!
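
As a rough illustration of the scaling in step 2, here is a minimal base-R sketch. The coverage values and the `total_reads` figure are made up for the example; in practice they would come from the wiggle file and from the read count of the alignment. The result is per-million-scaled coverage, which is not yet length-normalised per gene.

```r
# Hypothetical per-position coverage values, as they might appear in a wiggle file
coverage <- c(0, 3, 15, 150, 42)

# Assumed total number of reads in the library (150 million, as in the example)
total_reads <- 150e6

# Per-million scaling factor: 1,000,000 / number of reads
scale_factor <- 1e6 / total_reads

# Multiply every coverage value by the scaling factor
scaled_coverage <- coverage * scale_factor

print(round(scale_factor, 5))
print(scaled_coverage)
```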

**feralBiologist** · 10-31-2013, 07:47 AM

Originally posted by dpryan

There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.

I could go without the multimappers and I'm happy using the union model. So in this case, could I just take the htseq-count results and divide by the gene length as contained in the GTF file?
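
For reference, the usual RPKM definition divides by the library size as well as the gene length. Writing $c_g$ for the read count of gene $g$, $L_g$ for its length in bases, and $N$ for the total number of counted reads (symbols chosen here for illustration), RPKM and TPM can be written as:

```latex
\mathrm{RPKM}_g = \frac{c_g}{\left(L_g / 10^{3}\right)\left(N / 10^{6}\right)}
                = \frac{10^{9}\, c_g}{N \, L_g},
\qquad
\mathrm{TPM}_g = 10^{6} \cdot \frac{c_g / L_g}{\sum_j c_j / L_j}
```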

**dpryan** · 10-31-2013, 08:09 AM

Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.

Code:

#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "something.GTF"
FASTAfile = "something.fa"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")

Change "something.GTF", obviously. You then have the lengths for rpkm() in edgeR.
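
To show the arithmetic that follows once the lengths are available, here is a minimal base-R sketch. It is hand-rolled for illustration and is not the edgeR implementation; the gene IDs, counts, and lengths are invented, `lengths_tab` only mimics the layout of GC_lengths.tsv, and taking the library size as the sum of the counts is an assumption.

```r
# Hypothetical stand-in for GC_lengths.tsv: row names are gene IDs
lengths_tab <- data.frame(
  Length = c(2000, 500, 10000),
  GC     = c(0.45, 0.52, 0.40),
  row.names = c("geneA", "geneB", "geneC")
)

# Hypothetical htseq-count output for one sample (gene ID -> read count)
counts <- c(geneB = 50, geneA = 200, geneC = 750)

# Align the lengths to the order of the counts by gene ID, not by position
gene_length <- lengths_tab[names(counts), "Length"]
stopifnot(!anyNA(gene_length))

# Assumption: library size is the sum of the counted reads
lib_size <- sum(counts)

# RPKM = count / (length in kb) / (library size in millions)
rpkm_manual <- counts / (gene_length / 1e3) / (lib_size / 1e6)

# TPM = length-normalised rate, rescaled so the values sum to one million
rate <- counts / gene_length
tpm_manual <- rate / sum(rate) * 1e6

print(round(rpkm_manual, 1))
print(round(tpm_manual, 1))
```

Matching by gene ID rather than by row position keeps the lengths in step with the counts even if the two tables are sorted differently.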

**feralBiologist** · 10-31-2013, 10:59 AM

@dpryan, @a_mt: Thanks for your fast replies. I ran the code by dpryan and it works neatly!

**feralBiologist** · 10-31-2013, 11:02 AM

Just for the record, I'll link the reply of Madelaine, too: http://www.biostars.org/p/85148

**dpryan** · 10-31-2013, 11:03 AM

Glad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.

**westerman** · 11-01-2013, 08:47 AM

Originally posted by dpryan

Glad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.

Aren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.

**swbarnes2** · 11-04-2013, 01:10 PM

Originally posted by westerman

Aren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.

There's a top 10 list of things not to do on forums like these floating around, and one of those suggestions is not to post the same question on two different sites.

Even if the readership is different, it's kind of a waste for someone there to spend the time answering the question when someone here has already given the asker what s/he wants. I think that's the reasoning.

**padmoo** · 07-29-2015, 01:34 AM

Hi everyone,

I've been trying to get RPKMs too, and I get the following error:

rpkm2 <- rpkm(d, gene.length=length, normalized.lib.size=TRUE, log=FALSE)
Warning message:
In y/gene.length.kb :
longer object length is not a multiple of shorter object length

I do get RPKMs as output, so I am wondering what the error is about. Does anyone know what the problem is? Are there overlaps or something like that?

**dpryan** · 07-29-2015, 01:39 AM

That's a warning, not an error. This is due to you giving a different number of gene lengths than there are genes in "d".
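
A minimal base-R sketch of how this warning arises and one way to avoid it, using invented gene IDs and values: two vectors of different lengths are divided, R recycles the shorter one, and the results no longer line up gene by gene.

```r
# Hypothetical counts after filtering: only two genes remain
counts <- c(geneA = 200, geneC = 750)

# Lengths still cover all three original genes
gene_length <- c(geneA = 2000, geneB = 500, geneC = 10000)

# Dividing vectors of length 2 and 3 recycles the shorter one and warns;
# tryCatch captures the warning message instead of printing it
msg <- tryCatch(counts / gene_length, warning = function(w) conditionMessage(w))
print(msg)

# Fix: subset the lengths by gene ID so both vectors line up
matched_length <- gene_length[names(counts)]
stopifnot(length(matched_length) == length(counts))
print(counts / matched_length)
```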

**padmoo** · 07-30-2015, 07:37 AM

Yes, that was it. I didn't pay attention when I filtered out some genes in a previous step.

**super0925** · 09-10-2015, 08:32 AM

Originally posted by dpryan

Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.


Hi D

I found your post on calculating gene lengths.
I have one BAM file and want to get the RPKM for each gene. I have genes.gtf and genome.fa in the same directory.
In your script, I only changed the import command to 'GTF <- import.gff(GTFfile, format="gtf", genome="hg19", asRangedData=F, feature.type="exon")'.
Everything else is the same.
After I ran the script, I got this error:
"Error in letterFrequency(getSeq(FASTA, reducedGTF), "GC") :
error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': Error in value[[3L]](cond) :
record 1666 (chr6_ssto_hap7:1871448-1871615) failed
file: genome.fa
Calls: getSeq ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
"

Why?
Thank you!

**super0925** · 09-10-2015, 08:52 AM

Originally posted by a_mt

Hi feralBiologist,

There's a quick way to do it:

1. convert your SAM/BAM to a wiggle file (you can use bedtools)

2. multiply every value in the wiggle file by (1,000,000/no. of reads)

For example, if you have 150 million reads, (1,000,000/150,000,000) would be about 0.00667.

So multiply every wiggle value by 0.00667!

Hi a_mt, could you give me the commands? I don't know how to convert SAM to wiggle with bedtools.
I now have sample.BAM, genome.fa and genes.gtf in the same directory.
Thank you very much!


The easiest/fastest way to get from BAM to TPM or RPKM
