I have already aligned all my RNA-seq samples using tophat2. I've done the differential analysis using htseq-count and edgeR. I would like to get rpkm and tpm metrics. What is the easiest way to do that using the output of tophat2, htseq-count and edgeR? I also have the reference GTF. So, please, recommend a tool or a workflow that would take me to TPKM or RPKM in less than a day - I have a machine with 32GB RAM and 8 cores and my study comprises 24 samples.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.
-
Hi feralBiologist,
There's a qucik way to do it:
1. convert your SAM/BAM to wiggle file (you can use bedtools)
2. multiply every value in wiggle by (1,000,000/no. of reads)
for eg: if you have 150 million reads, (1,000,000/150,000,000) would be 0.006666
So multiply every wiggle by 0.00666 !
Comment
-
Originally posted by dpryan View PostThere are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.
Comment
-
Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.
Code:#!/usr/bin/Rscript library(GenomicRanges) library(rtracklayer) library(Rsamtools) GTFfile = "something.GTF" FASTAfile = "something.fa" #Load the annotation and reduce it GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon") grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id)) reducedGTF <- unlist(grl, use.names=T) elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl)) #Open the fasta file FASTA <- FaFile(FASTAfile) open(FASTA) #Add the GC numbers elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1] elementMetadata(reducedGTF)$widths <- width(reducedGTF) #Create a list of the ensembl_id/GC/length calc_GC_length <- function(x) { nGCs = sum(elementMetadata(x)$nGCs) width = sum(elementMetadata(x)$widths) c(width, nGCs/width) } output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length)) colnames(output) <- c("Length", "GC") write.table(output, file="GC_lengths.tsv", sep="\t")
Comment
-
Originally posted by dpryan View PostGlad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.
Comment
-
Originally posted by westerman View PostAren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.
Even if the readership is different, it's kind of a waste for someone there to spend the time answering the question when someone here has already given the asker what s/he wants. I think that's the reasoning.
Comment
-
Hi everyone,
I've been trying to get RPKM's too and I get the following error:
rpkm2 <- rpkm(d, gene.length=length, normalized.lib.size=TRUE, log=FALSE)
Warning message:
In y/gene.length.kb :
longer object length is not a multiple of shorter object length
I do get an output of RPKM's, so I am wondering what the error is about. Does anyone know what the problem is? Are there overlaps or something like that?
Comment
-
Originally posted by dpryan View PostYeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.
Code:#!/usr/bin/Rscript library(GenomicRanges) library(rtracklayer) library(Rsamtools) GTFfile = "something.GTF" FASTAfile = "something.fa" #Load the annotation and reduce it GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon") grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id)) reducedGTF <- unlist(grl, use.names=T) elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl)) #Open the fasta file FASTA <- FaFile(FASTAfile) open(FASTA) #Add the GC numbers elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1] elementMetadata(reducedGTF)$widths <- width(reducedGTF) #Create a list of the ensembl_id/GC/length calc_GC_length <- function(x) { nGCs = sum(elementMetadata(x)$nGCs) width = sum(elementMetadata(x)$widths) c(width, nGCs/width) } output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length)) colnames(output) <- c("Length", "GC") write.table(output, file="GC_lengths.tsv", sep="\t")
Hi D
I found your post to calculate the gene length.
I have one BAM file and want to get the RPKM for each genes. I have the genes.gtf and genome.fa in the same directory.
For your script, I only change your command to 'GTF <- import.gff(GTFfile, format="gtf", genome="hg19", asRangedData=F, feature.type="exon")'
Other are same.
after I ran this script I got the error
"Error in letterFrequency(getSeq(FASTA, reducedGTF), "GC") :
error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': Error in value[[3L]](cond) :
record 1666 (chr6_ssto_hap7:1871448-1871615) failed
file: genome.fa
Calls: getSeq ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
"
Why?
Thank you!Last edited by super0925; 09-10-2015, 08:45 AM.
Comment
-
Originally posted by a_mt View PostHi feralBiologist,
There's a qucik way to do it:
1. convert your SAM/BAM to wiggle file (you can use bedtools)
2. multiply every value in wiggle by (1,000,000/no. of reads)
for eg: if you have 150 million reads, (1,000,000/150,000,000) would be 0.006666
So multiply every wiggle by 0.00666 !
I now have sample.BAM , genome.fa and genes.gtf in the same directory.
Thank you very much!
Comment
Latest Articles
Collapse
-
by seqadmin
The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...-
Channel: Articles
05-06-2024, 07:48 AM -
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 07:03 AM
|
0 responses
10 views
0 likes
|
Last Post
by seqadmin
Today, 07:03 AM
|
||
Started by seqadmin, 05-10-2024, 06:35 AM
|
0 responses
31 views
0 likes
|
Last Post
by seqadmin
05-10-2024, 06:35 AM
|
||
Started by seqadmin, 05-09-2024, 02:46 PM
|
0 responses
41 views
0 likes
|
Last Post
by seqadmin
05-09-2024, 02:46 PM
|
||
Started by seqadmin, 05-07-2024, 06:57 AM
|
0 responses
33 views
0 likes
|
Last Post
by seqadmin
05-07-2024, 06:57 AM
|
Comment