SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Cutoffs for TPM values mht Bioinformatics 1 03-16-2015 10:46 AM
fastest solution to get bam2fastq done memento Bioinformatics 34 05-31-2014 04:51 AM
Easiest way to get number of reads per contig? kmkocot Bioinformatics 2 05-14-2013 05:53 PM
Fastest way to extract differing positions from each alignment in a BAM file CHRYSES Bioinformatics 5 12-14-2011 12:28 PM
tophat/cufflinks bam vs. RPKM mgogol Bioinformatics 5 04-26-2010 10:58 AM

Reply
 
Thread Tools
Old 10-31-2013, 05:44 AM   #1
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default The easiest/fastest way to get from BAM to TPM or RPKM

I have already aligned all my RNA-seq samples using tophat2. I've done the differential analysis using htseq-count and edgeR. I would like to get rpkm and tpm metrics. What is the easiest way to do that using the output of tophat2, htseq-count and edgeR? I also have the reference GTF. So, please, recommend a tool or a workflow that would take me to TPKM or RPKM in less than a day - I have a machine with 32GB RAM and 8 cores and my study comprises 24 samples.
feralBiologist is offline   Reply With Quote
Old 10-31-2013, 05:53 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.
dpryan is offline   Reply With Quote
Old 10-31-2013, 05:54 AM   #3
a_mt
Member
 
Location: C:/Program files/Google/Chrome

Join Date: Jul 2012
Posts: 34
Default

Hi feralBiologist,

There's a qucik way to do it:

1. convert your SAM/BAM to wiggle file (you can use bedtools)

2. multiply every value in wiggle by (1,000,000/no. of reads)

for eg: if you have 150 million reads, (1,000,000/150,000,000) would be 0.006666

So multiply every wiggle by 0.00666 !
a_mt is offline   Reply With Quote
Old 10-31-2013, 08:47 AM   #4
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

Quote:
Originally Posted by dpryan View Post
There are a couple ways to go about that. Do you need/want to include multimappers in your fpkm/rpkm numbers (it's faster not to)? Do you want to try to estimate the appropriate transcript length or just use a union gene model (the latter is faster)? If you're happy just using the unique alignments (the counts file from htseq-count) and using a single precomputed size for each gene, then you can make the calculations very quickly. Otherwise, you end up needing cufflinks or something similar and a lot of time.
I could go without the multimappers and I'm happy using the union model. So in this case I could just take the htseq-count results and just divide by the gene length as contained in the GTF file?
feralBiologist is offline   Reply With Quote
Old 10-31-2013, 09:09 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.
Code:
#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "something.GTF"
FASTAfile = "something.fa"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")
Change "something.GTF", obviously. You then have the lengths for rpkm() in edgeR.
dpryan is offline   Reply With Quote
Old 10-31-2013, 11:59 AM   #6
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

@dpryan, @a_mt: Thanks for your fast replies. I ran the code by dpryan and it works neatly!
feralBiologist is offline   Reply With Quote
Old 10-31-2013, 12:02 PM   #7
feralBiologist
Member
 
Location: UK

Join Date: Jun 2011
Posts: 61
Default

Just for the record, I'll link the reply of Madelaine, too: http://www.biostars.org/p/85148
feralBiologist is offline   Reply With Quote
Old 10-31-2013, 12:03 PM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

Glad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.
dpryan is offline   Reply With Quote
Old 11-01-2013, 09:47 AM   #9
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by dpryan View Post
Glad to hear it worked. In the future, please try to just post on one forum and not here and biostars and the bioconductor email list. Most of the places have rules against cross-posting.
Aren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.
westerman is offline   Reply With Quote
Old 11-04-2013, 01:10 PM   #10
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by westerman View Post
Aren't the "rules" more about cross-posting within a forum? In other words don't post the same question in different threads on the same forum. I think that posting the same question to different forums/mailing lists is perfectly legitimate since there are likely to be different people reading those forums/lists.
There's a top 10 list of things not to do on forums like these floating around, and one of those suggestions is to not to post the question on two different sites.

Even if the readership is different, it's kind of a waste for someone there to spend the time answering the question when someone here has already given the asker what s/he wants. I think that's the reasoning.
swbarnes2 is offline   Reply With Quote
Old 07-29-2015, 02:34 AM   #11
padmoo
Member
 
Location: Norwich

Join Date: Jun 2015
Posts: 16
Default

Hi everyone,

I've been trying to get RPKM's too and I get the following error:

rpkm2 <- rpkm(d, gene.length=length, normalized.lib.size=TRUE, log=FALSE)
Warning message:
In y/gene.length.kb :
longer object length is not a multiple of shorter object length

I do get an output of RPKM's, so I am wondering what the error is about. Does anyone know what the problem is? Are there overlaps or something like that?
padmoo is offline   Reply With Quote
Old 07-29-2015, 02:39 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

That's a warning, not an error. This is due to you giving a different number of gene lengths than there are genes in "d".
dpryan is offline   Reply With Quote
Old 07-30-2015, 08:37 AM   #13
padmoo
Member
 
Location: Norwich

Join Date: Jun 2015
Posts: 16
Default

Yes, that was it. I didn't pay attention when I filtered out some genes in a previous step.
padmoo is offline   Reply With Quote
Old 09-10-2015, 09:32 AM   #14
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 205
Default

Quote:
Originally Posted by dpryan View Post
Yeah, so the edgeR rpkm() function from biostars would be the easiest then. You can get the gene length in R with the following script. I wrote it originally to do more than you want, so just remove the fasta and %GC specific stuff.
Code:
#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "something.GTF"
FASTAfile = "something.fa"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

#Open the fasta file
FASTA <- FaFile(FASTAfile)
open(FASTA)

#Add the GC numbers
elementMetadata(reducedGTF)$nGCs <- letterFrequency(getSeq(FASTA, reducedGTF), "GC")[,1]
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC/length
calc_GC_length <- function(x) {
    nGCs = sum(elementMetadata(x)$nGCs)
    width = sum(elementMetadata(x)$widths)
    c(width, nGCs/width)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length))
colnames(output) <- c("Length", "GC")

write.table(output, file="GC_lengths.tsv", sep="\t")
Change "something.GTF", obviously. You then have the lengths for rpkm() in edgeR.

Hi D

I found your post to calculate the gene length.
I have one BAM file and want to get the RPKM for each genes. I have the genes.gtf and genome.fa in the same directory.
For your script, I only change your command to 'GTF <- import.gff(GTFfile, format="gtf", genome="hg19", asRangedData=F, feature.type="exon")'
Other are same.
after I ran this script I got the error
"Error in letterFrequency(getSeq(FASTA, reducedGTF), "GC") :
error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': Error in value[[3L]](cond) :
record 1666 (chr6_ssto_hap7:1871448-1871615) failed
file: genome.fa
Calls: getSeq ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
"

Why?
Thank you!

Last edited by super0925; 09-10-2015 at 09:45 AM.
super0925 is offline   Reply With Quote
Old 09-10-2015, 09:52 AM   #15
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 205
Default

Quote:
Originally Posted by a_mt View Post
Hi feralBiologist,

There's a qucik way to do it:

1. convert your SAM/BAM to wiggle file (you can use bedtools)

2. multiply every value in wiggle by (1,000,000/no. of reads)

for eg: if you have 150 million reads, (1,000,000/150,000,000) would be 0.006666

So multiply every wiggle by 0.00666 !
Hi a_mt Could you give me the commands? I don't know how to transfer from SAM to wiggle by bedtools.
I now have sample.BAM , genome.fa and genes.gtf in the same directory.
Thank you very much!
super0925 is offline   Reply With Quote
Old 09-10-2015, 02:11 PM   #16
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

Quote:
Originally Posted by super0925 View Post
"Error in letterFrequency(getSeq(FASTA, reducedGTF), "GC") :
error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': Error in value[[3L]](cond) :
record 1666 (chr6_ssto_hap7:1871448-1871615) failed
file: genome.fa
Calls: getSeq ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
"
GTF files match the "top assembly", rather than the "primary assembly". Either download the matching fasta file or remove the haplotype patches from your GTF.
dpryan is offline   Reply With Quote
Old 09-10-2015, 09:30 PM   #17
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 205
Default

Quote:
Originally Posted by dpryan View Post
GTF files match the "top assembly", rather than the "primary assembly". Either download the matching fasta file or remove the haplotype patches from your GTF.
Sorry D, I didn't totally get what you said.
The genes.gtf and genome.fa are downloaded from the latest UCSC homo sapiens hg19 genome.
How could I 'download the matching fasta' or 'remove the haplotype patches' to use your script?
Thank you!

PS: I only change your script to
GTF <- import.gff(GTFfile, format="gtf", genome="hg19", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
super0925 is offline   Reply With Quote
Old 09-11-2015, 01:39 AM   #18
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

If you just need the lengths then remove the code that calculates GC content. Something like the following should work:

Code:
#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "something.GTF"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC
calc_GC_length <- function(x) {
    sum(elementMetadata(x)$widths)
}
output <- sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length)
dpryan is offline   Reply With Quote
Old 09-11-2015, 07:23 AM   #19
super0925
Senior Member
 
Location: UK

Join Date: Feb 2014
Posts: 205
Default

Quote:
Originally Posted by dpryan View Post
If you just need the lengths then remove the code that calculates GC content. Something like the following should work:

Code:
#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)

GTFfile = "something.GTF"

#Load the annotation and reduce it
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))

elementMetadata(reducedGTF)$widths <- width(reducedGTF)

#Create a list of the ensembl_id/GC
calc_GC_length <- function(x) {
    sum(elementMetadata(x)$widths)
}
output <- sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_GC_length)

Hi D
Brilliant! It works.
I have just changed one command
GTF <- import.gff(GTFfile, format="gtf", genome="hg19", asRangedData=F, feature.type="exon")
However, the result only contain 25369 genes.
However, there are 23170 rows in the count table generated by HTseq, which means there are 23170 genes in the human genome.
Why does the dimension are different? I don't think rpkm() could go further if the dimension are different...
Thank you!
super0925 is offline   Reply With Quote
Old 09-11-2015, 12:27 PM   #20
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,476
Default

There are some entries that htseq-count will skip (those without exons?), which presumably is why this happens. Anyway, you need to ensure that everything is in the same order anyway, so make sure to write meaningful row names and use those with match().
dpryan is offline   Reply With Quote
Reply

Tags
bam, htseq-count, rna-seq, rpkm, tpm

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:46 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO