SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Newbie Question: Calculating physical coverage from genome coverage tristanstoeber Illumina/Solexa 4 06-24-2013 10:53 AM
calculating the coverage over a gene kjaja Bioinformatics 3 06-01-2012 06:01 AM
calculating the coverage lyz1030 Bioinformatics 3 04-03-2012 05:03 PM
Script for calculating the average distribution of tags per gene from ChIP-seq?? Giles General 1 09-12-2011 02:25 AM
Gene list for calculating coverage nseh Bioinformatics 1 05-22-2011 06:31 AM

Reply
 
Thread Tools
Old 10-17-2012, 09:12 AM   #1
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Arrow RNA-seq: calculating per-gene coverage

Hello,

I know that this topic has been massively covered in this forum, but I wasn't able to find the right answer for my questions.

Basically, I'm dealing with some bacterial RNA-seq data. I have two biological replicates and two conditions (so four libraries). I used bowtie for alignment of my reads and subsequently DESeq to calculate means/fold change/stats in order to compare my two conditions.

Now, I would like to calculate gene coverage per 1kb (relative) in order to compare absolute expression of every gene. I was thinking that if I take number of reads and I divide it by the gene length and then multiply by 1000 I will get some kind of "gene expression ratio per 1kb" which I could use to compare for all my genes. So, I could say that gene "A" is ten times more expressed then gene "B" just by comparing those ratios. Is this the right way of doing it, or perhaps I'm missing something?

Also, I sought to use BEDTools for this, but I don't know if I could insert multiple files (biological replicates) for my analysis. Anybody know how?

Finaly, anybody know how to extract gene length from a GFF or even GenBank file? BEDTools does it but I have duplicated values (for gene/CDS/exons).

Thanks in advance! I don't know what I would do without this forum...

TP
ThePresident is offline   Reply With Quote
Old 10-17-2012, 04:29 PM   #2
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

After re-reading a couple of papers having for subject RNA-seq analysis, I'm a little bit confused. Many papers use Gene Expression Index (GEI) and it is calculated as mean coverage depth of the gene normalized by the total number of reads mapped to the non-rRNA regions (Wang 2011, BMC Genomics)

Well, in my case, I've performed HTSeq to get reads/gene count and then DESeq to compare those data for significant genes. And, normally, DESeq already takes into account library depth and perform correction, right? So, I can't do it again?

I might be missing something here... any help would be greatly appreciated!

TP
ThePresident is offline   Reply With Quote
Old 10-17-2012, 04:41 PM   #3
rflrob
Member
 
Location: Berkeley, CA

Join Date: May 2010
Posts: 50
Default

Is there a reason you don't want to use RPKM (reads per kilobase per million reads), such as you'd get out of a program like Cufflinks?

RPKM values are nice, since they correct for different samples arbitrarily being sequenced to different depths (and unless you magically loaded precisely the same amount on the sequencer, and the poisson noise worked exactly to zero, you did sequence them to slightly different depths). On the other hand, the statistics for properly interpreting differential expression from RPKM alone are unconvincing to me, at best, so you'd want to use something like DESeq anyways.

EDIT: another nice advantage is that people are used to seeing RPKM values, and have a sense of how to reason about them, and where they work/don't work, so reviewers will more readily accept them than something you cobbled together.

Last edited by rflrob; 10-17-2012 at 04:44 PM.
rflrob is offline   Reply With Quote
Old 10-17-2012, 05:03 PM   #4
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Thanks for your answer.

Well, that's exactly why I went for DESeq: it works with raw data (number of reads that mapped to a specific gene). Since I have only two biological replicates, I was not confident to compare with RPKM.

On the other side, if I just want to get the general idea of gene expression in my samples, I cannot compare raw counts for two genes that have different length and coming from different sequencing lanes (sequencing depth being different). I don't know if I'm being clear but, for example, let's say that a gene A and B have both 100 reads aligned but gene A is 200pb long and gene B 500pb, their absolute expression is not the same even though they have the same coverage (reads/gene).

Next, since data obtained from DESeq (coverage per gene, i.e. number of aligned reads per gene) is already normalized for sequencing depth, I could just take those data and divide it by genes lenght, so I would take into account seq-depth and gene length. Sound logic to me... and almost like RPKM.

Finally, I do agree that calculating RPKM is convenient because everybody is familiar with that and won't stick on it.

What do you think?
ThePresident is offline   Reply With Quote
Old 10-17-2012, 05:46 PM   #5
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
Default

The raw data (counts, i.e. DESeq, edgeR, etc) is the way to go for differential expression.

FPKM is the way to go for relative expression within a condition.

You just need to understand the limitations and deficiencies.

For example, map-ability can drastically effect your FPKM value. As can other biases.
__________________
--------------
Ethan
ETHANol is offline   Reply With Quote
Old 10-17-2012, 06:15 PM   #6
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Okay! So, I have to use cufflinks to get RPKM? (I'm dealing with single-end data, so it's RPKM in my case). I'll try that...

And, while we are on that subject, can you (or someone) give me some limitations and deficiencies on RPKM? What you mean by map-ability? The "ability" of reads to map on a reference sequence? Sorry if it sounds basic to you guys, but I just want to be sure I understand all this

TP
ThePresident is offline   Reply With Quote
Old 10-18-2012, 06:15 AM   #7
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62
Default

I can't say for certain where I got this impression, but I thought that relative expression in RNA-seq was close to un-knowable. (I know there have been papers criticizing RPKM lately*)

There are just too many other factors in play (most notably differential PCR amplification of A vs B, although others exist) to be able to confidently say that 10X as many counts reflects a true difference in relative gene expression.

Even spiked-in controls with known 'truth' values for their relative gene expression have some difficulty comparing A to B**

*Bias detection and correction in RNA-Sequencing data.
*Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples
**Synthetic spike-in standards for RNA-seq experiments.
jparsons is offline   Reply With Quote
Old 10-18-2012, 06:20 AM   #8
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Quote:
Originally Posted by jparsons View Post

There are just too many other factors in play (most notably differential PCR amplification of A vs B, although others exist) to be able to confidently say that 10X as many counts reflects a true difference in relative gene expression.
I do agree with this however I think it can give an overall idea... I'm not sure how much can we actually be confident in interpretation but it can give an idea especially if the difference is huge.
ThePresident is offline   Reply With Quote
Old 11-09-2012, 07:02 PM   #9
maize
Junior Member
 
Location: NY

Join Date: Apr 2011
Posts: 9
Default

Hi, does anybody know how to get the result of raw number of mapped reads for each gene?I don't want the FPKM or RPKM value.
maize is offline   Reply With Quote
Old 11-09-2012, 07:18 PM   #10
ETHANol
Senior Member
 
Location: Western Australia

Join Date: Feb 2010
Posts: 308
Default

HTSeq-count does the job. I have some scripts that will take the grunt work out of it, but it takes a little time to set up and there's a reasonable chance it won't work on your platform.
http://ethanomics.wordpress.com/2012...nalysis-suite/
__________________
--------------
Ethan
ETHANol is offline   Reply With Quote
Reply

Tags
bedtools, coverage, rpkm

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:49 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO