SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Survey: RNA-Seq analysis for Differential Gene/Transcript Expression bodhisattvax Bioinformatics 14 06-12-2013 10:06 AM
differential gene expression analysis between Different strains by RNA-seq qqtwee Bioinformatics 3 07-30-2012 05:19 AM
RNA-Seq Quantification and Differential Expression Analysis days369 RNA Sequencing 2 04-06-2011 01:24 AM

Reply
 
Thread Tools
Old 07-27-2013, 12:35 AM   #1
chenjy
Member
 
Location: China

Join Date: Oct 2010
Posts: 22
Default Expression quantification/differential expression gene analysis by RNA-Seq

Hey. RNA-Seq is very efficient to estimate gene expression level in terms of RPKM/FPKM and to perform differential expression analysis then. which of the following method do you prefer to calculate the RPKM/FPKM value/perform DEG analysis especially when a single gene has multiple isoforms?

RPKM calculation
a). Reads falling on any exon of the gene are counted and then normalized by the total length of all exons and library size;
b). Only reads falling on the constitutive exons are counted and normalized by the length of constitutive exons and library size;
c). Reads falling on the epresentative isoform (eg. longest isoform) of a gene are counted and normalized by isoform length and library size
d). by ERANGE software (http://www.nature.com/nmeth/journal/...meth.1226.html)
e). by cufflinks/cuffdiff software, which made effort to locate reads in isoform/transcrip level.

Differential Expression Gene Analysis
a) DEGseq, b) EdgeR, c), Cuffdiff

By the way, what's your comments on Cufflinks/Cuffidff software? Actually, I heard of both positive and critical opinions, which makes me coufused on whether it is reliable.
chenjy is offline   Reply With Quote
Old 07-27-2013, 02:39 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479
Default

DESeq(2), edgeR, or limma/voom are your best bets, so you can ignore the RPKM calculation issue. Regarding cufflinks/cuffdiff, the earlier versions were pretty suboptimal, though the more recent versions seem vastly improved.
dpryan is offline   Reply With Quote
Old 07-28-2013, 04:25 AM   #3
chenjy
Member
 
Location: China

Join Date: Oct 2010
Posts: 22
Default

Quote:
Originally Posted by dpryan View Post
DESeq(2), edgeR, or limma/voom are your best bets, so you can ignore the RPKM calculation issue. Regarding cufflinks/cuffdiff, the earlier versions were pretty suboptimal, though the more recent versions seem vastly improved.
Thanks for you reply!

Can you explain why previous versions of cufflinks/cuffdiff were suboptimal. Is it because cufflinks/cuffdiff works on transcript level, while it is not possible to accurately define all the transcripts based on RNA-Seq? or because the underlying statistic model is not good enough?
chenjy is offline   Reply With Quote
Old 07-28-2013, 07:25 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479
Default

It seems to have been more of a change in implementation, though I can't say for sure. Early on, different point versions were giving very different results, some of which were non-sensical. From reading the forums here, it seems that the most recent versions are more stable and use a somewhat different approach, though I'm not an expert on cufflinks/cuffdiff.
dpryan is offline   Reply With Quote
Old 07-30-2013, 08:38 AM   #5
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

You should read the paper "Computational methods for transcriptome annotation and quantification using RNA-seq" in Nature Methods.

In that paper the cuffdiff authors argue why transcript expression method is conceptually better than the exon union/intersection method.
NGSfan is offline   Reply With Quote
Old 07-30-2013, 07:44 PM   #6
chenjy
Member
 
Location: China

Join Date: Oct 2010
Posts: 22
Default

Quote:
Originally Posted by NGSfan View Post
You should read the paper "Computational methods for transcriptome annotation and quantification using RNA-seq" in Nature Methods.

In that paper the cuffdiff authors argue why transcript expression method is conceptually better than the exon union/intersection method.
Actually, the recent publication on Nature Biotechnology titled "Differential analysis of gene regulation at transcript resolution with RNA-seq" explicitly pointed out the advantage of transcript-level method over union/intersection method. However, Transcript-level expression quantification requires accurately allocating each read to specific transcript. I am not sure whether cuffdiff performs well regarding this. Is it possible to know the exact transcript where the read come from on the basis of RNA-Seq with short reads and uneven distribution across the whole transcript?

Last edited by chenjy; 07-31-2013 at 08:59 AM.
chenjy is offline   Reply With Quote
Old 07-30-2013, 07:56 PM   #7
chenjy
Member
 
Location: China

Join Date: Oct 2010
Posts: 22
Default

Quote:
Originally Posted by dpryan View Post
It seems to have been more of a change in implementation, though I can't say for sure. Early on, different point versions were giving very different results, some of which were non-sensical. From reading the forums here, it seems that the most recent versions are more stable and use a somewhat different approach, though I'm not an expert on cufflinks/cuffdiff.
I see. and the new version of Cuffdiff may require further evaluations.
chenjy is offline   Reply With Quote
Old 07-31-2013, 08:48 AM   #8
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62
Default

Quote:
Originally Posted by chenjy View Post
Actually, the recent publication on Nature Biotechnology titled "Differential analysis of gene regulation at transcript resolution with RNA-seq" explicitly pointed out the advantage of transcript-level method over union/intersection method. However, Transcript-level expression quantification requires accurate allocate each read to specific transcript. I am not sure whether cuffdiff performs well regarding this. Is it possible to know the exact transcript where the read come from on the basis of RNA-Seq with short reads and uneven distribution across the whole transcript?
The paper only provides a theoretical explanation for an advantage: There's no information whatsoever regarding whether or not those type of reads actually exist in data. While I assume they occur, I find it difficult to assume that they occur in greater proportion than the error rate of transcript-level read assignment, especially since i've seen replicate data (from *technical* reps of the exact same library, no less) where cufflinks assigns 100% of reads to transcript X in overlapping transcripts X/Y in rep A but 90% of them to transcript Y in rep B.
jparsons is offline   Reply With Quote
Old 07-31-2013, 09:08 AM   #9
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by jparsons View Post
The paper only provides a theoretical explanation for an advantage: There's no information whatsoever regarding whether or not those type of reads actually exist in data. While I assume they occur, I find it difficult to assume that they occur in greater proportion than the error rate of transcript-level read assignment, especially since i've seen replicate data (from *technical* reps of the exact same library, no less) where cufflinks assigns 100% of reads to transcript X in overlapping transcripts X/Y in rep A but 90% of them to transcript Y in rep B.
Thanks for sharing this! What were the RPKMs of these transcripts?

RNA-seq suffers from random sampling so if the overall gene expression level is low, then the chance is higher that reads will get assigned to another transcript even in technical replicates. Eg. if you only have 2-4 supporting reads for a splice junction, and the next time you sample it, you get 0 supporting reads for that same junction, I can imagine the reads being flipped to another transcript in the other technical replicate.

What was the FPKM status of this transcript? If it is "OK" then that would be troubling...
NGSfan is offline   Reply With Quote
Old 07-31-2013, 09:43 AM   #10
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62
Default

The two transcripts happen to be a regular transcript and its accompanying snoRNA ; it's certainly understandable that the reads are misassigned. At least it's not being called DE, but when the FPKM varies between 0 and 20 (or 20 and 40) it casts considerable doubt on the accuracy of the assignment. There are over 1000 reads in the 12kb region.


Code:
SNORA5A	SNORA5A	-	chr7:45139698-45151317	1a	1aU	OK	42.1291	0	-1.79769e+308	-1.79769e+308	0.402342	1	no
TBRG4	TBRG4	-	chr7:45139698-45151317	1a	1aU	OK	20.5787	22.3127	0.116716	-0.308804	0.75747	1	no
I have no idea why the sequences are 100% overlapping like this; in the GTF, the SNO is only 45143948-45144081.

Anyway, this was one of my more depressing observations while attempting to figure out how the heck to validate RNA-seq quantifications. It may just be a bug in the software (given that the genomic regions are changing,it seems like it must be), but even conceptually, no matter how good(/complicated) you make your statistical models, you're going to misassign some reads.

I'll go ahead and re-run this with 2.11 tonight just for kicks. It was done in 2.02 originally.

Last edited by jparsons; 07-31-2013 at 09:54 AM. Reason: adding last bit.
jparsons is offline   Reply With Quote
Old 07-31-2013, 10:25 AM   #11
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by jparsons View Post
The two transcripts happen to be a regular transcript and its accompanying snoRNA ; it's certainly understandable that the reads are misassigned. At least it's not being called DE, but when the FPKM varies between 0 and 20 (or 20 and 40) it casts considerable doubt on the accuracy of the assignment. There are over 1000 reads in the 12kb region.


Code:
SNORA5A	SNORA5A	-	chr7:45139698-45151317	1a	1aU	OK	42.1291	0	-1.79769e+308	-1.79769e+308	0.402342	1	no
TBRG4	TBRG4	-	chr7:45139698-45151317	1a	1aU	OK	20.5787	22.3127	0.116716	-0.308804	0.75747	1	no
I have no idea why the sequences are 100% overlapping like this; in the GTF, the SNO is only 45143948-45144081.

Anyway, this was one of my more depressing observations while attempting to figure out how the heck to validate RNA-seq quantifications. It may just be a bug in the software (given that the genomic regions are changing,it seems like it must be), but even conceptually, no matter how good(/complicated) you make your statistical models, you're going to misassign some reads.

I'll go ahead and re-run this with 2.11 tonight just for kicks. It was done in 2.02 originally.

Wow.. that one is a tricky case. They are both transcribed off the minus strand. I looked at it in the genome browser:

http://genome-euro.ucsc.edu/cgi-bin/...3&ensGene=pack

But I'm confused about the annotation, the Refseq track shows SNORA5A is sitting between TBRG4 exons. While UCSC genes track shows overlapping of TBRG4 exons with (a larger) SNORA5C.

Ensembl also has the smaller SNORA5A and SNORA5C in between exons.

What are your settings for cuffdiff? are you using --compatible-hits-norm ?
I would hope it would make things agree more with the GTF annotation. The defaults are not always the best and are actually flipped between cufflinks and cuffdiff (I have found from personal experience). Also the defaults claimed are not always the true defaults. For example --max-frag-multihits is listed as unlimited in cuffdiff, but it's actually set to 1 in the program.

Last edited by NGSfan; 07-31-2013 at 10:29 AM.
NGSfan is offline   Reply With Quote
Old 07-31-2013, 10:51 AM   #12
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62
Default

Unless "default=TRUE" isn't actually true for compatible-hits-norm, yes i am using it. I wasn't using the flag, however.

Interestingly, RSEM also says that the SNO disappears in 1aU. It gets the lengths correct, though.

(Different reference because obviously these programs are picky about references so you can't directly compare them without doing way too much extra work):

Code:
 
sample	gene_id	transcript_id			length	effectivelength	exp.count	TPM	FPKM 
1a:	ENSG00000206838	ENST00000384111		134.00	84.34	1.00	1.58	0.95
1aU:	ENSG00000206838	ENST00000384111		0.00	0.00	0.00	0.00	0.00

1a:	ENSG00000136270	ENST00000258770...	2020.44	1966.21	690.00	44.79	27.03

1aU:	ENSG00000136270	ENST00000258770...	1953.33	1900.96	585.00	47.69	28.96
Cuffdiff 2.11, with or without the compatible-hits-norm flag, give the same quantifications as the older version, although they're now NOTEST rather than OK. Whatever was going on with the length of this poor little SNO is still happening in the latest version of the software.

Last edited by jparsons; 08-01-2013 at 10:18 AM.
jparsons is offline   Reply With Quote
Old 08-02-2013, 04:06 AM   #13
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Default

Quote:
Originally Posted by jparsons View Post
Cuffdiff 2.11, with or without the compatible-hits-norm flag, give the same quantifications as the older version, although they're now NOTEST rather than OK. Whatever was going on with the length of this poor little SNO is still happening in the latest version of the software.
Thanks for the update jparsons. Interesting that it switched to NOTEST in the newer version.

Are you using the same Ensembl GTF with Cuffdiff as you do with RSEM?
NGSfan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO