SEQanswers

Old 08-16-2014, 11:08 AM   #1
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21
Advice for Statistics in Gene Expression??

Hi Everyone,

I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

I am looking for genes that are differentially expressed between these two populations. I thought that it would be interesting to look at genes that have the same overall expression but different isoform expression levels. I ran TopHat and Cufflinks to get my genes.fpkm_tracking and isoforms.fpkm_tracking files.

I am looking for advice on statistical tests I can use to find genes that are expressed at statistically similar levels, using the FPKM values in the genes.fpkm_tracking file, and then to compare the FPKM values for the isoforms to see whether there is a statistically significant difference between them.

I was looking online and found ideas for tests ranging from t-tests to Poisson distributions to the negative binomial distribution (which I know nothing about). I also found that a lot of existing programs like edgeR or DESeq use raw read count data, but Cufflinks only outputs FPKM values. How should I go about this?

I wasn't sure if Cuffdiff would be a good option for what I am trying to do. Also, in general, if I were to compare differential gene expression using the per-gene FPKM values in the genes.fpkm_tracking file from Cufflinks, what types of plots are ideal in this field? I have heard about density plots and heat maps, but I am not sure and would like advice from anyone who has done this before. I am familiar with R, if that helps.

Thanks in advance!!!
Old 08-16-2014, 12:54 PM   #2
Bukowski
Senior Member
 
Location: UK

Join Date: Jan 2010
Posts: 390

Quote:
Originally Posted by thickrick99 View Post
Hi Everyone,

I am a beginner in RNA-seq and am working on a project dealing with differential gene expression. I am trying to compare the gene expression between two populations each having 2-5 samples.

I am looking for genes that are differentially expressed between these two populations. I thought that it would be interesting to look at genes that have the same overall expression but different isoform expression levels. I ran TopHat and Cufflinks to get my genes.fpkm_tracking and isoforms.fpkm_tracking files.

You know that cuffdiff basically does this for you? I'm not sure why you're seeking a statistical test here (maybe I'm missing the point). It's not uncommon to find genes that are not significantly differentially expressed at the gene level, because there is no net change, but that have significantly differentially expressed isoforms: one goes up by the same amount as the other goes down. This is just a matter of checking what is significant in each case in the cuffdiff outputs.

The tracking files are not where you need to be looking; I'd advise you to look at the .diff files.

Another place to find these situations is the splicing.diff file, which shows you the genes whose distribution of isoforms changes between conditions.
Old 08-16-2014, 01:04 PM   #3
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21

Ok, thanks for the advice! I haven't run cuffdiff yet, but I will now. So what you're saying is that I can look for genes with similar expression in the .diff file and then use the cuffdiff output files to find the isoforms that differ, right?

Do you (or anyone else) have a suggestion regarding what type of plots I can use? It seems like 5000+ genes have similar expression, so what would be the best way to plot such a large amount of data?


Thanks!

Last edited by thickrick99; 08-16-2014 at 01:33 PM.
Old 08-16-2014, 01:33 PM   #4
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21

I understand that I don't actually need to compute a test statistic myself, since programs like cuffdiff can do it for me. However, as part of the project I am working on, I need to program a test statistic in R in order to get and compare the p-values, rather than having the programs do it for me. So how should I choose the test statistic for my design (2 populations, each with 2-5 samples, unpaired)? I am trying to test for differential gene expression, and I am not sure how to choose a statistical test for this using the FPKM values from Cufflinks.
Old 08-17-2014, 01:02 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

The most common test statistic in your circumstance would be the t-statistic. That ends up being similar to what cuffdiff uses internally anyway.
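
For anyone who wants to see what that computation involves for a single gene, here is a minimal sketch in plain Python. It assumes Welch's unequal-variance form of the t-test (which is also what R's t.test() uses by default) and gets the p-value by numerically integrating the t density. The function names and the FPKM values are made up for illustration, and the log2(FPKM + 1) transform is an assumption, not something prescribed in this thread.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    se2 = va + vb
    if se2 == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

# Hypothetical FPKM values for one gene, three samples per population.
group_a = [math.log2(x + 1) for x in (10.2, 12.5, 11.1)]
group_b = [math.log2(x + 1) for x in (25.3, 30.1, 27.8)]
t, df = welch_t(group_a, group_b)
p = two_sided_p(t, df)
```

Run over thousands of genes, the resulting p-values would still need a multiple-testing correction before being interpreted.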
Old 08-17-2014, 04:59 AM   #6
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21

Yeah, that's what I was planning to do as well. But I read online about other tests, like Poisson or negative binomial, and I wasn't sure whether these are better than the t-test.
Old 08-17-2014, 06:14 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Of those, only the t-test is compatible with FPKMs; Poisson and negative binomial models assume integer counts.
Old 08-17-2014, 06:28 AM   #8
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21

So if I wanted to use more complicated tests with Poisson or negative binomial models, I would have to use the raw read count data, right? Where do I get this information, assuming I used TopHat and then Cufflinks? Do I have to convert FPKM to read counts, or are the read counts in the accepted_hits.bam file?
Old 08-17-2014, 07:14 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

The general workflow is to map with tophat (or STAR or whatever else you want to use) and then quantify with htseq-count or featureCounts. The latter two will give you counts that you can use in a negative binomial model. If you used cufflinks to find new features, then just run it first and use the merged GTF file with the aforementioned counting programs. I wouldn't bother with a Poisson model; it's not worth your time.

BTW, there's no great way to convert between FPKM and raw counts, since raw counts don't include multimappers while FPKMs do.
Old 08-17-2014, 08:20 AM   #10
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21

Ok, thanks Devon! So I have my count data, but how do I go about using the negative binomial model? I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR, which use negative binomial models, but would prefer programming my own model in R. Is there a way to do this, and how would I get the p-values to compare the differential expression of the genes?

It would be great if you could give me some advice on how to program the negative binomial model into R and any resources that you think would help me do this in order to compare differential gene expression between the two populations.

Thanks for all your help!
Old 08-17-2014, 08:44 AM   #11
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

The simplest way would be to use glm.nb() from the MASS package.
Old 08-18-2014, 12:51 AM   #12
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91

Quote:
Originally Posted by thickrick99 View Post
I am not very familiar with these types of models (since I just started in this field very recently). For the project I don't want to use existing tools like edgeR which use negative binomial models but would prefer programming my own model into R.

If you are not familiar with the relevant models, wouldn't it make sense to use existing tools?

Anyway, the edgeR methods, dispersion estimation especially, are sufficiently sophisticated that it is very unlikely you could reproduce them yourself in any reasonable amount of time.

For example, glm.nb() implements simple maximum likelihood for the dispersion parameter, which will markedly underestimate the true dispersions for RNA-seq data and hence give overly liberal DE results. You need software like edgeR to do better; there's no easy way around it.
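
For readers new to the model, a rough sketch of what "dispersion" means may help. Under the usual negative binomial parameterization, a count with mean mu has variance mu + dispersion * mu^2, so a dispersion of 0 reduces to the Poisson case. The Python below pairs that formula with a crude per-gene method-of-moments estimate. This is neither the maximum likelihood fit in glm.nb() nor anything edgeR does; the function names and counts are hypothetical and only illustrate the quantity being estimated.

```python
from statistics import mean, variance

def nb_variance(mu, dispersion):
    """Variance of a negative binomial count with mean mu: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2

def moment_dispersion(counts):
    """Crude method-of-moments dispersion estimate, floored at 0 (the Poisson case)."""
    m = mean(counts)
    if m == 0:
        return 0.0
    return max(0.0, (variance(counts) - m) / m ** 2)

# Hypothetical raw counts for one gene in three replicates of one population.
counts = [90, 110, 150]
phi_hat = moment_dispersion(counts)
implied_var = nb_variance(mean(counts), phi_hat)
```

With only 2-5 replicates, a per-gene estimate like this rests on very few data points and is very noisy, which is part of why tools such as edgeR share information across genes instead of treating each gene in isolation.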
Old 08-18-2014, 06:09 AM   #13
thickrick99
Member
 
Location: Washington

Join Date: Jul 2014
Posts: 21

Ok, thanks for your help, Gordon. So I tried using the t-test on the FPKM values, but I realized that I can't do the test because of the FPKM values that are 0. Is there an accepted way to get around this so that I can do the t-test on the FPKM values?
Old 08-18-2014, 04:10 PM   #14
Gordon Smyth
Member
 
Location: Melbourne, Australia

Join Date: Apr 2011
Posts: 91

It is impossible to do a high performance statistical test on FPKM values alone, because they have varying precisions, and the precision depends on the original count size rather than on the FPKM value itself.
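
A small worked example may make this concrete. The Python sketch below uses the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments) in an idealized form that ignores multimappers and effective-length corrections, and it assumes pure Poisson counting noise, which is a lower bound on real RNA-seq variability. The function names, gene lengths and library size are hypothetical.

```python
import math

def expected_fragments(fpkm, length_bp, total_fragments):
    """Idealized inversion of FPKM: FPKM * (length in kb) * (library size in millions)."""
    return fpkm * (length_bp / 1e3) * (total_fragments / 1e6)

def poisson_cv(count):
    """Relative standard deviation of a Poisson count: 1 / sqrt(count)."""
    return 1 / math.sqrt(count)

# Two hypothetical genes with the same FPKM in a library of 20 million fragments.
short_gene = expected_fragments(5.0, 500, 20e6)    # 50 fragments
long_gene = expected_fragments(5.0, 5000, 20e6)    # 500 fragments

cv_short = poisson_cv(short_gene)   # about 0.14
cv_long = poisson_cv(long_gene)     # about 0.045
```

Both genes report an FPKM of 5, yet the shorter gene's value rests on a tenth as many fragments and is roughly three times noisier in relative terms. A test that sees only the FPKM values cannot tell these two cases apart.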
Tags
cuffdiff, differential expression, rna-seq, statistical analysis

All times are GMT -8.