Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Cufflinks, differentially expressed genes statsteam Bioinformatics 5 11-15-2013 11:28 AM
DESeq: "NA" generated in the resulted differentially expressed genes idyll_ty RNA Sequencing 8 05-02-2012 03:28 PM
DEseq: zero differentially expressed regions found rebrendi Bioinformatics 3 12-22-2011 05:00 AM
Detecting differentially expressed genes using aligner outputs questioner Bioinformatics 6 11-03-2011 07:15 AM
gene ontology over-representation of differentially expressed genes damiankao Bioinformatics 10 10-20-2011 01:57 PM

Thread Tools
Old 11-27-2011, 02:49 PM   #1
Location: Purdue University

Join Date: Nov 2009
Posts: 19
Default DESeq and EdgeR: too many differentially expressed genes!?!?


For a class project, I am mapping reads and counting reads per gene with RSEM for five different published RNA-seq datasets located at GEO / SRA. Most of the datasets were from raw RNA-seq data. Then, I am taking the outputs of RSEM and using them as inputs for DESeq and EdgeR. The goal of the project is to simply compare EdgeR and DESeq results and try to understand why they are different.

My problem is that for 4/5 of the datasets, I am seeing that several genes are differentially expressed even after filtering low count data.

I applied some heavy-duty arbitrary non-specific filtering and it still did not help. In this filtration, I took the max number of reads for each sample (in columns) and sorted the genes (in rows) by their maxes from highest to lowest. Then, I removed the bottom 75% of the data. Additionally, from the remaining 25% of this data, I filtered out genes where the ranks (with rank #1 having the highest number of reads) of each sample were all under 1000. I did this secondary filter to remove genes with excessive numbers of reads. This secondary filter did not help at all.

Despite this filtering, I cannot seem to get a reasonable number of differentially expressed genes. I am getting too many genes as mentioned earlier. There are 2 biological replicates for two different conditions in each of my datasets. The 4 datasets that are causing a problem are comparing gene expression between 1) embryo and endosperm in arabidopsis seeds, 2) benign and malignant prostate cancer cell lines, 3) mouse embryonic stem cells and mouse embryonic fibroblasts, and 4) kidney and liver tissue in human. I downloaded these datasets from GEO / SRA in sra.lite format and converted them to fastq. They were all single-read data.

The RNA-Seq dataset that seems to be working is comparing gene expression in estrogen-treated and untreated breast cancer cell lines. I obtained about 500 to 1000 differentially expressed genes for adjusted pvalues between 0.01 and 0.1 by FDR (BH) for both packages. Also, for this dataset, only a bed file was available at GEO. So, I just wrote custom perl scripts to convert the bed file to a fasta file to input it into RSEM (I extracted 32 base reads from the start coordinates in the genome with the proper orientation). It is also important to note that I excluded reads that mapped to rRNA according to the bed file. The bedfile basically defined rRNA as a chromosome.

I thought it interesting to note that when I did not filter the low count data from the estrogen RNA-seq dataset, DESeq reported around 7000 genes for the pvalue range mentioned above whereas edgeR still reported around 500 to 1000 genes. In the unfiltered dataset there were ~27,000 genes.

I know this is a lot of information but if anyone has any suggestions or advice on how to deal with this issue, I would greatly appreciate it. My project is due Thursday, and I getting tied up in the preprocessing and not the analysis, which was the whole point of the project.

Purdue University
Graduate Student

Last edited by cutcopy11; 11-28-2011 at 02:13 PM.
cutcopy11 is offline   Reply With Quote
Old 11-27-2011, 03:24 PM   #2
Location: Purdue University

Join Date: Nov 2009
Posts: 19

Also, I would like to mention that the liver and kidney samples were derived from data
that was used in the EdgeR vignette.

I am not sure how their gene count data set was generated.

I used data from these sources:

Each of these accession numbers contain 2 sra.lite files.
They might be technical replicates. I am not sure.

Anyway, in edgeR I did a moderated tagwise dispersion. Like in the
vignette, I set commonDisp = FALSE in the "exactTest" step and used the default
prior in the "estimateTagwiseDisp" step.

Here is a comparison of the results:
EdgeR Vignette:
4438 under a adjusted p-value of 0.05
My result:
5977 under an adjusted p-value of 0.05
5727 under an adjusted p-value of 0.01

Although these numbers under 0.05 pvalue are kind of close, I feel that
the number of differentially expressed genes under 0.01 should be much

Thanks again,
cutcopy11 is offline   Reply With Quote
Old 11-27-2011, 05:29 PM   #3
Location: Purdue University

Join Date: Nov 2009
Posts: 19

The high number of differentially expressed genes in the 4 datasets may be due to the fact that I am essentially comparing apples and oranges.

The four datasets consist of either two different tissues or two different cell lines.
The estrogen dataset is from the same cell line but one is treated whereas the other is not. So, there is likely less differentiation.

The most differentiation occurs between the kidney and liver, the two different prostate cancer cell lines, and the two arabidopsis tissues (over 5000 genes)

Interestingly, between the mouse emybyronic stem cells and the mouse embryonic fibroblasts, there were 1000 and 2000 differentially expressed genes. These two tissues are likely more similar than the three comparisons above.
cutcopy11 is offline   Reply With Quote
Old 11-28-2011, 10:18 AM   #4
Simon Anders
Senior Member
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Maybe concentrate on one of the data sets and tell u a bit more on what you did. For example, post the exact commands you typed into R.

It is also important to figure out whether you have true biological replicates. If you compare, say, two technical replicates from liver with two technical replicates from kidney, you will end up with a huge list of differentially expressed genes, which, however, is biologically completely meaningless.
Simon Anders is offline   Reply With Quote
Old 11-28-2011, 02:09 PM   #5
Location: Purdue University

Join Date: Nov 2009
Posts: 19

Hi Simon,

I may dig up my commands later, but I just wanted to say that when I said that DESeq reported around 7000 genes for the unfiltered estrogen dataset whereas edgeR reported around 500 to 1000 genes for adjusted pvalues between 0.01 and 0.1, this was incorrect. I realized later that many of those genes in the DESeq results had adjusted p values listed as "NA" . For an adjusted pvalue cutoff of 0.01, DESeq reported 486 genes and edgeR reported 509, which makes sense as DESeq is more conservative with low count data. Of course, you know that.

I agree with you entirely about the technical replicate issue. That makes complete sense.

cutcopy11 is offline   Reply With Quote
Old 12-08-2011, 12:14 AM   #6
Junior Member
Location: Cambridge, UK

Join Date: Oct 2009
Posts: 9
Default edgeR - pvalue NA

Hi !

Though it is not directly related to your question, I thought I post the question to an "edgeR" thread: Using edgeR, I usually get a higher number of p values with "NA" for reasonably differentlially expressed genes. Why is that (due to TMM nolmalization?) and is there any way to get around this ?
sdm is offline   Reply With Quote

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 08:40 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO