  • cutcopy11
    Member
    • Nov 2009
    • 19

    DESeq and EdgeR: too many differentially expressed genes!?!?

    Hello,

    For a class project, I am mapping reads and counting reads per gene with RSEM for five different published RNA-seq datasets located at GEO / SRA. Most of the datasets were from raw RNA-seq data. Then, I am taking the outputs of RSEM and using them as inputs for DESeq and EdgeR. The goal of the project is to simply compare EdgeR and DESeq results and try to understand why they are different.

    My problem is that for 4 of the 5 datasets, far too many genes are called differentially expressed, even after filtering out low-count data.

    I applied some heavy-duty, arbitrary, non-specific filtering and it still did not help. In this filtering, I took the maximum number of reads for each gene (rows) across the samples (columns) and sorted the genes by their maxima from highest to lowest. Then, I removed the bottom 75% of the data. Additionally, from the remaining 25%, I filtered out genes where the ranks of each sample (with rank #1 having the highest number of reads) were all under 1000. I did this secondary filter to remove genes with excessive numbers of reads. This secondary filter did not help at all.
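To make the first filtering step concrete, here is a minimal Python sketch of it under my reading of the description: rank genes by their maximum count across samples and keep the top 25%. The function name `filter_top_fraction_by_max`, the gene names, and the toy counts are all made up for illustration. The secondary rank-based filter is not shown, because its exact rule is not clear from the description.

```python
from typing import Dict, List


def filter_top_fraction_by_max(
    counts: Dict[str, List[int]], keep_fraction: float = 0.25
) -> Dict[str, List[int]]:
    """Keep genes whose maximum count across samples ranks in the top keep_fraction."""
    # Sort gene names by their per-gene maximum, highest first.
    ranked = sorted(counts, key=lambda gene: max(counts[gene]), reverse=True)
    # Always keep at least one gene so the result is never empty.
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    return {gene: row for gene, row in counts.items() if gene in kept}


# Toy count matrix (hypothetical): rows are genes, columns are 4 samples.
toy_counts = {
    "geneA": [0, 1, 0, 2],
    "geneB": [150, 180, 900, 1100],
    "geneC": [5, 3, 4, 6],
    "geneD": [40, 55, 38, 61],
    "geneE": [12, 9, 10, 7],
    "geneF": [0, 0, 0, 0],
    "geneG": [300, 280, 250, 290],
    "geneH": [25000, 24000, 100, 120],
}

filtered = filter_top_fraction_by_max(toy_counts, keep_fraction=0.25)
print(sorted(filtered))
```

Note that a filter like this keeps genes such as the hypothetical `geneH`, which is high in one condition and low in the other. Filtering on abundance alone therefore tends to retain the strongest differences rather than reduce the number of hits.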

    Despite this filtering, I cannot seem to get a reasonable number of differentially expressed genes; as mentioned earlier, I am getting too many. Each of my datasets has 2 biological replicates for each of two conditions. The 4 datasets that are causing a problem compare gene expression between 1) embryo and endosperm in Arabidopsis seeds, 2) benign and malignant prostate cancer cell lines, 3) mouse embryonic stem cells and mouse embryonic fibroblasts, and 4) kidney and liver tissue in human. I downloaded these datasets from GEO / SRA in sra.lite format and converted them to fastq. They were all single-read data.

    The RNA-Seq dataset that seems to be working compares gene expression in estrogen-treated and untreated breast cancer cell lines. I obtained about 500 to 1000 differentially expressed genes for adjusted p-values between 0.01 and 0.1 by FDR (BH) for both packages. For this dataset, only a BED file was available at GEO, so I wrote custom Perl scripts to convert the BED file to a fasta file for input into RSEM (I extracted 32-base reads from the start coordinates in the genome with the proper orientation). It is also important to note that I excluded reads that mapped to rRNA according to the BED file, which basically defined rRNA as a chromosome.

    It may be worth noting that when I did not filter the low-count data from the estrogen RNA-seq dataset, DESeq reported around 7000 genes for the p-value range mentioned above, whereas edgeR still reported around 500 to 1000 genes. The unfiltered dataset contained ~27,000 genes.

    I know this is a lot of information, but if anyone has any suggestions or advice on how to deal with this issue, I would greatly appreciate it. My project is due Thursday, and I am getting tied up in the preprocessing rather than the analysis, which was the whole point of the project.

    Thanks,
    Clayton
    Purdue University
    Graduate Student
    Last edited by cutcopy11; 11-28-2011, 03:13 PM.
  • cutcopy11
    Member
    • Nov 2009
    • 19

    #2
    Also, I would like to mention that the liver and kidney samples were derived from data
    that was used in the EdgeR vignette.

    I am not sure how their gene count data set was generated.

    I used data from these sources:
    kidney

    liver


    Each of these accession numbers contains 2 sra.lite files.
    They might be technical replicates. I am not sure.

    Anyway, in edgeR I did a moderated tagwise dispersion. Like in the
    vignette, I set commonDisp = FALSE in the "exactTest" step and used the default
    prior in the "estimateTagwiseDisp" step.

    Here is a comparison of the results:
    EdgeR Vignette:
    4438 under an adjusted p-value of 0.05
    My result:
    5977 under an adjusted p-value of 0.05
    5727 under an adjusted p-value of 0.01

    Although the numbers at the 0.05 cutoff are fairly close, I feel that
    the number of differentially expressed genes at the 0.01 cutoff should be
    much lower.

    Thanks again,
    Clayton


    • cutcopy11
      Member
      • Nov 2009
      • 19

      #3
      The high number of differentially expressed genes in the 4 datasets may be due to the fact that I am essentially comparing apples and oranges.

      The four datasets consist of either two different tissues or two different cell lines.
      The estrogen dataset is from the same cell line but one is treated whereas the other is not. So, there is likely less differentiation.

      The most differentiation occurs between the kidney and liver, the two different prostate cancer cell lines, and the two Arabidopsis tissues (over 5000 genes).

      Interestingly, between the mouse embryonic stem cells and the mouse embryonic fibroblasts, there were 1000 and 2000 differentially expressed genes. These two tissues are likely more similar than the three comparisons above.


      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Maybe concentrate on one of the data sets and tell us a bit more about what you did. For example, post the exact commands you typed into R.

        It is also important to figure out whether you have true biological replicates. If you compare, say, two technical replicates from liver with two technical replicates from kidney, you will end up with a huge list of differentially expressed genes, which, however, is biologically completely meaningless.
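A rough numerical illustration of the point about technical replicates, as a Python sketch. It assumes the usual negative binomial variance form, variance = mean + dispersion × mean², with dispersion near 0 for technical replicates (Poisson-like noise). The Wald-style z-score is only a back-of-the-envelope stand-in, not the test that DESeq or edgeR actually performs, and the means and the dispersion of 0.1 are made-up values.

```python
import math


def nb_variance(mean: float, dispersion: float) -> float:
    """Negative binomial variance; dispersion = 0 reduces to Poisson."""
    return mean + dispersion * mean ** 2


def rough_z(mean_a: float, mean_b: float, dispersion: float, n_reps: int = 2) -> float:
    """Crude z-score for a difference in mean counts (illustration only)."""
    standard_error = math.sqrt(
        nb_variance(mean_a, dispersion) / n_reps
        + nb_variance(mean_b, dispersion) / n_reps
    )
    return abs(mean_b - mean_a) / standard_error


# The same 1.2-fold difference, judged against two kinds of replicate noise.
z_technical = rough_z(1000.0, 1200.0, dispersion=0.0)   # technical replicates
z_biological = rough_z(1000.0, 1200.0, dispersion=0.1)  # assumed biological dispersion

print(f"technical replicates:  z = {z_technical:.2f}")
print(f"biological replicates: z = {z_biological:.2f}")
```

With only shot noise in the denominator, even a modest fold change looks overwhelmingly significant. Once biological variability is included, the same difference is well within the noise.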


        • cutcopy11
          Member
          • Nov 2009
          • 19

          #5
          Hi Simon,

          I may dig up my commands later, but I wanted to correct something first. I said earlier that DESeq reported around 7000 genes for the unfiltered estrogen dataset, whereas edgeR reported around 500 to 1000 genes, for adjusted p-values between 0.01 and 0.1. That was incorrect: I realized later that many of those genes in the DESeq results had adjusted p-values listed as "NA". For an adjusted p-value cutoff of 0.01, DESeq reported 486 genes and edgeR reported 509, which makes sense as DESeq is more conservative with low-count data. Of course, you know that.
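One plausible way such an inflated count can arise is that rows with a missing adjusted p-value get carried along when subsetting (in R, indexing with a logical vector that contains NA returns NA rows as well). The Python sketch below is an assumption about the mechanism, not a reconstruction of the actual commands. It implements a plain Benjamini-Hochberg adjustment that leaves missing values as `None`, then contrasts a count that excludes them with one that lets them through. The function names and p-values are hypothetical.

```python
from typing import List, Optional


def bh_adjust(pvalues: List[Optional[float]]) -> List[Optional[float]]:
    """Benjamini-Hochberg adjustment; None entries are skipped and stay None."""
    indexed = [(p, i) for i, p in enumerate(pvalues) if p is not None]
    m = len(indexed)
    adjusted: List[Optional[float]] = [None] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # taking a running minimum so adjusted values stay monotone.
    for rank, (p, i) in zip(range(m, 0, -1), sorted(indexed, reverse=True)):
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(padj: List[Optional[float]], cutoff: float) -> int:
    """Count genes below the cutoff, excluding missing values."""
    return sum(1 for p in padj if p is not None and p < cutoff)


def count_with_missing_leaking(padj: List[Optional[float]], cutoff: float) -> int:
    """Mimics a subset that also keeps rows whose adjusted p-value is missing."""
    return sum(1 for p in padj if p is None or p < cutoff)


raw = [0.001, None, 0.04, 0.03, None, 0.5]  # hypothetical p-values
padj = bh_adjust(raw)

correct_count = count_significant(padj, 0.1)
inflated_count = count_with_missing_leaking(padj, 0.1)
print(correct_count, inflated_count)
```

Counting with an explicit check for missing values, as in `count_significant`, avoids the problem regardless of how the subsetting is done.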

          I agree with you entirely about the technical replicate issue. That makes complete sense.

          -Clayton


          • sdm
            Junior Member
            • Oct 2009
            • 9

            #6
            edgeR - pvalue NA

            Hi !

            Though it is not directly related to your question, I thought I would post this to an "edgeR" thread: using edgeR, I usually get a high number of p-values reported as "NA" for genes that look reasonably differentially expressed. Why is that (is it due to TMM normalization?), and is there any way to get around it?

