  • DESeq and EdgeR: too many differentially expressed genes!?!?

    Hello,

    For a class project, I am mapping reads and counting reads per gene with RSEM for five different published RNA-seq datasets located at GEO / SRA. Most of the datasets were from raw RNA-seq data. Then, I am taking the outputs of RSEM and using them as inputs for DESeq and EdgeR. The goal of the project is to simply compare EdgeR and DESeq results and try to understand why they are different.

    My problem is that for 4/5 of the datasets, I am seeing far too many differentially expressed genes, even after filtering low count data.

    I applied some heavy-duty arbitrary non-specific filtering and it still did not help. In this filtration, I took the max number of reads for each sample (in columns) and sorted the genes (in rows) by their maxes from highest to lowest. Then, I removed the bottom 75% of the data. Additionally, from the remaining 25% of this data, I filtered out genes where the ranks (with rank #1 having the highest number of reads) of each sample were all under 1000. I did this secondary filter to remove genes with excessive numbers of reads. This secondary filter did not help at all.
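For readers following along, the two-step filter described above could be sketched roughly as below. This is a Python illustration, not the original script. It assumes that "max" means each gene's highest count across samples, that ranks are computed over all genes before the first cut, and that "under 1000" means a rank number below 1000. The function names and the `counts` layout are hypothetical.

```python
def rank_in_sample(counts, sample):
    """Map gene -> rank within one sample (rank 1 = most reads)."""
    ordered = sorted(counts, key=lambda g: counts[g][sample], reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def two_step_filter(counts, keep_fraction=0.25, rank_cutoff=1000):
    """counts: dict gene -> list of read counts, one entry per sample."""
    if not counts:
        return []
    # Step 1: sort genes by their highest count in any sample and
    # keep only the top fraction (the bottom 75% is dropped by default).
    by_max = sorted(counts, key=lambda g: max(counts[g]), reverse=True)
    kept = by_max[:int(len(by_max) * keep_fraction)]
    # Step 2: drop genes whose rank is under the cutoff in every sample,
    # i.e. genes with very high counts across the board.
    n_samples = len(next(iter(counts.values())))
    ranks = [rank_in_sample(counts, s) for s in range(n_samples)]
    return [g for g in kept if not all(r[g] < rank_cutoff for r in ranks)]
```

Note that both steps select genes by count level only; neither looks at how much a gene differs between the two conditions.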

    Despite this filtering, I cannot seem to get a reasonable number of differentially expressed genes. I am getting too many genes as mentioned earlier. There are 2 biological replicates for two different conditions in each of my datasets. The 4 datasets that are causing a problem are comparing gene expression between 1) embryo and endosperm in arabidopsis seeds, 2) benign and malignant prostate cancer cell lines, 3) mouse embryonic stem cells and mouse embryonic fibroblasts, and 4) kidney and liver tissue in human. I downloaded these datasets from GEO / SRA in sra.lite format and converted them to fastq. They were all single-read data.

    The RNA-Seq dataset that seems to be working is comparing gene expression in estrogen-treated and untreated breast cancer cell lines. I obtained about 500 to 1000 differentially expressed genes for adjusted pvalues between 0.01 and 0.1 by FDR (BH) for both packages. Also, for this dataset, only a bed file was available at GEO. So, I just wrote custom perl scripts to convert the bed file to a fasta file to input it into RSEM (I extracted 32 base reads from the start coordinates in the genome with the proper orientation). It is also important to note that I excluded reads that mapped to rRNA according to the bed file. The bedfile basically defined rRNA as a chromosome.
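The original conversion was done with custom perl scripts that are not shown. As a rough idea of what such a step might look like, here is a Python sketch. It assumes tab-separated BED records with the strand in column 6, treats chromStart as 0-based, reverse-complements minus-strand reads over the same window, and uses a hypothetical chromosome name `rRNA` for the reads to exclude.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def bed_to_fasta(bed_lines, genome, read_len=32, skip_chroms=("rRNA",)):
    """Yield one FASTA record per usable BED line.

    genome: dict mapping chromosome name -> sequence string.
    skip_chroms: chromosome names to exclude (name is an assumption).
    """
    for i, line in enumerate(bed_lines):
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[0], int(fields[1])
        strand = fields[5] if len(fields) > 5 else "+"
        if chrom in skip_chroms or chrom not in genome:
            continue
        seq = genome[chrom][start:start + read_len]
        if len(seq) < read_len:
            continue  # window runs off the end of the chromosome
        if strand == "-":
            seq = reverse_complement(seq)
        yield f">read_{i}\n{seq}"
```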

    I thought it interesting to note that when I did not filter the low count data from the estrogen RNA-seq dataset, DESeq reported around 7000 genes for the pvalue range mentioned above whereas edgeR still reported around 500 to 1000 genes. In the unfiltered dataset there were ~27,000 genes.

    I know this is a lot of information, but if anyone has any suggestions or advice on how to deal with this issue, I would greatly appreciate it. My project is due Thursday, and I am getting tied up in the preprocessing and not the analysis, which was the whole point of the project.

    Thanks,
    Clayton
    Purdue University
    Graduate Student
    Last edited by cutcopy11; 11-28-2011, 03:13 PM.

  • #2
    Also, I would like to mention that the liver and kidney samples were derived from data
    that was used in the EdgeR vignette.

    I am not sure how their gene count data set was generated.

    I used data from these sources:
    kidney

    liver


    Each of these accession numbers contains 2 sra.lite files.
    They might be technical replicates. I am not sure.

    Anyway, in edgeR I did a moderated tagwise dispersion. Like in the
    vignette, I set commonDisp = FALSE in the "exactTest" step and used the default
    prior in the "estimateTagwiseDisp" step.

    Here is a comparison of the results:
    EdgeR Vignette:
    4438 under an adjusted p-value of 0.05
    My result:
    5977 under an adjusted p-value of 0.05
    5727 under an adjusted p-value of 0.01

    Although the numbers at an adjusted p-value of 0.05 are fairly close, I feel that
    the number of differentially expressed genes under 0.01 should be much
    smaller.

    Thanks again,
    Clayton


    • #3
      The high number of differentially expressed genes in the 4 datasets may be due to the fact that I am essentially comparing apples and oranges.

      The four datasets consist of either two different tissues or two different cell lines.
      The estrogen dataset is from the same cell line, but one sample is treated whereas the other is not. So, there is likely less differential expression.

      The most differential expression occurs between the kidney and liver, the two different prostate cancer cell lines, and the two arabidopsis tissues (over 5000 genes).

      Interestingly, between the mouse embryonic stem cells and the mouse embryonic fibroblasts, there were 1000 and 2000 differentially expressed genes. These two cell types are likely more similar to each other than the pairs in the three comparisons above.


      • #4
        Maybe concentrate on one of the data sets and tell us a bit more about what you did. For example, post the exact commands you typed into R.

        It is also important to figure out whether you have true biological replicates. If you compare, say, two technical replicates from liver with two technical replicates from kidney, you will end up with a huge list of differentially expressed genes, which, however, is biologically completely meaningless.
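To put rough numbers on the replicate point above: both edgeR and DESeq model counts with a negative binomial distribution, whose variance is mean + dispersion × mean². Technical replicates have a dispersion close to zero (Poisson noise only), so even a small difference between tissues is many standard deviations wide. The sketch below is an illustration only; the mean of 1000, the 1.2-fold difference, and the biological dispersion of 0.1 are assumed values, not estimates from these datasets.

```python
import math


def nb_sd(mean, dispersion):
    """Standard deviation of a negative binomial count.

    Uses var = mean + dispersion * mean**2; dispersion = 0 is Poisson.
    """
    return math.sqrt(mean + dispersion * mean ** 2)


mean = 1000.0
diff = 200.0  # a 1.2-fold difference between the two tissues (assumed)

for label, phi in [("technical replicates (dispersion 0)", 0.0),
                   ("biological replicates (dispersion 0.1, assumed)", 0.1)]:
    sd = nb_sd(mean, phi)
    print(f"{label}: sd = {sd:.1f}, difference = {diff / sd:.1f} sd")
```

With these assumed values the same 200-read difference is about 6 standard deviations under Poisson noise but less than 1 standard deviation once biological variability is included.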


        • #5
          Hi Simon,

          I may dig up my commands later, but I just wanted to say that my earlier statement was incorrect: I said that DESeq reported around 7000 genes for the unfiltered estrogen dataset whereas edgeR reported around 500 to 1000 genes for adjusted pvalues between 0.01 and 0.1. I realized later that many of those genes in the DESeq results had adjusted p values listed as "NA". For an adjusted pvalue cutoff of 0.01, DESeq reported 486 genes and edgeR reported 509, which makes sense as DESeq is more conservative with low count data. Of course, you know that.
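The exact commands behind the miscount are not shown, so the following is only a sketch of one way to count significant genes so that missing adjusted p-values are never included. It is written in Python rather than R, and the function name and example values are hypothetical.

```python
import math


def count_significant(padj, cutoff):
    """Return (n_significant, n_missing) for a list of adjusted p-values.

    Missing values may be given as None or NaN and are reported
    separately; they are never counted as significant.
    """
    def is_missing(p):
        return p is None or math.isnan(p)

    missing = sum(1 for p in padj if is_missing(p))
    significant = sum(1 for p in padj if not is_missing(p) and p < cutoff)
    return significant, missing


padj = [0.001, None, 0.2, float("nan"), 0.009, None]
print(count_significant(padj, 0.01))  # (2, 3)
```

Reporting the number of missing values alongside the count makes it obvious when a large share of the genes never received an adjusted p-value.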

          I agree with you entirely about the technical replicate issue. That makes complete sense.

          -Clayton


          • #6
            edgeR - pvalue NA

            Hi !

            Though it is not directly related to your question, I thought I would post this in an "edgeR" thread: using edgeR, I usually get a fairly high number of p values reported as "NA" for genes that look reasonably differentially expressed. Why is that (is it due to TMM normalization?), and is there any way to get around this?
