  • very different numbers of differentially expressed genes by DESeq

    Hi, I am using DESeq to identify DE genes from mouse RNA-seq datasets. When I used estimateDispersions( cds ), fewer than 20 genes were identified. When I changed the setting to estimateDispersions( cds, sharingMode="fit-only" ), I got about 1000 genes. Could someone help me understand why the results are so different? Are they really meaningful? By the way, I got only about 10 genes when I used cuffdiff. The mapping files (BAM) were generated by tophat2 and processed by htseq-count for the countTable. I had 5 biological replicates for the samples and 3 biological replicates for the controls. Thanks a lot.

  • #2
    Read the vignette and other information:

    After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to those of this gene). The sharingMode argument specifies which of these two values will be written to the featureData's disp_ columns and hence will be used by the functions nbinomTest and fitNbinomGLMs.

    fit-only - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability.

    maximum - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates.

    gene-est-only - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.
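
To make the three modes concrete, here is a minimal Python sketch of the selection rule described above. It is not DESeq's code: the function name `shared_dispersion` and the example dispersion values are hypothetical, and it only illustrates which of the two per-gene values each mode keeps.

```python
def shared_dispersion(empirical, fitted, sharing_mode="maximum"):
    """Return the dispersion value a given sharing mode would keep for one gene.

    empirical: dispersion estimated from this gene's data alone
    fitted:    value from the dispersion-mean fit at this gene's expression level
    """
    if sharing_mode == "fit-only":
        return fitted                      # empirical value is ignored
    if sharing_mode == "maximum":
        return max(empirical, fitted)      # conservative choice
    if sharing_mode == "gene-est-only":
        return empirical                   # no sharing across genes
    raise ValueError(f"unknown sharing mode: {sharing_mode!r}")


# Hypothetical (empirical, fitted) dispersion pairs, for illustration only.
genes = {
    "typical_gene": (0.05, 0.06),
    "quiet_gene": (0.01, 0.06),    # empirical value below the fit
    "outlier_gene": (0.60, 0.06),  # unusually variable gene
}

for mode in ("fit-only", "maximum", "gene-est-only"):
    chosen = {name: shared_dispersion(e, f, mode) for name, (e, f) in genes.items()}
    print(mode, chosen)
```

The outlier gene is the interesting case: "fit-only" assigns it the small fitted value, while "maximum" keeps its large empirical value.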


    The default I believe is "maximum". Using the "fit-only" argument increases the number of false positives.

    You have 3 and 5 replicates, so you should stick to "maximum". The additional DE genes you get from "fit-only" are most likely all false positives.
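
A rough way to see why "fit-only" can produce so many more hits: under the negative binomial model, the variance of a count is mean + dispersion * mean^2. The Python sketch below uses made-up numbers (not taken from this thread's data) to compare the variance assumed for a highly variable gene under "maximum" and under "fit-only". The function name `nb_variance` is hypothetical.

```python
import math


def nb_variance(mean, dispersion):
    """Negative binomial variance: Poisson part plus overdispersion part."""
    return mean + dispersion * mean ** 2


# Hypothetical values for one dispersion-outlier gene.
mean_count = 100.0
empirical = 0.60   # estimated from this gene alone
fitted = 0.06      # typical value for genes at this expression level

var_maximum = nb_variance(mean_count, max(empirical, fitted))
var_fit_only = nb_variance(mean_count, fitted)

print("SD assumed under 'maximum': ", round(math.sqrt(var_maximum), 1))
print("SD assumed under 'fit-only':", round(math.sqrt(var_fit_only), 1))
```

With the smaller assumed variance, the same observed difference between conditions looks much more significant, so a gene that is simply noisy can be called differentially expressed.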
    Last edited by chadn737; 08-22-2012, 08:09 AM.

    • #3
      Thanks a lot for your reply. You are right that fit-only potentially increases the number of false positives. However, I didn't expect the difference to be so big.
      I also applied the same analysis to those 5 replicate samples against another 5 replicate samples from a different biological condition. With the default setting, I got zero significant DE genes. With the setting changed to fit-only, I got about 700 genes. I am attaching a graphic of the estimated dispersions. Does it show very high variability? If so, could that justify using fit-only? Thanks.