  • [EdgeR Analysis] P-value Distribution

    1. I ran an edgeR analysis to find genes that are differentially expressed across time points.

    2. I obtained p-values for all genes and plotted them as a histogram.

    3. However, the p-value distribution shows an extreme peak in the range (0.70, 0.71); please see the attached file. I think this means that a large number of genes are concentrated in this narrow range of p-values, and I am trying to figure out what causes it. I looked at the corresponding read counts but could not find any particular pattern.

    In what cases does such a peak in the p-value distribution occur?

    I would really appreciate any tips.
    Thank you in advance.

  • #2
    Usually, this is the effect of many genes with small count values. Maybe you have a lot of genes with, say, 3 reads in total over all replicates from group A, and 1 read in all replicates in group B, and this ratio always gives exactly the same p value. Plotting p values against total read counts (i.e., against the row sums of the count matrix) is often helpful to understand such histograms.

    So, no, this peak is not that unusual and will not explain why you see no significance in your results.
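
The discreteness effect described in this reply can be reproduced with a toy calculation. The sketch below uses a plain two-sided exact binomial test on the two group totals, which is a simplification and not edgeR's negative binomial exactTest. The function names and the assumption of equal library sizes (p = 0.5) are illustrative, and the gene list is made up. It shows that a gene with only a handful of reads can produce only a few distinct p-values, so many low-count genes land in the same histogram bin.

```python
from collections import Counter
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Two-sided exact binomial p-value: total probability of all outcomes
    that are no more likely than the observed one (k of n reads in group A)."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= cutoff))


def possible_p_values(n, p=0.5):
    """All distinct p-values a gene with n reads in total can produce."""
    return sorted({round(binom_two_sided_p(k, n, p), 12) for k in range(n + 1)})


# A gene with 3 reads in group A and 1 read in group B (4 reads in total).
p_3_vs_1 = binom_two_sided_p(3, 4)

# Hypothetical low-count genes as (reads in A, reads in B); many share a split.
toy_genes = [(3, 1)] * 5 + [(1, 3)] * 2 + [(2, 2)] * 3 + [(4, 0)]
p_counts = Counter(round(binom_two_sided_p(a, a + b), 12) for a, b in toy_genes)

if __name__ == "__main__":
    print("p-value for 3 vs 1 reads:", p_3_vs_1)
    print("possible p-values with 4 reads in total:", possible_p_values(4))
    print("genes per distinct p-value:", dict(p_counts))
```

With real data, the analogous check is the plot suggested in this reply: p-value against row sum, where low totals should show up as a few horizontal bands.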


    • #3
      Thank you for the prompt reply.
      Last edited by syintel87; 05-02-2013, 02:42 PM.


      • #4
        Thank you for the prompt reply.

        I ran a pairwise exact test with the command exactTest( data , pair=c("T1", "T2") , dispersion = "tagwise" ) and then plotted the p-value distribution from this test. The attachment shows part of the result table, with the p-value in the first column and the read counts in the other columns.

        Would you please have a look at the attached file?

        • #5
          Originally posted by Simon Anders
          Usually, this is the effect of many genes with small count values. Maybe you have a lot of genes with, say, 3 reads in total over all replicates from group A, and 1 read in all replicates in group B, and this ratio always gives exactly the same p value. Plotting p values against total read counts (i.e., against the row sums of the count matrix) is often helpful to understand such histograms.

          So, no, this peak is not that unusual and will not explain why you see no significance in your results.

          I did what you suggested and summed the counts over the replicates within each group. However, genes with the same ratio of one group sum to the other do not seem to have the same p-value. Could you advise me on what is wrong with the attached table that I generated? Also, do you see any particular pattern among the genes with p-values in the range (0.70, 0.71)?

          Thank you in advance.

          • #6
            I've heard that ideally a p-value distribution should look like the one in the attached file.
            Even though I can see why and how some genes get the same p-value, an extreme peak means that my p-value distribution deviates from the ideal one.
            How, then, can I explain this sudden peak with some biological insight (e.g. correlation between genes) rather than with just a mathematical formula or calculation?

            • #7
              I drew two plots:
              1) p-value on the x-axis and counts on the y-axis.
              2) p-value on the x-axis and logCPM on the y-axis.

              I wonder
              1) why points that are very close to 0 on the y-axis span the range from 0 to 0.8 on the x-axis. That is, genes whose ratio is close to 0 are spread over many p-values. This implies that genes with clearly different read counts between groups have diverse p-values, but I would expect genes whose ratio of one sum to the other is close to 0 to have low p-values.
              2) why the plot shows a positive relationship in the lower part and a negative relationship in the upper part. This implies that even a gene with a ratio close to 1 could have a very low p-value.

              I am very curious about this plot and would really appreciate any tips on how to interpret it.
              Thank you in advance.

              • #8
                I don't think making many plots of p values will help you. What's up with your library sizes? It seems that most genes in T2 have only a very few counts.


                • #9
                  Originally posted by Simon Anders
                  I don't think making many plots of p values will help you. What's up with your library sizes? It seems that most genes in T2 have only a very few counts.
                  1. I made several plots to get higher resolution; the pictures are meant to be assembled into one complete figure. There is a positive relationship in the range (0, 0.001) on the y-axis and a negative relationship in the range (0.001, 1.000) on the y-axis.

                  2. The table below shows the library sizes.
                  group   lib.size   norm.factors
                  T1      22705534   10.53656319
                  T1      24463594    8.27944152
                  T2      11440163    0.01852953
                  T2        178857    1.23101359
                  T3       2232541    0.28527335
                  T3         90552    4.29918424
                  T3        855331    0.40975614

                  What insight can I gain from the plot of p-value against count and from the p-value distribution?

                  Thank you as always.

                  • #10
                    You have hardly any useful reads for T2! It seems to be about two orders of magnitude less than in T1.
                    I doubt that you have enough data on T2 to perform any inference.
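
To put numbers on this, the sketch below recomputes sizes from the table posted in #9. It assumes edgeR's usual convention that the effective library size is lib.size multiplied by norm.factors; the variable names are illustrative and nothing here calls edgeR itself.

```python
# (group, lib.size, norm.factors) as posted in #9.
samples = [
    ("T1", 22705534, 10.53656319),
    ("T1", 24463594, 8.27944152),
    ("T2", 11440163, 0.01852953),
    ("T2", 178857, 1.23101359),
    ("T3", 2232541, 0.28527335),
    ("T3", 90552, 4.29918424),
    ("T3", 855331, 0.40975614),
]

# Effective library size = lib.size * norm.factors (assumed edgeR convention).
effective = [(group, lib * nf) for group, lib, nf in samples]

if __name__ == "__main__":
    for (group, lib, nf), (_, eff) in zip(samples, effective):
        print(f"{group}: lib.size = {lib:>10,d}  effective ~ {eff:>14,.0f}")
```

Under that assumption, both T2 samples come out at roughly 2 × 10^5 effective reads despite their very different raw library sizes, which is about two orders of magnitude below the raw T1 library sizes.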


                    • #11
                      Originally posted by Simon Anders
                      You have hardly any useful reads for T2! It seems to be about two orders of magnitude less than in T1.
                      I doubt that you have enough data on T2 to perform any inference.
                      The reason for the small number of reads is that the data come from an infecting worm. The T1 data were extracted at the worm's egg stage, while the T2 data were extracted from the infected host at the next time point. This might have caused the small library size.

                      So do you mean that I cannot draw any significant conclusions from this data set?


                      • #12
                        Why does this mean that you get fewer reads? The number of reads you get from a sequencing lane is typically independent of the sample.

                        Or do you mean that in T2, the vast majority of your reads map to the host, and only a percent or so map to the worm genome? You know, this is the kind of detail you should mention when you start such a thread.

                        Also, why are the two T2 samples so different? (Once high library size but low normalization factor, once the other way round.)


                        • #13
                          Thank you so much for the advice.


                          • #14
                            Originally posted by Simon Anders
                            Why does this mean that you get fewer reads? The number of reads you get from a sequencing lane is typically independent of the sample.

                            Or do you mean that in T2, the vast majority of your reads map to the host, and only a percent or so map to the worm genome? You know, this is the kind of detail you should mention when you start such a thread.

                            Also, why are the two T2 samples so different? (Once high library size but low normalization factor, once the other way round.)
                            1. Yes, I meant that in T2 the vast majority of my reads map to the host, and only a percent or so map to the worm genome. I am sorry for not having mentioned that.

                            2. Actually, I have seven time points: egg, juvenile, t1, t2, t3, t4, and t5.
                            Since there were no replicates, I assigned
                            - (egg and juvenile) to T1,
                            - (t1 and t2) to T2,
                            - (t3, t4, t5) to T3.
                            I guess this might have caused the very different library sizes.
                            Is this the wrong approach?


                            • #15
                              If for most of your time points you have only a few tens of thousands of usable reads, because most of your reads were used up by host mRNA, you may have too little usable data, and the thing to do might be to improve on the wet-lab side by finding a better way to separate worm from host tissue.

                              If you want to go on working with your data to see whether you can see at least a few things, maybe start by making scatter plots of raw counts (each sample against each other sample).

                              Your pseudo-replication scheme is not that good, either: The general idea of inference in a two-group comparison is to find genes that show stronger differences between groups than within groups. So, you are now looking for genes that change more strongly between T1 and T2 than between either egg and juvenile or between t1 and t2. Why should the changes from juvenile to t2 be stronger than from egg to juvenile?
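
The point about pseudo-replication can be illustrated with a toy time course. In the sketch below all counts and library sizes are made up, and the helper names `cpm` and `step_changes` are hypothetical. It follows one gene whose relative abundance halves at every stage and compares the step inside each pseudo-group with the step across the group boundary.

```python
from math import log2

# Hypothetical time course for one gene: (stage, pseudo-group, count, library size).
# All numbers are made up; the gene halves in relative abundance at every step.
time_course = [
    ("egg", "T1", 800, 2_000_000),
    ("juvenile", "T1", 400, 2_000_000),
    ("t1", "T2", 20, 200_000),
    ("t2", "T2", 10, 200_000),
]


def cpm(count, lib_size):
    """Counts per million reads."""
    return count / lib_size * 1e6


def step_changes(samples):
    """Absolute log2 fold change in CPM between consecutive stages, labelled
    as 'within' a pseudo-group or 'between' two pseudo-groups."""
    steps = []
    for (s1, g1, c1, n1), (s2, g2, c2, n2) in zip(samples, samples[1:]):
        change = abs(log2(cpm(c2, n2) / cpm(c1, n1)))
        kind = "within" if g1 == g2 else "between"
        steps.append((f"{s1}->{s2}", kind, change))
    return steps


steps = step_changes(time_course)

if __name__ == "__main__":
    for name, kind, change in steps:
        print(f"{name:15s} {kind:8s} |log2FC| = {change:.2f}")
```

For a gene that drifts steadily like this, the change inside each pseudo-group is as large as the change across the group boundary, so the within-group variation that the test treats as noise is really the same biological trend.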
