
  • How to choose a FPKM cut-off

    Dear all,
    I have a problem choosing an FPKM cut-off when using Cufflinks. Could anyone give some detailed suggestions on how to choose an appropriate FPKM cut-off to judge whether a gene is expressed or not? My RNA-seq data (90PE, i.e. 90 bp paired-end, generated on an Illumina HiSeq 2000) has about 40 million reads per sample.
    Thanks.

  • #2
    Maybe I have not understood this question correctly, but what I really want is to find genes that are specifically expressed in distinct samples.
    Thanks for any suggestions.



    • #3
      My lab includes Ambion ERCC spike-ins in our RNA samples. We use those to judge how far down expression is still quantitative.



      • #4
        That's really not an answerable question. Lowly expressed genes could be important, especially for things like transcription factors. So it's pretty hard to come up with a number that matches biological significance.

        Anyway, what is it you're trying to get from this RNAseq data? If you're looking for candidates to investigate further, I would suggest coming up with some sort of ranking based on adjusted p-value, fold change, FPKM (for cutoffs or rankings I often just use the highest FPKM across all the samples), GO classes, any biological knowledge currently available, or anything else you can think of.
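One way to read the ranking suggestion above is sketched here. This is only an illustration under assumptions: the `GeneResult` record, the `rank_candidates` helper, the 0.05 adjusted p-value limit, and the toy values are all hypothetical and are not Cufflinks or cuffdiff output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GeneResult:
    """Hypothetical per-gene summary of a differential expression test."""
    gene: str
    adj_p: float                      # adjusted p-value
    log2_fc: float                    # log2 fold change between conditions
    fpkm_by_sample: Dict[str, float]  # FPKM in each sample


def rank_candidates(results: List[GeneResult], max_adj_p: float = 0.05) -> List[GeneResult]:
    """Keep genes passing the adjusted p-value limit, then rank them by
    absolute fold change, breaking ties with the highest FPKM in any sample."""
    kept = [r for r in results if r.adj_p <= max_adj_p]
    return sorted(
        kept,
        key=lambda r: (-abs(r.log2_fc), -max(r.fpkm_by_sample.values())),
    )


# Toy data, invented for illustration only.
toy = [
    GeneResult("geneA", 0.010, 1.0, {"s1": 5.0, "s2": 2.5}),
    GeneResult("geneB", 0.200, 6.0, {"s1": 0.1, "s2": 6.4}),
    GeneResult("geneC", 0.001, -3.0, {"s1": 8.0, "s2": 1.0}),
]
for r in rank_candidates(toy):
    print(r.gene, r.adj_p, r.log2_fc, max(r.fpkm_by_sample.values()))
```

GO classes and prior biological knowledge do not reduce to a sort key this easily, so in practice they would probably be applied as a manual review step on the ranked list.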



        • #5
          I agree that lowly expressed genes can be important; however, it is STILL IMPORTANT TO ME TO UNDERSTAND WHETHER AN FPKM < 0.0001 MAKES ANY SENSE. Example: you have two conditions (A and B) with 3 replicates each, and for a lowly expressed gene you get the following FPKM values:
          A1: 0.00001 A2: 0.000012 A3: 0.000015 and B1: 0.001 B2: 0.0012 B3: 0.0015

          A statistical comparison (blind to biological meaning) will find that the biological replicates are consistent, so the gene will be accepted as differentially regulated (rejection of the null hypothesis).
          The fold change will also be big (100 in this example).
          However, although the difference is real in the library, it does not necessarily represent a difference in gene expression. It is known that changes in very abundant genes can affect the composition of the entire library, and poorly expressed genes will be the most affected. So the differences in this example could be the result of a change in expression of very abundant genes.
          I would still like to know how to select a meaningful cut-off.
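The two points in the post above can be checked with a few lines of arithmetic. The first part uses the FPKM values quoted in the post. The second part is a toy model of the composition effect, with molecule counts invented purely for illustration: a rare gene keeps the same absolute abundance in both conditions, yet its share of the library changes because one abundant gene does.

```python
from statistics import mean

# FPKM values quoted in the post.
fpkm_a = [0.00001, 0.000012, 0.000015]
fpkm_b = [0.001, 0.0012, 0.0015]
fold_change = mean(fpkm_b) / mean(fpkm_a)
print(f"fold change B/A: {fold_change:.1f}")


def library_fraction(molecules, gene):
    """Share of the library taken by one gene, which is what sequencing samples."""
    return molecules[gene] / sum(molecules.values())


# Hypothetical transcript pools: only the abundant gene changes between conditions.
cond_a = {"abundant": 9000, "other": 999, "rare": 1}
cond_b = {"abundant": 0, "other": 999, "rare": 1}

apparent_fc = library_fraction(cond_b, "rare") / library_fraction(cond_a, "rare")
print(f"apparent fold change of the unchanged rare gene: {apparent_fc:.1f}")
```

In this toy model the rare gene appears about tenfold "up" even though its absolute abundance is identical in both conditions.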



          • #6
            If you're worried about changes in the most highly expressed genes being a major driving influence, you can do an upper quartile normalization in cufflinks. That is the whole reason the developers created that option.

            Otherwise, I don't know that anyone will be able to tell you what you want to hear. Personally, I wouldn't worry about a differentially expressed gene under an FPKM of ~1 unless I had good reason to care about it. So it all comes back to why you want to set this cut-off in the first place. Are you coming up with candidates to screen for something? Is it just to create DE gene heatmaps and do GO/KEGG analysis? The biological reason for the cutoff is important. So I could help more if you answered my earlier question: what is it you're trying to get from this RNAseq data?



            • #7
              I'm comparing the responses of two genotypes to a hormone. I extracted total RNA from whole seedlings. I suspect that the mutant genotype has an altered response in the translational switch, so I cannot get rid of highly expressed genes (differences in ribosomal genes can be important to me). In addition, the mutant gene must be important in the meristem (which is a small population of cells within the whole organism). So you would expect tissue-specific, differentially expressed genes to be lowly represented in the library but still important to me.



              • #8
                First, I don't think the upper quartile normalization removes the highly expressed genes; it just doesn't use the reads that map to them for the "per million reads" part of the FPKM. That is different from giving cufflinks a GTF file of genes to completely exclude.
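The distinction drawn above can be illustrated with a toy calculation. This is a simplified sketch of the idea of upper-quartile scaling, not Cufflinks' actual implementation; the function names and the read counts are made up for illustration.

```python
from statistics import quantiles


def total_scale(counts):
    """Classic denominator: all mapped reads in the library."""
    return sum(counts.values())


def upper_quartile_scale(counts):
    """Alternative denominator: the 75th percentile of nonzero per-gene counts.
    The most abundant genes sit above this value, so they barely influence it."""
    nonzero = sorted(c for c in counts.values() if c > 0)
    return quantiles(nonzero, n=4, method="inclusive")[2]


# Hypothetical libraries: only the abundant gene differs between them.
lib_a = {"g1": 10, "g2": 20, "g3": 30, "g4": 40, "g5": 50, "g6": 60, "g7": 70,
         "abundant": 100_000}
lib_b = dict(lib_a, abundant=500_000)

print("total:", total_scale(lib_a), total_scale(lib_b))
print("upper quartile:", upper_quartile_scale(lib_a), upper_quartile_scale(lib_b))
print("abundant gene still present:", "abundant" in lib_b)
```

The abundant gene stays in the table and would still get an expression value; it simply stops dominating the scaling factor.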

                And for the broader biological question: if you're specifically interested in the meristem, why not do RNAseq specifically on that tissue (do they call it tissue in plants?). Or you could at least come up with candidates of lowly expressed, but differentially expressed, genes to do qPCR on in the meristem. Otherwise, how are you going to know that the DE call for a gene actually comes from changes in the meristem? Once you do this a couple of times, especially if you have some positive controls you already expect to be specific to the meristem and differentially expressed there, you might come up with an FPKM range to expect from similar genes. But of course you'd still have to validate this meristem connection.

                Sorry, I just don't see a very clear way to get to where you want to be from your current data.



                • #9
                  Wally, THANK YOU FOR YOUR TIME AND ANSWERS!
                  I had the wrong idea about quartile normalization! I thought that highly expressed genes were completely excluded from the analysis.
                  YES, you can call the meristem a tissue in plants. I gave up on the idea of taking samples from the meristem for technical reasons. You picked up on my intention: "identify differentially expressed putative meristematic candidates".
                  But going back to the bioinformatic problem:
                  You said that you would not care about genes with an FPKM of less than one. OK, maybe I will take that as a starting point, but why FPKM < 1?



                  • #10
                    I kind of think about this in two ways with the researchers I work with.

                    First, presumably, the FPKM value tells you something about how rare or frequent a gene's transcripts are within your sample. So it could be true, depending on your sample, that a gene is expressed at a level low enough that it's not really making an impact. That kind of assumes that there are expressed genes floating around serving absolutely no purpose, which is probably not the case. So although a gene has low expression, it could still be important, especially since the nature of experimentation is to learn about things previously unknown.

                    Second, there's a technical aspect here, but you need access to the read count data. Read counts below a certain level are pretty unreliable. For example, you might find very different counts for genes with fewer than 20 hits depending solely on which aligner you use and which method you use to count those hits. As the count levels go up, they tend to become more stable. So clearly this is directly tied to your total read depth in the experiment.

                    A quick example of the technical aspect:
                    Assume a 1,000 bp transcript. Experiment 1 has 5,000,000 total reads and this transcript received 5 hits. This calculates out to an FPKM of 1.0, but that FPKM is based on only 5 hits, which is entirely unreliable. Experiment 2 has 100,000,000 total reads and this transcript has 100 hits. This also calculates out to an FPKM of 1.0; however, this FPKM is much more reliable, as it's based on 100 hits, which is a more stable count level. Aligner error and count methods might change that count by only 5%, whereas the count of 5 could vary by 80% or more.
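The arithmetic in this example can be reproduced with the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The `poisson_relative_sd` helper is an added assumption and not part of the post: it shows pure sampling noise under a Poisson model, which is a different source of variation from the aligner and counting effects described above, but it points the same way.

```python
import math


def fpkm(fragments, length_bp, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)


def poisson_relative_sd(count):
    """Relative standard deviation of a Poisson count (assumed noise model)."""
    return 1 / math.sqrt(count)


# The two experiments from the post: same 1,000 bp transcript, same FPKM.
exp1 = fpkm(fragments=5, length_bp=1000, total_fragments=5_000_000)
exp2 = fpkm(fragments=100, length_bp=1000, total_fragments=100_000_000)
print(exp1, exp2)

# Sampling noise alone is far larger for the count of 5 than for the count of 100.
print(f"{poisson_relative_sd(5):.0%} vs {poisson_relative_sd(100):.0%}")
```

Both experiments report an FPKM of 1.0, so the FPKM value by itself cannot tell you which of the two measurements to trust.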

                    I believe that cuffdiff/cufflinks attempts to red-flag these kinds of things with its LOWDATA status, but I haven't really tried to confirm that.

                    So there IS a way to apply a cutoff to FPKMs: apply the technical limitations of your experiment to the count data. There isn't, however, a biologically relevant cutoff that I know of... nor, I'd think, would you want one.
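A minimal sketch of that idea: pick a minimum read count you consider stable and translate it into the FPKM it implies for a given transcript length and sequencing depth. The threshold of 20 echoes the "fewer than 20 hits" remark above but is only an illustrative choice, and both helper names are hypothetical.

```python
def min_reliable_fpkm(min_count, length_bp, total_reads):
    """FPKM that corresponds to `min_count` reads for this transcript and depth."""
    return min_count * 1e9 / (length_bp * total_reads)


def passes_count_cutoff(fpkm, length_bp, total_reads, min_count=20):
    """Back-calculate the read count implied by an FPKM and compare it to the cutoff."""
    implied_count = fpkm * length_bp * total_reads / 1e9
    return implied_count >= min_count


# A 1,000 bp transcript at the 40 million reads per sample mentioned in the first post.
print(min_reliable_fpkm(min_count=20, length_bp=1000, total_reads=40_000_000))

# The two experiments from the example: identical FPKM, different verdicts.
print(passes_count_cutoff(1.0, 1000, 5_000_000))
print(passes_count_cutoff(1.0, 1000, 100_000_000))
```

Because the implied count depends on transcript length, a cutoff derived this way is per transcript rather than a single FPKM value for the whole experiment.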
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */



                    • #11
                      Originally posted by Wallysb01
                      If you're worried about changes in the most highly expressed genes being a major driving influence, you can do an upper quartile normalization in cufflinks. That is the whole reason the developers created that option.

                      Otherwise, I don't know that anyone will be able to tell you what you want to hear. Personally, I wouldn't worry about a differentially expressed gene under an FPKM of ~1 unless I had good reason to care about it. So it all comes back to why you want to set this cut-off in the first place. Are you coming up with candidates to screen for something? Is it just to create DE gene heatmaps and do GO/KEGG analysis? The biological reason for the cutoff is important. So I could help more if you answered my earlier question: what is it you're trying to get from this RNAseq data?
                      Thank you for your kind answers. My data include 3 distinct cells. I am trying to identify cell-specific genes (expressed in only one cell).
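For the cell-specific goal stated here, one possible filter is sketched below. It is an assumption-laden illustration, not an established method: the `on` threshold of 1.0 borrows the rule of thumb mentioned earlier in the thread, the `off` threshold of 0.1 is arbitrary, and the gene names and FPKM values are invented.

```python
def cell_specific_genes(fpkm_table, on=1.0, off=0.1):
    """Return {gene: cell} for genes at or above `on` in exactly one cell
    and below `off` in every other cell."""
    specific = {}
    for gene, by_cell in fpkm_table.items():
        expressed = [cell for cell, value in by_cell.items() if value >= on]
        silent = [cell for cell, value in by_cell.items() if value < off]
        if len(expressed) == 1 and len(silent) == len(by_cell) - 1:
            specific[gene] = expressed[0]
    return specific


# Toy FPKM table for three cells (hypothetical values).
table = {
    "geneA": {"cell1": 12.0, "cell2": 0.0, "cell3": 0.05},   # specific to cell1
    "geneB": {"cell1": 3.0, "cell2": 2.5, "cell3": 0.0},     # expressed in two cells
    "geneC": {"cell1": 0.0, "cell2": 0.5, "cell3": 8.0},     # cell2 is ambiguous
}
print(cell_specific_genes(table))
```

Leaving a gap between the `on` and `off` thresholds keeps borderline values, such as geneC in cell2, from being called either way. The count-based reliability concern raised earlier in the thread applies to the "off" calls as well.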
