  • epi
    Member
    • Jan 2012
    • 38

    Cufflinks FPKM range

    Like many of you, I am seeing very high FPKMs in my Cufflinks results.
    From the literature, small genes and upper-quartile normalization seem to be involved. I do find this to be true, but I am also finding FPKMs as high as ~4000 for some other genes (e.g. a 3.5 kb gene). There are hundreds of such genes in the dataset. I visualized a few in IGV; they have a large number of reads, but certainly not as many as the FPKM suggests.

    Please comment on my questions as much as you can.

    1. Have you observed such cases, and what could be the reason for them?

    2. What range of FPKMs is normally observed? Is there a normal range at all?

    3. What should I do with the small novel genes that Cufflinks finds? Should I just ignore them? Are there any command-line settings to prevent them?

    4. For non-novel genes (from the GTF annotation) with such high FPKMs, would you exclude them from cuffdiff or include them?

    Thank you for responding
  • savova
    Junior Member
    • Aug 2011
    • 3

    #2
    I need an answer to this too...


    • sdriscoll
      I like code
      • Sep 2009
      • 436

      #3
      FPKMs are simply rate measurements. You could have a gene with an FPKM of 100 that only got 20 reads. It all depends on the last part of the normalization: per million mapped reads.

      There is no logical lower cutoff for FPKM below which you can say "these genes are not expressed", other than 0 of course.

      If you mean that most of the genes in your results seem right but a subset of them have higher FPKMs than others with similar amounts of coverage, then you're probably seeing an artifact of the Cufflinks pipeline. I have seen that many times myself for small genes such as single-exon ones. It doesn't make much sense. I recommend trying the -b option on cufflinks and/or cuffdiff. That enables the bias correction step within Cufflinks, and it seems to fix those erroneous FPKMs.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */
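
The rate interpretation in post #3 can be made concrete with the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). This is a minimal sketch, not what Cufflinks actually runs; Cufflinks estimates fragment assignments and effective lengths with a statistical model. The library sizes below are hypothetical, chosen only for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def fragments_implied(fpkm_value, transcript_length_bp, total_mapped_fragments):
    """Invert the formula: how many fragments would produce this FPKM?"""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fpkm_value * kilobases * millions


# Hypothetical small library: 20 fragments on a 1 kb gene, 200,000 mapped fragments.
small_library = fpkm(20, 1_000, 200_000)

# The 3.5 kb gene at FPKM ~4000 from post #1, assuming (hypothetically)
# 20 million mapped fragments in the library.
implied = fragments_implied(4_000, 3_500, 20_000_000)

print(f"FPKM from 20 fragments: {small_library:.1f}")
print(f"Fragments implied by FPKM 4000: {implied:,.0f}")
```

Inverting the formula like this is a quick sanity check: if the read pile-up seen in IGV is far below the implied fragment count, the FPKM is more likely an estimation artifact than real expression.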


      • savova
        Junior Member
        • Aug 2011
        • 3

        #4
        I have a different problem - all my RPKM values in one dataset are shifted by 10,000 with respect to another! Both are described as having been prepared the same way. I am pasting my message to the Cufflinks developers:

        I wanted to compare this dataset

        to available Encode datasets on other cell lines:


        My expression analysis with Cufflinks looks weird. In particular, the
        whole RPKM distribution seems to be shifted up for the samples of the first
        dataset (HMEC and HCC1954). For example, the minimum for both HMEC and H1HESC
        is 0, but the maxima are 3*10^9 and 3*10^4 respectively. So in log space, the
        average RPKM for the other cell lines is around 2-3, while for HMEC and HCC1954
        it is 10-12. At this point I went all the way back to the FASTQ files, realigned
        to hg19 with Bowtie, and used Cufflinks to compute RPKM; the difference remains. Any ideas why?

        It is true that one library may have more reads. But isn't FPKM supposed to normalize for the total number of reads in the library, and if so, how can the entire distribution be shifted?

        2) On another note, I also do not understand how I am getting some really small non-zero values from both datasets when the total number of reads would not seem to permit this:

        total reads HMEC_expression:
        2.2983e+10

        min HMEC_expression >0
        3.0939e-312


        I would really appreciate your help.
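
A quick check on the tiny value quoted in post #4, as a hedged sketch: it only compares the reported minimum with the limits of double-precision floating point and with the FPKM of a single whole fragment. The 1 Mb transcript length is a hypothetical worst case, and the "total reads" figure is taken at face value from the post.

```python
import sys

reported_min = 3.0939e-312   # smallest non-zero value quoted in the post
total_reads = 2.2983e10      # "total reads" figure quoted in the post

# Smallest positive *normal* double; anything between 0 and this is subnormal.
smallest_normal = sys.float_info.min
is_subnormal = 0.0 < reported_min < smallest_normal

# FPKM of one whole fragment on a hypothetical 1 Mb transcript in that library.
hypothetical_length_bp = 1_000_000
one_fragment_fpkm = 1 / ((hypothetical_length_bp / 1_000) * (total_reads / 1_000_000))

print("reported minimum is subnormal:", is_subnormal)
print("one-fragment FPKM is larger than reported minimum:", one_fragment_fpkm > reported_min)
```

A value below the smallest normal double cannot come from a whole number of fragments at any realistic transcript length. That is consistent with fractional fragment assignment or numerical underflow somewhere in the estimation, though the thread does not confirm the cause.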


        • sdriscoll
          I like code
          • Sep 2009
          • 436

          #5
          I've seen cuffdiff blow the read count normalization, but not cufflinks. In my case I saw a 10-fold increase in the baseline of one group's mean expression versus the other, causing almost all genes to be tagged as significantly misexpressed.

          Have you tried testing the different normalization options that Cufflinks provides, such as the --compatible-hits-norm option or the -N option for upper-quartile normalization?
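
One way to compare these options side by side is to build one command line per normalization setting and run each on the same alignment file. The sketch below only assembles the argument lists and does not execute anything. The file names are hypothetical, and the helper `cufflinks_command` is an illustration, not part of Cufflinks. The flags used are `-o` (output directory), `-b <genome.fa>` (bias correction, which needs the reference FASTA), `-N` (upper-quartile normalization) and `--compatible-hits-norm`.

```python
import shlex


def cufflinks_command(bam, out_dir, genome_fasta=None,
                      upper_quartile=False, compatible_hits=False):
    """Return an argv list for one cufflinks run (hypothetical helper)."""
    cmd = ["cufflinks", "-o", out_dir]
    if genome_fasta is not None:
        cmd += ["-b", genome_fasta]          # fragment bias correction
    if upper_quartile:
        cmd.append("-N")                     # upper-quartile normalization
    if compatible_hits:
        cmd.append("--compatible-hits-norm")
    cmd.append(bam)
    return cmd


# Hypothetical input files; replace with real paths before running.
bam, genome = "accepted_hits.bam", "genome.fa"
variants = {
    "default": cufflinks_command(bam, "cl_default"),
    "upper_quartile": cufflinks_command(bam, "cl_uq", upper_quartile=True),
    "compatible_hits": cufflinks_command(bam, "cl_compat", compatible_hits=True),
    "bias_corrected": cufflinks_command(bam, "cl_bias", genome_fasta=genome),
}

for name, cmd in variants.items():
    print(name, "->", " ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths are real. Writing every variant to its own output directory keeps the resulting tracking files comparable.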

          You can also look in the isoforms.fpkm_tracking files and check the "length" and "coverage" columns. You can roughly estimate the amount of raw read data aligned to each gene by multiplying those columns together (length times coverage gives aligned bases). Sum the column of products to get a rough "total bases aligned to genes" count, then divide the column by that number to roughly normalize the counts. Try that for each sample and see if you still have that massive offset between samples.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */
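
The length-times-coverage check described in post #5 could be sketched as follows. It assumes a tab-separated tracking file with `tracking_id`, `length` and `coverage` columns; the toy table is made up for illustration and includes a hypothetical row with a non-numeric coverage value to show how such rows are skipped.

```python
import csv
import io


def rough_normalized_bases(tracking_text):
    """Map tracking_id -> share of total aligned bases (length * coverage)."""
    reader = csv.DictReader(io.StringIO(tracking_text), delimiter="\t")
    aligned_bases = {}
    for row in reader:
        try:
            product = float(row["length"]) * float(row["coverage"])
        except ValueError:
            continue  # skip rows where length or coverage is not numeric
        aligned_bases[row["tracking_id"]] = product
    total = sum(aligned_bases.values())
    if total == 0:
        return {key: 0.0 for key in aligned_bases}
    return {key: value / total for key, value in aligned_bases.items()}


# Toy stand-in for one sample's tracking file (all values are made up).
toy_rows = [
    ["tracking_id", "length", "coverage", "FPKM"],
    ["T1", "1000", "10.0", "50"],
    ["T2", "500", "4.0", "20"],
    ["T3", "2000", "4.0", "20"],
    ["T4", "800", "-", "0"],
]
toy_text = "\n".join("\t".join(row) for row in toy_rows)

shares = rough_normalized_bases(toy_text)
for tracking_id, share in shares.items():
    print(tracking_id, round(share, 3))
```

Running this per sample and comparing the shares for the same transcripts shows whether the offset survives a normalization done entirely outside Cufflinks. For a real file, read its contents with `open(path).read()` and pass the text in.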


          • savova
            Junior Member
            • Aug 2011
            • 3

            #6
            Thanks, I will try this. But I am now worried that this software behaves erratically. Do you have any idea why the normalization blows up like this? Can I trust results that other people computed with this software?


            • sdriscoll
              I like code
              • Sep 2009
              • 436

              #7
              I don't use it as my primary quantification tool nor as my primary differential expression tool. I've never seen DESeq or edgeR blow the normalization step. We are only talking about a division step, so it doesn't make sense for any software to mess it up. To me Cufflinks is very desirable, but I don't trust it, so I don't use it. I have explored it quite a lot though, because I very much want to be able to use it.

              In your case it COULD be a result of the normalization being based on total reads aligned instead of the more robust upper-quartile method. But you should check the coverages to make sure. If your manual normalizations give you the same result, then you've got some small population of highly expressed genes biasing the normalization. The -N option should fix that, as should normalizing by the upper quartile of the genes' read counts. I'd also try the -b option, because it seems to help fix some other things Cufflinks does that make me not trust it. I still don't trust it though. Maybe I'm just not smart enough to understand it.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */
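
The point in post #7 about a few highly expressed genes biasing total-count normalization can be illustrated with a toy comparison. This is a sketch of the general idea only, not Cufflinks' implementation of -N, and the counts are made up: sample B equals sample A except for one gene with a very large count.

```python
import statistics


def total_count_factor(counts):
    """Scaling factor from the sum of all counts."""
    return sum(counts)


def upper_quartile_factor(counts):
    """Scaling factor from the 75th percentile of the non-zero counts."""
    expressed = [count for count in counts if count > 0]
    return statistics.quantiles(expressed, n=4)[2]


# Made-up per-gene read counts; only the last gene differs between samples.
sample_a = [10, 20, 30, 40, 50, 60, 70, 80]
sample_b = [10, 20, 30, 40, 50, 60, 70, 80_000]

# Ratio of the scaling factors between samples under each scheme.
total_ratio = total_count_factor(sample_b) / total_count_factor(sample_a)
uq_ratio = upper_quartile_factor(sample_b) / upper_quartile_factor(sample_a)

print(f"total-count factor ratio (B/A): {total_ratio:.1f}")
print(f"upper-quartile factor ratio (B/A): {uq_ratio:.1f}")
```

Under total-count scaling, the seven unchanged genes in sample B would all look roughly 200-fold lower than in sample A. The upper-quartile factor is identical for both samples here because the single outlier sits above the 75th percentile.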


              • caballien
                Junior Member
                • Nov 2012
                • 7

                #8
                Very low FPKM?

                sdriscoll-

                Nearly all of my FPKM values are very low. The median across all of my replicates is ~0.1, and I have between 50 and 60 million mapped reads per sample. Very few genes are above 10. See the attached graphs boxplot2.pdf and testdensity.pdf. Are these values too low, or are they, as you said, caused by a larger denominator and thus okay? Also, I've attached a PDF of a volcano plot, which is strange: I have ~870 significantly differentially expressed genes, but they all show up at the top of the graph where they don't belong (the p-values are not that small). Perhaps cummeRbund is just doing something improperly.

                The data are RNA-seq from ribosomal-depleted RNA; could this lower the FPKMs? I did mask all repetitive regions when running cuffdiff.

                The sequencing was performed on a HiSeq. The data were generated with the Tuxedo suite: TopHat 2, Cufflinks, Cuffmerge, cummeRbund.
