  • epi
    Member
    • Jan 2012
    • 38

    Cufflinks FPKM range

    I am observing high FPKMs in my Cufflinks results, as many of you are.
    After going through the literature, it seems that small genes and upper-quartile normalization may be involved. While I am finding this to be true, I am also seeing FPKMs as high as ~4000 for some other genes (e.g. a 3.5 kb gene). I have hundreds of such genes in the dataset. I visualized a few in IGV; they have a large number of reads, but certainly not as many as the FPKM suggests.

    Please comment on my questions as much as you can.

    1. Have you observed such cases, and what could be the reason for them?

    2. What is the normal range of FPKMs observed? Is there a normal range?

    3. What should I do with the small novel genes that Cufflinks finds? Should I just ignore them? Are there any command-line settings to prevent them?

    4. For non-novel genes (from the GTF annotation) with such high FPKMs, would you ignore them for cuffdiff or include them?

    Thank you for responding
  • savova
    Junior Member
    • Aug 2011
    • 3

    #2
    I need an answer to this too...

    • sdriscoll
      I like code
      • Sep 2009
      • 436

      #3
      FPKMs are simply rate measurements. You could have a gene with an FPKM of 100 that only got 20 reads. It all depends on that last part of the normalization: per million mapped reads.
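
      To make that arithmetic concrete, here is a minimal sketch in Python. It assumes the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments); the function name and all of the numbers are hypothetical and are not taken from anyone's data in this thread.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical example: a short 200 bp gene that received only 20 fragments.
# In a small library (1 million mapped fragments) its FPKM is about 100.
small_library = fpkm(20, 200, 1_000_000)

# The same gene with the same 20 fragments in a 50-million-fragment library
# has an FPKM of about 2, because only the denominator changed.
large_library = fpkm(20, 200, 50_000_000)

print(small_library, large_library)
```

      The same 20 reads give very different FPKMs depending on gene length and library size, which is why a raw read pile-up in a browser and the reported FPKM do not have to look proportional.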

      There is no logical bottom end cutoff for FPKM where you can say "these genes are not expressed", other than 0 of course.

      If you mean that most of the genes in your results seem right but a subset of them have higher FPKMs than others with similar amounts of coverage, then you're probably seeing an artifact from the Cufflinks pipeline. I have seen that many times myself for small genes, like the single-exon ones. It doesn't make much sense. I recommend trying the -b option on cufflinks and/or cuffdiff. That uses the bias correction pipeline within Cufflinks, and it seems to fix those erroneous FPKMs.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      • savova
        Junior Member
        • Aug 2011
        • 3

        #4
        I have a different problem - all my RPKM values in one dataset are shifted by 10,000 with respect to another! Both are described to have been prepared the same way. I am pasting my message to cufflinks developers:

        I wanted to compare this dataset

        to available Encode datasets on other cell lines:


        My expression analysis with Cufflinks is weird. In particular, it seems that the
        whole RPKM distribution is shifted up for the first dataset samples (HMEC and
        HCC1954) . For example, the minimum of both HMEC and H1HESC is 0, but the maximum
        is 3*10^9 and 3*10^4 respectively. So in log space, the average RPKM for
        the other cell lines is around 2-3, while for HMEC and HCC1954 it's 10-12. At this
        point I went all the way back to fastq, realigned to Hg19 with bowtie,
        and used cufflinks to compute RPKM - the difference remains. Any ideas why?

        It is true that one library may have more reads. But isn't FPKM supposed to normalize for the number of total reads in the library and if so how can the entire distribution be shifted?

        2) On another note, I also do not understand how I am getting some really small non-zero values from both datasets when the total number of reads would not seem to permit this:

        total reads HMEC_expression:
        2.2983e+10

        min HMEC_expression >0
        3.0939e-312


        I would really appreciate your help.

        • sdriscoll
          I like code
          • Sep 2009
          • 436

          #5
          I've seen cuffdiff blow the read count normalizations but not cufflinks. In my case I saw a 10-fold increase in the baseline of one group's mean expression versus the other, causing almost all genes to be tagged as significantly misexpressed.

          Have you tried testing the different normalization options that Cufflinks provides? Have you tried the --compatible-hits-norm option or the -N option for upper quartile normalization?

          You can also look in the isoforms.fpkm_tracking files and check the "length" and "coverage" columns. You can roughly estimate the raw read signal for each isoform by multiplying those columns together (length times coverage gives aligned bases, which is proportional to the read count). Sum the column of products to get a rough "total bases aligned to genes" count and divide the column by that number to roughly normalize the counts. Try that for each sample and see if you still have that massive offset between samples.
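
          The manual check described above could be sketched in Python roughly as follows. This is only an illustration: it assumes a tab-delimited fpkm_tracking file with "tracking_id", "length" and "coverage" columns, and the function name and the tiny inline sample are hypothetical stand-ins for a real file.

```python
import csv
import io


def rough_normalized_coverage(tracking_text):
    """Return {tracking_id: share of total aligned bases} from fpkm_tracking text."""
    reader = csv.DictReader(io.StringIO(tracking_text), delimiter="\t")
    bases = {}
    for row in reader:
        try:
            length = float(row["length"])
            coverage = float(row["coverage"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as "-"
        # length * coverage approximates the number of bases aligned to the isoform
        bases[row["tracking_id"]] = length * coverage
    total = sum(bases.values())
    if total == 0:
        return {}
    # Dividing by the column sum gives a crude per-sample normalization
    return {tid: b / total for tid, b in bases.items()}


# Hypothetical two-isoform sample; a real file has more columns, which are ignored here.
sample = (
    "tracking_id\tlength\tcoverage\tFPKM\n"
    "T1\t1000\t3\t10\n"
    "T2\t500\t2\t5\n"
)
shares = rough_normalized_coverage(sample)
print(shares)
```

          To run it on real output, read each sample's isoforms.fpkm_tracking into a string, pass it to the function, and compare the resulting shares between samples. If the large offset disappears, the problem is more likely in the normalization than in the alignments.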
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          • savova
            Junior Member
            • Aug 2011
            • 3

            #6
            Thanks, I will try this. But I am now worried that this software works erratically. Do you have any idea why the normalization blows up like this? Can I trust results that other people computed with this software?

            • sdriscoll
              I like code
              • Sep 2009
              • 436

              #7
              I don't use it as my primary quantification tool nor my primary differential expression tool. I've never seen DESeq or edgeR blow the normalization step. We are only talking about a division step, so it doesn't make sense for any software to mess it up. To me Cufflinks is very desirable, but I don't trust it, so I don't use it. I have explored it quite a lot though, because I very much want to be able to use it.

              In your case it COULD be a result of the normalization being based on total reads aligned instead of the more robust upper quartile method. But you should check the coverages to make sure. If your manual normalizations give you the same result, then you've got some small population of highly expressed genes biasing the normalization. The -N option should fix that, as should normalizing by the upper quartile of the genes' read counts. I'd also try the -b option, because it seems to help fix some other things that Cufflinks does that make me not trust it. I still don't trust it though. Maybe I'm just not smart enough to understand it.
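
              As a rough illustration of why a few highly expressed genes can bias total-count normalization while an upper-quartile scale stays stable, here is a small Python sketch. It is a simplified stand-in and not Cufflinks' actual -N implementation; the function names, the interpolation rule for the quartile, and the counts are all assumptions made for the example.

```python
def upper_quartile(values):
    """75th percentile of the nonzero values, using linear interpolation."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        raise ValueError("no nonzero counts")
    position = 0.75 * (len(nonzero) - 1)
    lower = int(position)
    upper = min(lower + 1, len(nonzero) - 1)
    fraction = position - lower
    return nonzero[lower] + fraction * (nonzero[upper] - nonzero[lower])


def scale_by_total(counts):
    """Divide every count by the sample's total count."""
    total = sum(counts)
    return [c / total for c in counts]


def scale_by_upper_quartile(counts):
    """Divide every count by the upper quartile of the nonzero counts."""
    uq = upper_quartile(counts)
    return [c / uq for c in counts]


# Hypothetical counts: sample_b equals sample_a except for one hugely expressed gene.
sample_a = [10, 20, 30, 40, 50]
sample_b = [10, 20, 30, 40, 5000]

# Total-count scaling: the unchanged genes look far lower in sample_b.
print(scale_by_total(sample_a)[:4])
print(scale_by_total(sample_b)[:4])

# Upper-quartile scaling: the unchanged genes get identical values in both samples.
print(scale_by_upper_quartile(sample_a)[:4])
print(scale_by_upper_quartile(sample_b)[:4])
```

              In this toy case the single outlier inflates the total from 150 to 5100, so every other gene in sample_b appears about 34 times lower under total-count scaling, while the upper quartile is 40 in both samples.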
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              • caballien
                Junior Member
                • Nov 2012
                • 7

                #8
                Very low FPKM?

                sdriscoll-

                Nearly all of my FPKM values are very low. The median across all of my replicates is ~0.1, and I have between 50 and 60 million mapped reads per sample. Very few genes are above 10. See the attached graphs boxplot2.pdf and testdensity.pdf. Are these values too low, or, as you said, are they caused by a larger denominator and thus okay? Also, I've attached a .pdf of a volcano plot, which is strange because I have ~870 significantly differentially expressed genes, but they all show up at the top of the graph where they don't belong (the p-values are not that small). Perhaps cummeRbund is just doing something improperly.

                The data are RNA-seq from ribosomal-depleted RNA. Could this lower the FPKMs? I did mask all repetitive regions when using cuffdiff.

                The sequencing was performed on a HiSeq. The data were generated with the Tuxedo package: TopHat 2, cufflinks, cuffmerge, cummeRbund.
                Attached Files
