  • Unusually high FPKM for cufflinks

    Hi,

    I have been working with Cufflinks v. 1.1.0 on mRNA-Seq data from some old (2008) Illumina runs with 36 bp reads. The only options I specified were -I 5000 and -b <refSeqFasta>; no reference GFF was specified. In the resulting transcripts.gtf, I'm getting unusually high FPKM values on the scale of tens to hundreds of thousands (e.g. FPKM=83456.4, 5571.5, 1017907.8) for several thousand transcripts.

    Some previous posts suggested short read length and the reference FASTA as possible culprits, but removing the -b option does not help. It is not a problem with the BAM format either, since SAM input gives similar results. I also tried the newer v. 1.3.0, and it gives similar values. I'm not sure whether short transcripts are being consistently inflated.

    Strangely, the older v. 0.9.3 gives more reasonable FPKM values (455.4 for the transcript that got 83456.4 above), which I'd like to trust since they are close to, though not identical to, manually calculated values.
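
For readers who want to repeat that kind of manual check, here is a minimal sketch of the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). It uses the raw transcript length with no fragment-length or bias correction, so it is not expected to reproduce Cufflinks' own estimates. The function name and the counts in the example are hypothetical, not taken from the poster's data.

```python
def fpkm(fragments_on_transcript, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments.

    Uses the raw transcript length; no effective-length or bias correction is applied.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return 1e9 * fragments_on_transcript / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 100 fragments on a 1,000 bp transcript,
# in a library with 10 million mapped fragments.
example_value = fpkm(100, 1000, 10_000_000)
print(example_value)
```

Comparing a value computed this way with the FPKM reported in transcripts.gtf gives a rough idea of how much any length correction is contributing.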

    However, I wonder why the new versions of Cufflinks are inflating the FPKM values by several orders of magnitude? Has anyone found a solution to this problem? Can I still use the new versions without causing such FPKM inflation?

    Thanks
    Last edited by flobpf; 02-13-2012, 01:21 PM.

  • #2
    High FPKM for small transcripts

    It is indeed true that small Cufflinks transcripts tend to have significantly inflated FPKMs. Anyone else seeing this?
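
One way to check this on your own output is to tabulate spliced transcript length against FPKM. The sketch below assumes the usual Cufflinks transcripts.gtf layout (tab-separated GTF with `transcript` and `exon` records, and `transcript_id` and `FPKM` in the attribute column). The function names, the thresholds, and the two sample records are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_lengths_and_fpkm(gtf_lines):
    """Return {transcript_id: (spliced_length_bp, fpkm)} from Cufflinks-style GTF lines."""
    lengths = defaultdict(int)
    fpkms = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        if feature == "exon":
            lengths[tid] += end - start + 1  # GTF coordinates are 1-based, inclusive
        elif feature == "transcript" and "FPKM" in attrs:
            fpkms[tid] = float(attrs["FPKM"])
    return {tid: (lengths[tid], value) for tid, value in fpkms.items()}


def short_and_high(records, max_length=300, min_fpkm=1000.0):
    """List transcript IDs that are both short and highly expressed (arbitrary cut-offs)."""
    return sorted(
        tid for tid, (length, value) in records.items()
        if length <= max_length and value >= min_fpkm
    )


# Made-up records: one short single-exon transcript, one longer two-exon transcript.
SAMPLE_GTF = [
    'chr1\tCufflinks\ttranscript\t1000\t1119\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "83456.4";',
    'chr1\tCufflinks\texon\t1000\t1119\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "83456.4";',
    'chr1\tCufflinks\ttranscript\t5000\t7999\t1000\t+\t.\tgene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "12.5";',
    'chr1\tCufflinks\texon\t5000\t5999\t1000\t+\t.\tgene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1"; FPKM "12.5";',
    'chr1\tCufflinks\texon\t7000\t7999\t1000\t+\t.\tgene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "2"; FPKM "12.5";',
]

records = transcript_lengths_and_fpkm(SAMPLE_GTF)
print(short_and_high(records))
```

On a real file, pass the open file handle instead of `SAMPLE_GTF`; if most of the flagged IDs are short transcripts, the inflation is length-related.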

    Last edited by flobpf; 02-13-2012, 02:34 PM.



    • #3
      Hi,

      In which mode did you run Cufflinks: with a reference file, in RABT mode, or de novo?



      • #4
        Originally posted by Nicolas
        Hi,

        In which mode did you run Cufflinks: with a reference file, in RABT mode, or de novo?

        Hi Nicolas,

        I used something close to RABT mode (~RABT) with single-end reads. The reads were first mapped to the reference genome using TopHat, and Cufflinks was run on the resulting accepted_hits.bam file. However, no reference GTF was specified.
        Last edited by flobpf; 02-15-2012, 08:08 AM. Reason: Not exactly RABT, not exactly denovo



        • #5
          Yes, we see this as well, as do other groups I have spoken to. It's pretty consistent from run to run.



          • #6
            A note about small transcripts and high FPKM: the reason you're seeing this is that, with a very small transcript, the fragments that map to it have to be short (no longer than the transcript itself), and thus often come from the tail of the library's fragment length distribution.

            What I mean by this is that if you plot a histogram of the length of each library fragment, there's usually a mean around 200-250 bp (depending on the protocol, and excluding adapters). Most fragments aren't much larger or much smaller than that, i.e. the variance is very small. However, there is a small fraction of fragments that are very short (100 bp or even smaller) or quite long (500-600 bp).

            Because these are rare, Cufflinks reasons that for the small transcript to have generated them, it must be very abundant. In fact, it probably generated many more fragments, most of which didn't make it through all of the size selection steps during library construction. So we "upscale" the FPKM to account for this effect. You can read about this correction in the supplement of the Cufflinks paper.

            The reason for the difference between 0.9.3 and 1.1.0 is that there were some problems in the actual implementation of the correction in 0.9.3, and we fixed them in later versions.

            While the correction (in our opinion) is a good thing to do, the bottom line is that standard RNA-Seq is really not the right assay for measuring small RNA expression, because the very nature of the size selection introduces a lot of error and variability in the sampling of fragments from these species. I'm actually considering adding another status flag (similar to HIDATA, FAIL, etc.) to warn users that their library's fragments are too large for reliable quantification of a particular transcript.
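
The effect described above can be illustrated with a small sketch. It assumes one common formulation of "effective length": the sum, over fragment lengths i no longer than the transcript, of P(i) times the number of positions a fragment of length i can occupy. Whether this matches the Cufflinks implementation in every detail is an assumption. The normal fragment-length distribution, its mean (200 bp) and standard deviation (40 bp), and all function names are hypothetical choices for illustration.

```python
import math


def fragment_length_pmf(mean, sd, max_len):
    """Discretised normal distribution over fragment lengths 1..max_len, normalised to sum to 1."""
    weights = [math.exp(-0.5 * ((i - mean) / sd) ** 2) for i in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_length(transcript_len, pmf):
    """Sum over fragment lengths i <= transcript_len of P(i) * (transcript_len - i + 1)."""
    longest = min(transcript_len, len(pmf))
    return sum(pmf[i - 1] * (transcript_len - i + 1) for i in range(1, longest + 1))


def inflation_factor(transcript_len, pmf):
    """How much larger an FPKM becomes when the effective length replaces the raw length."""
    return transcript_len / effective_length(transcript_len, pmf)


# Hypothetical library: mean fragment length 200 bp, standard deviation 40 bp.
PMF = fragment_length_pmf(mean=200, sd=40, max_len=1000)

for length in (100, 300, 2000):
    print(length, round(effective_length(length, PMF), 2), round(inflation_factor(length, PMF), 1))
```

With these made-up parameters, a 100 bp transcript has an effective length of less than 1 bp, so its FPKM is scaled up more than a hundredfold, while a 2,000 bp transcript is barely affected. This is the same qualitative pattern as the one reported in this thread.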
