  • High values of FPKM on cuffdiff

    Hi all,

    I ran cuffdiff on my control vs. condition samples to check whether mRNA expression of the gene I knocked down (in HeLa cells) is lower than in my control.

    Happily, the knock-down sample does show a lower FPKM for that gene. Great.

    However, I noticed very high FPKM values for some microRNAs:
    NR_106781 has a value of 1,460,000
    NR_039666 has a value of 995,697
    NR_039828, NR_002574 and NR_037428 are around 15,000
    and around 15 more entries are above 4,000

    The command line I used:

    cuffdiff -p 6 -b hg19/fasta/ -u --no-update-check -v -L control,condition -o /cuffdiff hg19refseq.gtf control/accepted_hits.bam condition/accepted_hits.bam

    * I used the RefSeq hg19 annotation downloaded from the UCSC Table Browser.

    These results obviously raise some questions:

    Why are the FPKM values for those microRNAs so high?
    And how can an FPKM value be over a million, given that FPKM stands for fragments per kilobase of transcript per million mapped fragments?

    Anyone?

  • #2
    miRNAs are going to be less than 1kb, so...
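
To make the arithmetic behind this reply concrete, here is a minimal sketch using the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The function name `fpkm` and every count and length below are made-up illustration values, not numbers from this thread. Because the length term is expressed in kilobases, a transcript shorter than 1 kb has its count scaled up, and nothing caps the result at one million.

```python
def fpkm(fragments: int, length_bp: float, total_mapped: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


# Hypothetical library size and counts, chosen only for illustration.
TOTAL_MAPPED = 20_000_000

# Same fragment count on a long and on a short transcript.
long_mrna = fpkm(fragments=2_000, length_bp=2_000, total_mapped=TOTAL_MAPPED)
short_rna = fpkm(fragments=2_000, length_bp=80, total_mapped=TOTAL_MAPPED)

# Theoretical ceiling for an 80 bp transcript: every mapped fragment hits it.
ceiling_80bp = fpkm(fragments=TOTAL_MAPPED, length_bp=80, total_mapped=TOTAL_MAPPED)

print(f"2 kb mRNA:     {long_mrna:,.0f} FPKM")
print(f"80 bp RNA:     {short_rna:,.0f} FPKM")
print(f"80 bp ceiling: {ceiling_80bp:,.0f} FPKM")
```

With the same 2,000 fragments, the 80 bp transcript scores 1,250 FPKM against 50 FPKM for the 2 kb transcript, and the ceiling for an 80 bp transcript is 1e9 / 80 = 12.5 million. This uses the annotated length only and does not model any further correction a quantifier may apply.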

    • #3
      Thanks

      But are those good values?

      • #4
        I haven't a clue what "good" would mean in this context. An expressed microRNA is going to have a high FPKM, so if that's what you mean then yes.

        • #5
          That is what I mean

          I wanted to know whether those values represent true expression, as this is the first time I have seen such high values in RNA-seq.

          I was worried they might be contamination of some sort that should be removed before running Cufflinks.

          Anyway, those microRNAs are cancer-related according to NCBI, and since we used HeLa cells, I guess those numbers could reflect true microRNA expression.

          Thank you, Ryan.

          • #6
            You've just discovered one of the wonderful quirks of Cufflinks.

            During RNA-Seq library preparation, the short RNA molecules (e.g. miRNAs) get filtered out. Given that the majority of the very short reads get filtered out, the Cufflinks programmers assume that any short sequences that do make it through are actually representative of a much larger population. So they decided, somewhat arbitrarily and without properly documenting their decision, to assign extremely high FPKM values to very short sequences even if the number of reads actually aligning to these very short sequences is very low.
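
The behaviour described here can be illustrated with the effective-length idea: instead of dividing by the annotated length L, the count is divided by the expected number of positions at which a fragment could start, i.e. the sum of P(l) * (L - l + 1) over fragment lengths l no longer than L. The sketch below assumes a discretized normal fragment-length distribution with a hypothetical mean of 200 bp and standard deviation of 40 bp. The function names and every number are illustrative assumptions, and the actual Cufflinks implementation may differ in detail.

```python
import math


def fragment_length_pmf(mean: float, sd: float, max_len: int) -> list[float]:
    """Discretized normal over fragment lengths 1..max_len (assumed shape)."""
    weights = [math.exp(-0.5 * ((l - mean) / sd) ** 2) for l in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def effective_length(transcript_len: int, pmf: list[float]) -> float:
    """Sum of P(l) * (L - l + 1) over fragment lengths l <= L."""
    return sum(
        p * (transcript_len - l + 1)
        for l, p in enumerate(pmf, start=1)
        if l <= transcript_len
    )


def fpkm(fragments: int, length_bp: float, total_mapped: int) -> float:
    """Fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


# Hypothetical library: 20 million mapped fragments, mean fragment 200 bp, SD 40 bp.
TOTAL_MAPPED = 20_000_000
pmf = fragment_length_pmf(mean=200.0, sd=40.0, max_len=1000)

for name, length, count in [("2 kb mRNA", 2000, 2000), ("80 bp small RNA", 80, 10)]:
    eff = effective_length(length, pmf)
    print(
        f"{name}: effective length {eff:.3f} bp, "
        f"FPKM by annotated length {fpkm(count, length, TOTAL_MAPPED):,.1f}, "
        f"FPKM by effective length {fpkm(count, eff, TOTAL_MAPPED):,.1f}"
    )
```

Under these assumptions the effective length of the 80 bp transcript falls well below 1 bp, because only the extreme short tail of the fragment distribution fits inside it. Dividing even a handful of fragments by such a small number produces a very large FPKM, while the 2 kb transcript is barely affected.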

            Given that most short sequences are lost during library preparation, the best solution is to simply ignore them in the analysis. A researcher interested in small RNAs can do small RNA-Seq, which does not include any filtering step (resulting in a lot of junk, but that is another problem).

            Please do not waste any time analyzing the small RNAs in RNA-Seq. I haven't seen any papers analyzing small RNAs from RNA-Seq but I'm sure they must exist. Any paper analyzing small RNAs from RNA-Seq data should be dismissed. I do know countless people who have wasted time trying to make sense of small RNA counts from Cufflinks results.

            I just add the gene biotypes with BioMart to my FPKM counts so that researchers can identify the small RNAs (miRNAs, snoRNAs, ...) and know to treat the counts with extreme caution.

            The htseq-count and DESeq pipeline does not have this issue. Ultimately, actually examining the alignment file in IGV or the UCSC genome browser is always the best solution for individual genes.

            Here is the full justification from Cole Trapnell. I should say that he did take the time to post on seqanswers.com. I do like his software and all the work the team has put into Cufflinks, even though in this post I may appear a bit frustrated with their somewhat opaque decision-making around FPKM values.

            "This issue has been discussed elsewhere on this board. As Nicholas points out, RNA-Seq really isn't reliable for very short transcripts. The reason is that all the fragments that map to these transcripts come from the "tail" of the distribution of library fragment lengths. That is, fragments that map to microRNAs are much, much shorter than most fragments in the library - by design in the RNA-Seq protocol, which size selects away very short inserts. Thus, Cufflinks infers that even though relatively few fragments actually mapped to the microRNAs, there were probably TONS of individual microRNA molecules in the transcriptome before all of the various size selection parts of the protocol kicked in. Cufflinks accordingly increases the FPKM of these short transcripts to compensate for the bias against short fragments in the library.

            This compensation was designed to improve accuracy for transcripts that are in the 500bp-1kb range - for longer transcripts, the "edge effects" due to library fragment size aren't much of an issue. However, I wouldn't trust FPKM values for transcripts shorter than your average fragment length. There's really just not enough data in most standard RNA-Seq libraries to say much about small RNA abundance.

            I should also point out that other methods use this same bias correction technique (RSEM for example). As far as I'm aware, the "count-based" methods don't, but that doesn't mean they shouldn't. Most of those methods are strictly for differential analysis, where any edge effects are assumed to be affecting each condition the same way. That may or may not be the case in your data.

            In any case, the quick answer to this problem is to simply remove or ignore transcripts shorter than around 300bp from your GTF. In a future version, we will be flagging these transcripts as too short for reliable quantification where appropriate."
            Cole Trapnell
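
The quick fix suggested in the quote, dropping transcripts shorter than about 300 bp from the GTF, could be sketched as follows. This assumes a standard nine-column, tab-separated GTF in which exon records carry a `transcript_id` attribute, and it takes transcript length as the summed exon length. The function names and the demo records are hypothetical. One caveat: if the same `transcript_id` is annotated at several loci, the exons are summed across all of them, so that case should be checked separately.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(gtf_lines):
    """Return {transcript_id: summed exon length in bp} from GTF lines."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            lengths[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return lengths


def drop_short_transcripts(gtf_lines, min_len=300):
    """Keep every line except those of transcripts with exon length < min_len."""
    gtf_lines = list(gtf_lines)
    lengths = transcript_lengths(gtf_lines)
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        # Lines without a transcript_id, or without any exon record, are kept.
        if match and match.group(1) in lengths and lengths[match.group(1)] < min_len:
            continue
        kept.append(line)
    return kept


# Hypothetical demo records: one 80 bp transcript and one 600 bp transcript.
DEMO_GTF = [
    'chr1\tdemo\texon\t1000\t1079\t.\t+\t.\tgene_id "G1"; transcript_id "TX_SHORT";\n',
    'chr1\tdemo\texon\t5000\t5399\t.\t+\t.\tgene_id "G2"; transcript_id "TX_LONG";\n',
    'chr1\tdemo\texon\t6000\t6199\t.\t+\t.\tgene_id "G2"; transcript_id "TX_LONG";\n',
]
kept = drop_short_transcripts(DEMO_GTF, min_len=300)
print(f"kept {len(kept)} of {len(DEMO_GTF)} lines")
```

For a real annotation, the same function can be fed an open file handle (for example the `hg19refseq.gtf` from the first post) and the returned lines written to a new file, which is then passed to cuffdiff in place of the original GTF.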

            • #7
              Oh my

              I think you both won the discussion

              I guess everyone agrees that these values just say: be aware that those miRNAs are highly expressed, but don't assume the numbers are reliable.

              Thanks a lot, blancha!

              • #8
                Is there any way of turning off this feature? I noticed the options

                --no-effective-length-correction

                and

                --no-length-correction

                in the Cufflinks manual, which may or may not do this. I am not sure whether that is what those options are intended for (and, if so, which of the two I should use). I have "regular" RNA-seq data (not microRNA-specific sequencing), but the sequencing company claimed they had not done any fragment size selection. I therefore suspect it would be better to turn off this "correction".

                • #9
                  From what I understand, --no-length-correction just reports the FPM part of FPKM, without the per-kilobase normalization.

                  If no fragmentation was done, there is no need for the per-kilobase term.
                  Last edited by Ohad; 05-09-2014, 04:28 AM.

                  • #10
                    Originally posted by Ohad:
                    "From what I understand, --no-length-correction just reports the FPM part of FPKM, without the per-kilobase normalization.

                    If no fragmentation was done, there is no need for the per-kilobase term."

                    I'm pretty sure fragmentation was done, just no size selection on those fragments.

                    • #11
                      I would double-check that there was no "size selection".
                      Small RNAs are filtered out during the standard RNA-Seq library preparation protocol.
                      I've verified this with the technician who prepares our samples.
                      There is just no specific step in the protocol labelled size selection, so the "sequencing company" may not even be aware that most small RNAs were removed, and may not inform customers about the impact on the downstream bioinformatics analysis.
                      I've had terrible correlation between replicates in RNA-Seq results for very short reads, so I've learnt to disregard RNA-Seq results for short RNA sequences.

                      • #12
                        Originally posted by blancha:
                        "I would double-check that there was no 'size selection'.
                        Small RNAs are filtered out during the standard RNA-Seq library preparation protocol.
                        I've verified this with the technician who prepares our samples.
                        There is just no specific step in the protocol labelled size selection, so the 'sequencing company' may not even be aware that most small RNAs were removed, and may not inform customers about the impact on the downstream bioinformatics analysis.
                        I've had terrible correlation between replicates in RNA-Seq results for very short reads, so I've learnt to disregard RNA-Seq results for short RNA sequences."

                        You may be right. The company in question did at first falsely claim that the library preparation method they used was not strand-specific (which I later confirmed it to be), so I would not be surprised if they are wrong about this as well. What makes me believe that there was indeed no size selection, though, is the fact that my average mate inner distance (the TopHat "-r" option, inferred with RSeQC and independently from TLEN in the SAM files from the alignment) is -30. So at least most insert sizes must have been really small.

                        • #13
                          * What I meant was that --no-length-correction should be used WHEN no fragmentation was done. Sorry for the confusion.

                          NOObseq, you should add the read lengths of both mates to that -30 and look at the distribution around the average to spot whether size selection took place. Keep in mind that TopHat may not have included larger fragments in your SAM file, as it may have labeled them "not proper pair".

                          • #14
                            Originally posted by Ohad:
                            "* What I meant was that --no-length-correction should be used WHEN no fragmentation was done. Sorry for the confusion.

                            NOObseq, you should add the read lengths of both mates to that -30 and look at the distribution around the average to spot whether size selection took place. Keep in mind that TopHat may not have included larger fragments in your SAM file, as it may have labeled them 'not proper pair'."

                            OK, thank you for clarifying. To get the fragment length I would then add 2*100, yielding an average fragment size of 170. This is less than the sum of the read lengths (2*100), so the reads overlap. The cause of this overlap would be an insert size (or, analogously, fragment size) that is too short. My distribution of insert sizes is shown in the attached file. Compare it with the "typical" distribution from the RSeQC manual http://rseqc.sourceforge.net/#inner-distance-py.
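
The arithmetic in this reply can be written out as a small sketch. It assumes the usual convention that the inner distance is the gap between the two mates, so a negative value means they overlap. The function names are hypothetical, and the inputs are the values quoted in the thread.

```python
def fragment_length(inner_distance: int, read_len_1: int, read_len_2: int) -> int:
    """Fragment (outer) length: both read lengths plus the inner distance."""
    return read_len_1 + read_len_2 + inner_distance


def mate_overlap(inner_distance: int) -> int:
    """Bases shared by the two mates; zero when the inner distance is not negative."""
    return max(0, -inner_distance)


# Values quoted in the thread: mean inner distance of -30 with 2 x 100 bp reads.
print(fragment_length(-30, 100, 100))  # 170
print(mate_overlap(-30))               # 30
```

A mean fragment of 170 bp sequenced with 2 x 100 bp reads therefore implies that the mates overlap by about 30 bases on average.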

                            I don't know whether TopHat has excluded a lot of longer fragments; I may need to look into that. But the distribution looks suspicious enough as it is. I also get lots of artifacts ("novel" isoforms) in Cufflinks, which is another reason to take a closer look at these bias correction parameters.
                            Attached Files

                            • #15
                              To my eye, your distribution looks fine, and I think an average of 170 is fine as well.
                              I don't understand why the novel transcripts make you suspicious of the fragment distribution.

                              I suggest you post your questions in a new thread so that people can find them in future searches.
