Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with FPKM values

    Hi,

    I see problems with the fpkm values calculated by cufflinks program. I dont see correlation between counts and fpkms for some genes.

    To be more clear I explain my scenario
    I have RNA-SEQ data for two conditions (let say condA and condB) with 3 replicates each
    for a particular gene : say gene Tagap

    Cond A:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Count 169 171 124 154
    fpkms 4.44 3.8 3.2 3.8

    Cond B:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Count 17 38 21 25
    fpkms 5.0 4.7 4.3 4.6

    From above :
    CondA (counts) > CondB (counts)
    CondA (fpkms) < CondB (fpkms)


    I tried to understand the fpkm calculations from cufflinks supplementary methods, but I couldnt completely get their calculations. After googling abit, I thought I can use this simple formulae to do normalization

    Nc = (10^9*Counts)/(Transcript_length*Total no. of uniquely mapped reads)
    with this formulae, I made use of counts to make normalizations and got below values

    Cond A:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Nc 1.72 1.88 1.26 1.62

    Cond B:
    Type Rep1 Rep2 Rep3 mean_of_replicates
    Nc 0.18 0.4 0.26 0.28


    CondA (Nc) > CondB (Nc)

    Nc values correlates with the counts.

    Can I use this approach (Nc calculations) to get the normalized values instead of FPKMs calculated from cufflinks?

    Why the direction of fold change is completely reversed by cufflinks fpkms?

  • #2
    Which version of Cufflinks are you using?
    Also, what is the total number of uniquely mapped reads for each sample?
    Have you tried comparing your results with the output of featureCounts and DESeq2?

    Comment


    • #3
      Thanks for your reply!
      Originally posted by blancha View Post
      Which version of Cufflinks are you using?
      Also, what is the total number of uniquely mapped reads for each sample?
      Have you tried comparing your results with the output of featureCounts and DESeq2?
      Which version of Cufflinks are you using?
      cufflinks/2.1.1 is used on bam files to calculate the fpkms

      Also, what is the total number of uniquely mapped reads for each sample?
      This number i got from this command
      samtools view -F 4 accepted_hits_sorted_dupRemoved_P1827_101.bam | wc -l
      31837064 + 0 in total (QC-passed reads + QC-failed reads)
      0 + 0 duplicates
      31837064 + 0 mapped (100.00%:-nan%)
      0 + 0 paired in sequencing
      0 + 0 read1
      0 + 0 read2
      0 + 0 properly paired (-nan%:-nan%)
      0 + 0 with itself and mate mapped
      0 + 0 singletons (-nan%:-nan%)
      0 + 0 with mate mapped to a different chr
      0 + 0 with mate mapped to a different chr (mapQ>=5)

      Have you tried comparing your results with the output of featureCounts and DESeq2?
      No, But the counts I used for comparision is from htSeqCount package

      Comment


      • #4
        So you used different bam file for cufflinks and htseqcount? accepted_hits.bam for cufflinks and accepted_hits_sorted_dupRemoved_P1827_101.bam for htseqcount?

        Comment


        • #5
          I would just start by updating Cufflinks.
          I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
          I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
          I'd be curious to know if your problem disappears with the latest version of Cufflinks.

          Comment


          • #6
            Originally posted by blancha View Post
            I would just start by updating Cufflinks.
            I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
            I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
            I'd be curious to know if your problem disappears with the latest version of Cufflinks.
            Thanks for the suggestion! As the final output tables for htseq and cufflinks are delivered for us by sequencing facility, I am not sure which files they used as input for htseqcount and cufflinks (with duplicates or without), now I am rerunning all the results of Cufflinks and htSeq-Count with duplicated_removed_bam files and update you if I overcome the issue.

            Comment


            • #7
              Yes, please let me know if the issue is resolved.
              I would not remove the duplicates, though.

              Comment


              • #8
                Originally posted by blancha View Post
                Yes, please let me know if the issue is resolved.
                I would not remove the duplicates, though.
                Dear blancha,
                I rerun htseq-count (0.6.1) and cufflinks(version 2.2.1) on my RNA-Seq samples without removing the duplicates. But I see the same results now, without much changes

                First 3 values from Condition A with 3 replicates (red) and last 3 from Condition B with 3 replicates (Green)
                Counts
                ENSMUSG00000033450 Tagap 169 171 124 17 38 21
                FPKMS
                ENSMUSG00000033450 Tagap 4.43869 3.79506 3.20408 4.99311 4.66142 4.2739


                Counts = Cond A (mean) > Cond B (mean)
                FPKMs = Cond A (mean) < Cond B (mean)

                Actually this is not for all the genes, for fewer genes (say 15 genes as far I noticed)


                Commands used for htseq and cufflinks:

                samtools sort $raw_dir/$sam $bam_dir/$nnm'.sorted'

                cufflinks --library-type fr-firststrand -p 8 -G $main_dir/data/Mus_musculus.GRCm38.82.gtf -M $main_dir/data/mask_MR.gtf $bam_dir/$nnm'.sorted.bam' -o $cuff_dir

                samtools view $bam_dir/$nnm'.sorted.bam' | htseq-count -s reverse - $main_dir/data/Mus_musculus.GRCm38.82.gtf> $ht_dir/$nnm"_count.txt"

                Comment


                • #9
                  Have a look at the BAM files in IGV. It's likely that the files with low raw counts have a bunch of multimappers that htseq-count is (correctly, given its purpose) excluding while cufflinks (possibly correctly, given its purpose) is including. You should be able to tell by eye whether cufflinks is being reasonable or not.

                  Comment


                  • #10
                    Yes, @dpryan has a point.

                    @priya, I just want to point out that Cufflinks doesn't provide you only with the FPKM values, but also with the raw counts.
                    I initially thought you were giving me the raw counts for Cufflinks, and that there was a discrepancy between the raw counts given by Cufflinks and the FPKM given by Cufflinks.

                    If there is a difference between the counts for htseq-count and the counts for Cufflinks, this is a different matter altogether, and not necessarily indicative of a bug, since they may count reads differently, as pointed out by @dpryan.

                    The file "genes.count_tracking" contains the counts for Cuffdiff.

                    As pointed out by @dpryan, the simplest solution is always to check the reads in the BAM file yourself with IGV, if you are getting different counts with different software programs.

                    I hope I have not given you any misleading information
                    Last edited by blancha; 10-13-2015, 12:58 PM.

                    Comment


                    • #11
                      Thanks dpryan and blancha for your suggestions!

                      @dpryan is correct, I noticed it when I viewed my bam files in IGV, actually its issue with multiple mapped reads for low raw counts.

                      @blancha: My aim of the analysis is to identify differentially expressed genes between wild-type and knockout lines, after going through this old posts from seqanswers
                      http://seqanswers.com/forums/showthread.php?t=9129
                      http://seqanswers.com/forums/showthread.php?t=5793
                      Two ways - htseqcount -> DE-tests (DESEq/EdgeR/Voom)
                      Cufflinks -> Cuffdiff
                      I stick to the first approach and also in my case, problem arises with lower counts and genes with higher counts has no issues (even i compare htseq_counts and fpkms) ; so for downstream analysis I am considering to eliminate low counts with considerable cutoff and perform differential expression analysis.

                      Another question is ;
                      If we want to compare htseq_counts between two different genes, we need to normalize by gene-lengths ; so can I do by this formulae: (my beginning question of this post)
                      for gene i; normalized count (Nc)
                      Nci = (htseq_count)i * 10^9/ (gene length * Total reads in the sample)

                      so by this way both gene lengths and sample sizes are corrected (If i am not wrong??)

                      Comment


                      • #12
                        So that will sort of but not really make things comparable. There are a couple issues:

                        1) Unless there's only 1 isoform for each of the genes, there's always the question of what the most appropriate gene length is. In an ideal world you'd quantify things at the transcript level and then have an "effective gene length" to use, but that sort of thing takes a bit more work.

                        2) Typically things like GC content affect what gets sequenced, so you end up needing to correct for that as well.

                        3) Inevitably there are things not listed here that also need to be accounted for.

                        The safest method would be to try to do sequencing with some spike-ins that are similar so you can determine how to correct everything, but that'll take a significant time and resource commitment.

                        Comment


                        • #13
                          Good to know about those points! Thanks !

                          Comment


                          • #14
                            @priya

                            The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
                            You do need to provide the gene lengths yourself.
                            If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
                            The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

                            The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.

                            Comment


                            • #15
                              Originally posted by blancha View Post
                              @priya

                              The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
                              You do need to provide the gene lengths yourself.
                              If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
                              The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

                              The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.
                              @blancha Thanks for bringing into my notice about the latest version, its just updated yesterday .. (https://www.bioconductor.org/package...man/DESeq2.pdf)
                              I see there are lot of new functions added into DESeq2 , I need to explore more of it esp. fpkms() and normalizeGeneLength() .

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X