Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks: a problem with the FPKM ratios?

    Hi all,

    I am getting inaccurate FPKM ratios after cufflinks. The same samples give low correlations of log 2 ratios (see attached scatter plot), which is very worrying.
    Is this a bug in cufflinks or am I missing something?

    Here is what I did:

    I have 2 different biological RNA seq samples, each with huge coverage. I created from each of the single lanes 2 data sets: setA and setB,each consists 30M reads.

    I used cufflinks and cuffdiff to define transcripts, estimates their abundance and calculate differential expression.

    The problem, is that the log2 ratios of the (FPKM of sample 2)/(FPKM of sample 1) in "set A" and in "set B" are not consistent. Even though "set A" and "set B" - are from the same library prep and even from the same lane (ran in HiSeq 2000).

    In setA there are 124 genes up-regulated above 2 fold; in setB there are 120 genes up-regulated above 2 fold, while their intersect is only 76 (!).

    In the attached figure there is a scatter plot for the log2 (sample 1 RFPKM/sample 2 RFPKM) obtained for "set A" versus the same log2ratio obtained from "set B".
    The RFPKM value were taken from the file genes.fpkm_tracking in each of the sets (below is a detail of the commands I used).
    In the scatter plot are shown only genes that had > 5 RPKM and "OK" in the status.

    Did anyone else encounter such problems? Am I going wrong with my workflow?

    Thanks a lot

    --------------------------------------------------
    The commands I used:
    run tophat:
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o cufflinks_p_sample1_b -p 3 /srv/db/Bowtie/mm9/mm9 sample1_30a.txt
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample2_a -p 3 /srv/db/Bowtie/mm9/mm9 sample2_30a.txt
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample1_b -p 3 /srv/db/Bowtie/mm9/mm9 sample1_30b.txt
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample2_b -p 3 /srv/db/Bowtie/mm9/mm9 sample2_30b.txt

    run cufflinks on each sample separately:

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample1_a -L sample1_a -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample1_30a/accepted_hits.bam -p 2

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample1_b -L sample1_b -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample1_30b/accepted_hits.bam -p 2

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample2_a -L sample2_a-g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample2_30a/accepted_hits.bam -p 2

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample2_b -L sample2_b -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample2_b/accepted_hits.bam -p 2

    in dir setAB run cuffcompare:
    /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffcompare -r /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -V ../cufflinks_p_sample1_a/transcripts.gtf ../cufflinks_p_sample1_b/transcripts.gtf ../cufflinks_p_cufflinks_p_sample2_a/transcripts.gtf ../cufflinks_p_sample2_b/transcripts.gtf

    Now I ran cuffdiff separately for setA and setB (in different directories):

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffdiff -N ../setAB/cuffcmp.combined.gtf ../../cufflinks_p_sample1_a/accepted_hits.bam ../../cufflinks_p_sample2_a/accepted_hits.bam -o cuffdiff_out_a

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffdiff -N ../setAB/cuffcmp.combined.gtf ../../cufflinks_p_sample1_b/accepted_hits.bam ../../cufflinks_p_sample2_b/accepted_hits.bam -o cuffdiff_out_b
    Attached Files

  • #2
    Does it make sense to run cufflinks on the total set first to get a combined GTF file, rather than using 4 different ones for each condition?

    Comment


    • #3
      After running the 2 sets I used cuffcompare to combine all together. So cuffdiff was run with the same gtf for setA and setB.

      I also ran cuffdiff for setA and setB on refseq gtf, and did not get consistent log2ratios.

      Comment


      • #4
        ah right, sorry. I misread the command sequence and thought the cuffcompare line was something else (not quite sure what...).

        Comment


        • #5
          Does this also happen if you use the same subsample instead of a different one (e.g. setA vs setA)? I think cufflinks includes some sampling in its FPKM calculations so there'll be small shifts in the values, but it's unlikely to be within the ranges you've demonstrated.

          How are you doing the read subsampling to choose setA/setB? What are the absolute FPKM values for these transcripts (small values will have comparatively larger errors)?

          Comment


          • #6
            Thanks for the replies.
            Very good questions.

            1. I haven't tried giving the same set twice. Anyway - I wouldn't like to have such differences - no matter where they come from.

            2. Regarding the read sampling: I wanted to choose reads randomly, but my script was too heavy. So I took 1st 30M reads and 2nd 30M reads of the fastq file. So I agree it might be problematic. However, when I used the the same sets A and B to get RPKM values using the Partek software - I got beautiful correlations (I can attach scatter plots if anyone is interested). So this indicates that the sets are OK and can give consistent results.

            Comment


            • #7
              Is partek doing any filtering/preprocessing? If so, It seems like the sampling is a key issue. Unless you know the reads are randomly arranged in the fastq, using the halves could explain the differential expression you see where none is expected.

              What of the sampling script was "heavy?"

              Comment


              • #8
                Thanks for the reply.
                I am not sure if Partek does filtering - they say very little in their white pages.
                As Partek gives high correlation I don't think that the sampling is the problem.

                Comment


                • #9
                  did you ever resolve this, or get a response from the authors?

                  Comment


                  • #10
                    Still didn't resolve the problem...

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 08:47 AM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    59 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    54 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X