Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks: a problem with the FPKM ratios?

    Hi all,

    I am getting inaccurate FPKM ratios after cufflinks. The same samples give low correlations of log 2 ratios (see attached scatter plot), which is very worrying.
    Is this a bug in cufflinks or am I missing something?

    Here is what I did:

    I have 2 different biological RNA seq samples, each with huge coverage. I created from each of the single lanes 2 data sets: setA and setB,each consists 30M reads.

    I used cufflinks and cuffdiff to define transcripts, estimates their abundance and calculate differential expression.

    The problem, is that the log2 ratios of the (FPKM of sample 2)/(FPKM of sample 1) in "set A" and in "set B" are not consistent. Even though "set A" and "set B" - are from the same library prep and even from the same lane (ran in HiSeq 2000).

    In setA there are 124 genes up-regulated above 2 fold; in setB there are 120 genes up-regulated above 2 fold, while their intersect is only 76 (!).

    In the attached figure there is a scatter plot for the log2 (sample 1 RFPKM/sample 2 RFPKM) obtained for "set A" versus the same log2ratio obtained from "set B".
    The RFPKM value were taken from the file genes.fpkm_tracking in each of the sets (below is a detail of the commands I used).
    In the scatter plot are shown only genes that had > 5 RPKM and "OK" in the status.

    Did anyone else encounter such problems? Am I going wrong with my workflow?

    Thanks a lot

    --------------------------------------------------
    The commands I used:
    run tophat:
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o cufflinks_p_sample1_b -p 3 /srv/db/Bowtie/mm9/mm9 sample1_30a.txt
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample2_a -p 3 /srv/db/Bowtie/mm9/mm9 sample2_30a.txt
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample1_b -p 3 /srv/db/Bowtie/mm9/mm9 sample1_30b.txt
    nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample2_b -p 3 /srv/db/Bowtie/mm9/mm9 sample2_30b.txt

    run cufflinks on each sample separately:

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample1_a -L sample1_a -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample1_30a/accepted_hits.bam -p 2

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample1_b -L sample1_b -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample1_30b/accepted_hits.bam -p 2

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample2_a -L sample2_a-g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample2_30a/accepted_hits.bam -p 2

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample2_b -L sample2_b -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample2_b/accepted_hits.bam -p 2

    in dir setAB run cuffcompare:
    /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffcompare -r /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -V ../cufflinks_p_sample1_a/transcripts.gtf ../cufflinks_p_sample1_b/transcripts.gtf ../cufflinks_p_cufflinks_p_sample2_a/transcripts.gtf ../cufflinks_p_sample2_b/transcripts.gtf

    Now I ran cuffdiff separately for setA and setB (in different directories):

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffdiff -N ../setAB/cuffcmp.combined.gtf ../../cufflinks_p_sample1_a/accepted_hits.bam ../../cufflinks_p_sample2_a/accepted_hits.bam -o cuffdiff_out_a

    nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffdiff -N ../setAB/cuffcmp.combined.gtf ../../cufflinks_p_sample1_b/accepted_hits.bam ../../cufflinks_p_sample2_b/accepted_hits.bam -o cuffdiff_out_b
    Attached Files

  • #2
    Does it make sense to run cufflinks on the total set first to get a combined GTF file, rather than using 4 different ones for each condition?

    Comment


    • #3
      After running the 2 sets I used cuffcompare to combine all together. So cuffdiff was run with the same gtf for setA and setB.

      I also ran cuffdiff for setA and setB on refseq gtf, and did not get consistent log2ratios.

      Comment


      • #4
        ah right, sorry. I misread the command sequence and thought the cuffcompare line was something else (not quite sure what...).

        Comment


        • #5
          Does this also happen if you use the same subsample instead of a different one (e.g. setA vs setA)? I think cufflinks includes some sampling in its FPKM calculations so there'll be small shifts in the values, but it's unlikely to be within the ranges you've demonstrated.

          How are you doing the read subsampling to choose setA/setB? What are the absolute FPKM values for these transcripts (small values will have comparatively larger errors)?

          Comment


          • #6
            Thanks for the replies.
            Very good questions.

            1. I haven't tried giving the same set twice. Anyway - I wouldn't like to have such differences - no matter where they come from.

            2. Regarding the read sampling: I wanted to choose reads randomly, but my script was too heavy. So I took 1st 30M reads and 2nd 30M reads of the fastq file. So I agree it might be problematic. However, when I used the the same sets A and B to get RPKM values using the Partek software - I got beautiful correlations (I can attach scatter plots if anyone is interested). So this indicates that the sets are OK and can give consistent results.

            Comment


            • #7
              Is partek doing any filtering/preprocessing? If so, It seems like the sampling is a key issue. Unless you know the reads are randomly arranged in the fastq, using the halves could explain the differential expression you see where none is expected.

              What of the sampling script was "heavy?"

              Comment


              • #8
                Thanks for the reply.
                I am not sure if Partek does filtering - they say very little in their white pages.
                As Partek gives high correlation I don't think that the sampling is the problem.

                Comment


                • #9
                  did you ever resolve this, or get a response from the authors?

                  Comment


                  • #10
                    Still didn't resolve the problem...

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Recent Advances in Sequencing Analysis Tools
                      by seqadmin


                      The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                      05-06-2024, 07:48 AM
                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:57 AM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-06-2024, 07:17 AM
                    0 responses
                    16 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-02-2024, 08:06 AM
                    0 responses
                    20 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-30-2024, 12:17 PM
                    0 responses
                    24 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X