Hi all,
I am getting inconsistent FPKM ratios from cufflinks. Subsets of reads drawn from the same samples give poorly correlated log2 ratios (see attached scatter plot), which is very worrying.
Is this a bug in cufflinks, or am I missing something?
Here is what I did:
I have 2 different biological RNA-seq samples, each sequenced on a single lane with very high coverage. From each lane I created 2 data sets, setA and setB, each consisting of 30M reads.
I used cufflinks and cuffdiff to define transcripts, estimate their abundance and calculate differential expression.
The problem is that the log2 ratios of (FPKM of sample 2)/(FPKM of sample 1) in setA and in setB are not consistent, even though setA and setB come from the same library prep and even from the same lane (run on a HiSeq 2000).
In setA there are 124 genes up-regulated more than 2-fold; in setB there are 120 genes up-regulated more than 2-fold, while their intersection is only 76 (!).
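To make the overlap numbers concrete: 76 shared genes out of 124 + 120 - 76 = 168 distinct genes is a Jaccard index of about 0.45. Below is a minimal Python sketch of how such an overlap can be computed from per-gene log2 ratios. The function names and the toy gene IDs are hypothetical, and the threshold of log2 ratio > 1 is my reading of "above 2-fold"; this is not part of cufflinks or cuffdiff.

```python
def upregulated(log2_ratios, min_log2=1.0):
    """Return the set of gene IDs whose log2 ratio exceeds min_log2.

    log2_ratios: dict mapping gene ID -> log2(FPKM sample 2 / FPKM sample 1).
    min_log2=1.0 corresponds to "more than 2-fold up" (assumption).
    """
    return {gene for gene, ratio in log2_ratios.items() if ratio > min_log2}


def overlap_summary(up_a, up_b):
    """Summarise the agreement between two sets of up-regulated genes."""
    shared = up_a & up_b
    union = up_a | up_b
    return {
        "n_a": len(up_a),
        "n_b": len(up_b),
        "n_shared": len(shared),
        # Jaccard index: shared genes divided by all genes called in either set.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy example with made-up gene IDs and ratios (not real data).
set_a = {"geneA": 1.8, "geneB": 1.2, "geneC": 0.3, "geneD": 2.5}
set_b = {"geneA": 1.6, "geneB": 0.7, "geneC": 1.4, "geneD": 2.1}
summary = overlap_summary(upregulated(set_a), upregulated(set_b))
print(summary)
```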
The attached figure is a scatter plot of log2(sample 1 FPKM / sample 2 FPKM) obtained for setA versus the same log2 ratio obtained for setB.
The FPKM values were taken from the file genes.fpkm_tracking in each of the sets (the commands I used are detailed below).
The scatter plot shows only genes that had FPKM > 5 and status "OK".
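The filtering and comparison described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the column names are passed in as parameters because the exact header of genes.fpkm_tracking depends on the cufflinks version, the names used in the toy input (`gene_id`, `status`, `q1_FPKM`, `q2_FPKM`) are placeholders, I assume the FPKM > 5 cutoff applies to both samples, and the ratio is taken as sample 2 over sample 1.

```python
import csv
import io
import math


def read_fpkm(handle, id_col, fpkm_cols, status_col):
    """Read a tab-separated tracking table; keep rows whose status is "OK".

    Returns a dict: gene ID -> (FPKM sample 1, FPKM sample 2).
    Column names are supplied by the caller (they are assumptions here).
    """
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[status_col] != "OK":
            continue
        table[row[id_col]] = tuple(float(row[col]) for col in fpkm_cols)
    return table


def log2_ratios(fpkm, min_fpkm=5.0):
    """log2(sample 2 / sample 1) for genes with both FPKM values above min_fpkm."""
    return {
        gene: math.log2(s2 / s1)
        for gene, (s1, s2) in fpkm.items()
        if s1 > min_fpkm and s2 > min_fpkm
    }


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def compare_sets(ratios_a, ratios_b):
    """Correlate log2 ratios over the genes that passed the filter in both sets."""
    shared = sorted(set(ratios_a) & set(ratios_b))
    r = pearson([ratios_a[g] for g in shared], [ratios_b[g] for g in shared])
    return r, len(shared)


# Toy input standing in for one tracking file (made-up values).
toy = "gene_id\tstatus\tq1_FPKM\tq2_FPKM\ng1\tOK\t10\t40\ng2\tOK\t8\t8\ng3\tFAIL\t50\t100\ng4\tOK\t2\t20\n"
fpkm = read_fpkm(io.StringIO(toy), "gene_id", ("q1_FPKM", "q2_FPKM"), "status")
print(log2_ratios(fpkm))
```

With real data, `read_fpkm` would be called once per cuffdiff output directory and the two resulting dictionaries passed through `log2_ratios` and then `compare_sets`.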
Has anyone else encountered such problems? Is something wrong with my workflow?
Thanks a lot
--------------------------------------------------
The commands I used:
run tophat:
nohup tophat --segment-mismatches 1 --solexa1.3-quals -o cufflinks_p_sample1_b -p 3 /srv/db/Bowtie/mm9/mm9 sample1_30a.txt
nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample2_a -p 3 /srv/db/Bowtie/mm9/mm9 sample2_30a.txt
nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample1_b -p 3 /srv/db/Bowtie/mm9/mm9 sample1_30b.txt
nohup tophat --segment-mismatches 1 --solexa1.3-quals -o top_part_sample2_b -p 3 /srv/db/Bowtie/mm9/mm9 sample2_30b.txt
run cufflinks on each sample separately:
nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample1_a -L sample1_a -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample1_30a/accepted_hits.bam -p 2
nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample1_b -L sample1_b -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample1_30b/accepted_hits.bam -p 2
nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample2_a -L sample2_a -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample2_30a/accepted_hits.bam -p 2
nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cufflinks -o cufflinks_p_sample2_b -L sample2_b -g /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -u ../top_part_sample2_b/accepted_hits.bam -p 2
in dir setAB run cuffcompare:
/usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffcompare -r /ngs002/user_data/bsgilgi/data/mouse/ucsc_mm9_known_genes.gtf -V ../cufflinks_p_sample1_a/transcripts.gtf ../cufflinks_p_sample1_b/transcripts.gtf ../cufflinks_p_cufflinks_p_sample2_a/transcripts.gtf ../cufflinks_p_sample2_b/transcripts.gtf
Then I ran cuffdiff separately for setA and setB (in different directories):
nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffdiff -N ../setAB/cuffcmp.combined.gtf ../../cufflinks_p_sample1_a/accepted_hits.bam ../../cufflinks_p_sample2_a/accepted_hits.bam -o cuffdiff_out_a
nohup /usr/local/src/cufflinks/cufflinks-1.1.0.Linux_x86_64/cuffdiff -N ../setAB/cuffcmp.combined.gtf ../../cufflinks_p_sample1_b/accepted_hits.bam ../../cufflinks_p_sample2_b/accepted_hits.bam -o cuffdiff_out_b