I am performing differential expression analysis of three different experimental conditions, each of which has 4 biological replicates, and was trying to do this using cuffdiff. We are really only interested initially in looking at currently annotated genes. I am using the Ensembl genome and gtf files from iGenomes.
STEP 1: tophat
I initially (after quality filtering fastq files) ran tophat with the following command:
For the first run (to make a transcriptome index from the GTF file):
tophat -p 8 -G Ensembl_genes.gtf --transcriptome-index=GTF_transcriptome_index/known -o SW-1B-1.tophat --no-novel-juncs --library-type=fr-firststrand genome SW-1B-1_AC_QF_FINAL.fastq;
for the rest of the 11 tophat runs using previously built transcriptome index
for k in {2..12}; do
tophat -p 8 --transcriptome-index=GTF_transcriptome_index/known -o SW-1B-$k.tophat --no-novel-juncs --library-type=fr-firststrand genome SW-1B-$k.AC_QF_FINAL.fastq;
STEP 2: cuffdiff
Then I ran cuffdiff in one of two ways. Firstly, specifying three replicate groups (named T,P,N) as such:
cuffdiff -o Diff_OUT -v -b genome.fa -p 8 -L T,P,N -u Ensembl_genes.gtf \
./SW-1B-1.tophat/accepted_hits.bam,./SW-1B-4.tophat/accepted_hits.bam,./SW-1B-7.tophat/accepted_hits.bam,./SW-1B-10.tophat/accepted_hits.bam \
./SW-1B-2.tophat/accepted_hits.bam,./SW-1B-5.tophat/accepted_hits.bam,./SW-1B-8.tophat/accepted_hits.bam,./SW-1B-11.tophat/accepted_hits.bam \
./SW-1B-3.tophat/accepted_hits.bam,./SW-1B-6.tophat/accepted_hits.bam,./SW-1B-9.tophat/accepted_hits.bam,./SW-1B-12.tophat/accepted_hits.bam
Secondly as we wanted to see the FPKMs of each sample separately we ran cuffdiff on each sample separately:
cuffdiff -o Diff_OUT_separate -v -b genome.fa -p 8 -L 1T,2T,3T,4T,1P,2P,3P,4P,1N,2N,3N,4N -u Ensembl_genes.gtf \
./SW-1B-1.tophat/accepted_hits.bam \
./SW-1B-4.tophat/accepted_hits.bam \
./SW-1B-7.tophat/accepted_hits.bam \
./SW-1B-10.tophat/accepted_hits.bam \
./SW-1B-2.tophat/accepted_hits.bam \
./SW-1B-5.tophat/accepted_hits.bam \
./SW-1B-8.tophat/accepted_hits.bam \
./SW-1B-11.tophat/accepted_hits.bam \
./SW-1B-3.tophat/accepted_hits.bam \
./SW-1B-6.tophat/accepted_hits.bam \
./SW-1B-9.tophat/accepted_hits.bam \
./SW-1B-12.tophat/accepted_hits.bam
Step3: confusion!
Now I am really confused: The analysis using replicates as three groups gave us only a handful of significantly expressed genes.
When I look at the FPKMs of these genes in the cuffdiff which ran each sample separately- they don’t add up to what I am seeing when the samplse are grouped as biological replicates.
The gene NCKX1 was supposed to be of average FPKM 2 in T and 7.77 in P; in the cuffdiff run separately this seems to match up OK with the individual FPKMs.
But the gene RGD1359290 was supposed to go from 2.6 in T to 18.8 in P and 17.4 in N; in the individual sampes nothing is even remotely near those values in P, and there is only one high FPKM value in N.
Would love to know what is going on and whether I can trust these data!
Thanks
Noa
STEP 1: tophat
I initially (after quality filtering fastq files) ran tophat with the following command:
For the first run (to make a transcriptome index from the GTF file):
tophat -p 8 -G Ensembl_genes.gtf --transcriptome-index=GTF_transcriptome_index/known -o SW-1B-1.tophat --no-novel-juncs --library-type=fr-firststrand genome SW-1B-1_AC_QF_FINAL.fastq;
for the rest of the 11 tophat runs using previously built transcriptome index
for k in {2..12}; do
tophat -p 8 --transcriptome-index=GTF_transcriptome_index/known -o SW-1B-$k.tophat --no-novel-juncs --library-type=fr-firststrand genome SW-1B-$k.AC_QF_FINAL.fastq;
STEP 2: cuffdiff
Then I ran cuffdiff in one of two ways. Firstly, specifying three replicate groups (named T,P,N) as such:
cuffdiff -o Diff_OUT -v -b genome.fa -p 8 -L T,P,N -u Ensembl_genes.gtf \
./SW-1B-1.tophat/accepted_hits.bam,./SW-1B-4.tophat/accepted_hits.bam,./SW-1B-7.tophat/accepted_hits.bam,./SW-1B-10.tophat/accepted_hits.bam \
./SW-1B-2.tophat/accepted_hits.bam,./SW-1B-5.tophat/accepted_hits.bam,./SW-1B-8.tophat/accepted_hits.bam,./SW-1B-11.tophat/accepted_hits.bam \
./SW-1B-3.tophat/accepted_hits.bam,./SW-1B-6.tophat/accepted_hits.bam,./SW-1B-9.tophat/accepted_hits.bam,./SW-1B-12.tophat/accepted_hits.bam
Secondly as we wanted to see the FPKMs of each sample separately we ran cuffdiff on each sample separately:
cuffdiff -o Diff_OUT_separate -v -b genome.fa -p 8 -L 1T,2T,3T,4T,1P,2P,3P,4P,1N,2N,3N,4N -u Ensembl_genes.gtf \
./SW-1B-1.tophat/accepted_hits.bam \
./SW-1B-4.tophat/accepted_hits.bam \
./SW-1B-7.tophat/accepted_hits.bam \
./SW-1B-10.tophat/accepted_hits.bam \
./SW-1B-2.tophat/accepted_hits.bam \
./SW-1B-5.tophat/accepted_hits.bam \
./SW-1B-8.tophat/accepted_hits.bam \
./SW-1B-11.tophat/accepted_hits.bam \
./SW-1B-3.tophat/accepted_hits.bam \
./SW-1B-6.tophat/accepted_hits.bam \
./SW-1B-9.tophat/accepted_hits.bam \
./SW-1B-12.tophat/accepted_hits.bam
Step3: confusion!
Now I am really confused: The analysis using replicates as three groups gave us only a handful of significantly expressed genes.
When I look at the FPKMs of these genes in the cuffdiff which ran each sample separately- they don’t add up to what I am seeing when the samplse are grouped as biological replicates.
The gene NCKX1 was supposed to be of average FPKM 2 in T and 7.77 in P; in the cuffdiff run separately this seems to match up OK with the individual FPKMs.
But the gene RGD1359290 was supposed to go from 2.6 in T to 18.8 in P and 17.4 in N; in the individual sampes nothing is even remotely near those values in P, and there is only one high FPKM value in N.
Would love to know what is going on and whether I can trust these data!
Thanks
Noa