I have three SOLiD human RNA-Seq libraries which I am analysing using cuffdiff. A large proportion of the reads in these libraries (~90%) may be PCR duplicates, but as they are single-end reads I have no way of knowing for sure whether they are PCR duplicates or genuine reads which just happen to align to the same location.
I have two sets of BAM files: (1) all uniquely mapped reads, (2) uniquely mapped reads with "duplicates" removed. The library sizes for BAMs (1) are 23, 26 and 49 million reads, and for BAMs (2) are 2, 3 and 4.5 million reads.
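As a quick sanity check on the ~90% figure, the fraction of reads discarded by duplicate removal can be worked out from the library sizes above. This is only a minimal sketch: the variable and function names are illustrative, and the numbers are the rounded library sizes quoted in this post.

```python
# Library sizes in millions of reads, taken from the figures quoted above.
all_reads = [23, 26, 49]      # BAMs (1): all uniquely mapped reads
dedup_reads = [2, 3, 4.5]     # BAMs (2): "duplicates" removed


def duplicate_fraction(total, kept):
    """Fraction of reads discarded by duplicate removal."""
    return 1 - kept / total


# One fraction per library, in the same order as the lists above.
fractions = [duplicate_fraction(t, k) for t, k in zip(all_reads, dedup_reads)]

for i, f in enumerate(fractions, start=1):
    print(f"library {i}: {f:.1%} of reads flagged as duplicates")
```

With these rounded sizes, each library loses roughly 88 to 91% of its reads, which is consistent with the ~90% estimate.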
I am comparing the cuffdiff output on these two sets of BAM files. I am using a hg19mRNA gtf file which I have made compatible with cufflinks using this cuffcompare command as recommended in the cufflinks manual:
Code:
cuffcompare -s hg19.fa -CG -r hg19mRNA.gtf hg19mRNA.gtf
My cuffdiff command looks like this:
Code:
cuffdiff -p 8 -b hg19.fa cuffcmp.combined.gtf file1.bam file2.bam file3.bam
When I look at the RPKM values generated by cuffdiff I see that for the BAM(1) files containing all the reads, only 839 genes get RPKM values in at least 1 of my 3 samples. Cuffdiff on the BAM(2) files with "duplicates" removed generates RPKM values for 19,129 genes. I have positive controls in my samples which do not get RPKM values in the first scenario BAM(1) but do get RPKM values as expected in scenario 2 BAM(2).
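The "RPKM value in at least 1 of my 3 samples" gene counts quoted in this post could be tallied along the following lines. This is a hedged sketch rather than the exact procedure used: it assumes a tab-delimited table with one row per gene and one expression column per sample, and the column names (`sample1`, `sample2`, `sample3`) and the toy values are hypothetical. The headers of the real cuffdiff tracking files should be checked before adapting it.

```python
import csv
import io


def count_genes_with_signal(table_text, value_columns):
    """Count rows where at least one of the given columns is greater than 0."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    count = 0
    for row in reader:
        if any(float(row[col]) > 0 for col in value_columns):
            count += 1
    return count


# Toy table with made-up values (NOT real cuffdiff output); the column
# names are placeholders for whatever the real tracking file uses.
toy_table = (
    "gene_id\tsample1\tsample2\tsample3\n"
    "geneA\t0\t0\t0\n"
    "geneB\t1.5\t0\t0\n"
    "geneC\t0\t2.0\t3.1\n"
)

n_expressed = count_genes_with_signal(toy_table, ["sample1", "sample2", "sample3"])
print(n_expressed)
```

Running the same count on the output from each set of BAM files gives the two numbers being compared.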
Can anyone explain what is going on here?
Thanks
HELEN