![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
DESeq with FPKM values | frymor | Bioinformatics | 12 | 04-29-2016 01:29 PM |
t-test FPKM values | int11ap1 | RNA Sequencing | 12 | 07-17-2014 12:57 PM |
NOISeq with fpkm values | NitaC | Bioinformatics | 5 | 07-12-2014 06:11 AM |
Cufflinks 0 FPKM values | herstein | Bioinformatics | 2 | 07-24-2013 11:21 PM |
FPKM values are zero | budgie lover | Bioinformatics | 1 | 09-12-2012 05:54 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]()
Hi,
I see problems with the fpkm values calculated by cufflinks program. I dont see correlation between counts and fpkms for some genes. To be more clear I explain my scenario I have RNA-SEQ data for two conditions (let say condA and condB) with 3 replicates each for a particular gene : say gene Tagap Cond A: Type Rep1 Rep2 Rep3 mean_of_replicates Count 169 171 124 154 fpkms 4.44 3.8 3.2 3.8 Cond B: Type Rep1 Rep2 Rep3 mean_of_replicates Count 17 38 21 25 fpkms 5.0 4.7 4.3 4.6 From above : CondA (counts) > CondB (counts) CondA (fpkms) < CondB (fpkms) I tried to understand the fpkm calculations from cufflinks supplementary methods, but I couldnt completely get their calculations. After googling abit, I thought I can use this simple formulae to do normalization Nc = (10^9*Counts)/(Transcript_length*Total no. of uniquely mapped reads) with this formulae, I made use of counts to make normalizations and got below values Cond A: Type Rep1 Rep2 Rep3 mean_of_replicates Nc 1.72 1.88 1.26 1.62 Cond B: Type Rep1 Rep2 Rep3 mean_of_replicates Nc 0.18 0.4 0.26 0.28 CondA (Nc) > CondB (Nc) Nc values correlates with the counts. Can I use this approach (Nc calculations) to get the normalized values instead of FPKMs calculated from cufflinks? Why the direction of fold change is completely reversed by cufflinks fpkms? |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Montreal Join Date: May 2013
Posts: 367
|
![]()
Which version of Cufflinks are you using?
Also, what is the total number of uniquely mapped reads for each sample? Have you tried comparing your results with the output of featureCounts and DESeq2? |
![]() |
![]() |
![]() |
#3 | |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]()
Thanks for your reply!
Quote:
cufflinks/2.1.1 is used on bam files to calculate the fpkms Also, what is the total number of uniquely mapped reads for each sample? This number i got from this command samtools view -F 4 accepted_hits_sorted_dupRemoved_P1827_101.bam | wc -l 31837064 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 31837064 + 0 mapped (100.00%:-nan%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Have you tried comparing your results with the output of featureCounts and DESeq2? No, But the counts I used for comparision is from htSeqCount package |
|
![]() |
![]() |
![]() |
#4 |
Member
Location: florida Join Date: Jan 2013
Posts: 67
|
![]()
So you used different bam file for cufflinks and htseqcount? accepted_hits.bam for cufflinks and accepted_hits_sorted_dupRemoved_P1827_101.bam for htseqcount?
|
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Montreal Join Date: May 2013
Posts: 367
|
![]()
I would just start by updating Cufflinks.
I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results. I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use. I'd be curious to know if your problem disappears with the latest version of Cufflinks. |
![]() |
![]() |
![]() |
#6 | |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Montreal Join Date: May 2013
Posts: 367
|
![]()
Yes, please let me know if the issue is resolved.
I would not remove the duplicates, though. |
![]() |
![]() |
![]() |
#8 | |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]() Quote:
I rerun htseq-count (0.6.1) and cufflinks(version 2.2.1) on my RNA-Seq samples without removing the duplicates. But I see the same results now, without much changes First 3 values from Condition A with 3 replicates (red) and last 3 from Condition B with 3 replicates (Green) Counts ENSMUSG00000033450 Tagap 169 171 124 17 38 21 FPKMS ENSMUSG00000033450 Tagap 4.43869 3.79506 3.20408 4.99311 4.66142 4.2739 Counts = Cond A (mean) > Cond B (mean) FPKMs = Cond A (mean) < Cond B (mean) Actually this is not for all the genes, for fewer genes (say 15 genes as far I noticed) ![]() Commands used for htseq and cufflinks: samtools sort $raw_dir/$sam $bam_dir/$nnm'.sorted' cufflinks --library-type fr-firststrand -p 8 -G $main_dir/data/Mus_musculus.GRCm38.82.gtf -M $main_dir/data/mask_MR.gtf $bam_dir/$nnm'.sorted.bam' -o $cuff_dir samtools view $bam_dir/$nnm'.sorted.bam' | htseq-count -s reverse - $main_dir/data/Mus_musculus.GRCm38.82.gtf> $ht_dir/$nnm"_count.txt" |
|
![]() |
![]() |
![]() |
#9 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
Have a look at the BAM files in IGV. It's likely that the files with low raw counts have a bunch of multimappers that htseq-count is (correctly, given its purpose) excluding while cufflinks (possibly correctly, given its purpose) is including. You should be able to tell by eye whether cufflinks is being reasonable or not.
|
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: Montreal Join Date: May 2013
Posts: 367
|
![]()
Yes, @dpryan has a point.
@priya, I just want to point out that Cufflinks doesn't provide you only with the FPKM values, but also with the raw counts. I initially thought you were giving me the raw counts for Cufflinks, and that there was a discrepancy between the raw counts given by Cufflinks and the FPKM given by Cufflinks. If there is a difference between the counts for htseq-count and the counts for Cufflinks, this is a different matter altogether, and not necessarily indicative of a bug, since they may count reads differently, as pointed out by @dpryan. The file "genes.count_tracking" contains the counts for Cuffdiff. As pointed out by @dpryan, the simplest solution is always to check the reads in the BAM file yourself with IGV, if you are getting different counts with different software programs. I hope I have not given you any misleading information Last edited by blancha; 10-13-2015 at 01:58 PM. |
![]() |
![]() |
![]() |
#11 |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]()
Thanks dpryan and blancha for your suggestions!
@dpryan is correct, I noticed it when I viewed my bam files in IGV, actually its issue with multiple mapped reads for low raw counts. @blancha: My aim of the analysis is to identify differentially expressed genes between wild-type and knockout lines, after going through this old posts from seqanswers http://seqanswers.com/forums/showthread.php?t=9129 http://seqanswers.com/forums/showthread.php?t=5793 Two ways - htseqcount -> DE-tests (DESEq/EdgeR/Voom) Cufflinks -> Cuffdiff I stick to the first approach and also in my case, problem arises with lower counts and genes with higher counts has no issues (even i compare htseq_counts and fpkms) ; so for downstream analysis I am considering to eliminate low counts with considerable cutoff and perform differential expression analysis. Another question is ; If we want to compare htseq_counts between two different genes, we need to normalize by gene-lengths ; so can I do by this formulae: (my beginning question of this post) for gene i; normalized count (Nc) Nci = (htseq_count)i * 10^9/ (gene length * Total reads in the sample) so by this way both gene lengths and sample sizes are corrected (If i am not wrong??) |
![]() |
![]() |
![]() |
#12 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
So that will sort of but not really make things comparable. There are a couple issues:
1) Unless there's only 1 isoform for each of the genes, there's always the question of what the most appropriate gene length is. In an ideal world you'd quantify things at the transcript level and then have an "effective gene length" to use, but that sort of thing takes a bit more work. 2) Typically things like GC content affect what gets sequenced, so you end up needing to correct for that as well. 3) Inevitably there are things not listed here that also need to be accounted for. The safest method would be to try to do sequencing with some spike-ins that are similar so you can determine how to correct everything, but that'll take a significant time and resource commitment. |
![]() |
![]() |
![]() |
#13 |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]()
Good to know about those points! Thanks !
|
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: Montreal Join Date: May 2013
Posts: 367
|
![]()
@priya
The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values. You do need to provide the gene lengths yourself. If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file. The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette. The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes. |
![]() |
![]() |
![]() |
#15 | |
Member
Location: sweden Join Date: Apr 2013
Posts: 57
|
![]() Quote:
I see there are lot of new functions added into DESeq2 , I need to explore more of it esp. fpkms() and normalizeGeneLength() . ![]() ![]() |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|