SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq with FPKM values frymor Bioinformatics 12 04-29-2016 01:29 PM
t-test FPKM values int11ap1 RNA Sequencing 12 07-17-2014 12:57 PM
NOISeq with fpkm values NitaC Bioinformatics 5 07-12-2014 06:11 AM
Cufflinks 0 FPKM values herstein Bioinformatics 2 07-24-2013 11:21 PM
FPKM values are zero budgie lover Bioinformatics 1 09-12-2012 05:54 AM

Reply
 
Thread Tools
Old 10-09-2015, 06:03 AM   #1
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Question Problems with FPKM values

Hi,

I see problems with the fpkm values calculated by cufflinks program. I dont see correlation between counts and fpkms for some genes.

To be more clear I explain my scenario
I have RNA-SEQ data for two conditions (let say condA and condB) with 3 replicates each
for a particular gene : say gene Tagap

Cond A:
Type Rep1 Rep2 Rep3 mean_of_replicates
Count 169 171 124 154
fpkms 4.44 3.8 3.2 3.8

Cond B:
Type Rep1 Rep2 Rep3 mean_of_replicates
Count 17 38 21 25
fpkms 5.0 4.7 4.3 4.6

From above :
CondA (counts) > CondB (counts)
CondA (fpkms) < CondB (fpkms)


I tried to understand the fpkm calculations from cufflinks supplementary methods, but I couldnt completely get their calculations. After googling abit, I thought I can use this simple formulae to do normalization

Nc = (10^9*Counts)/(Transcript_length*Total no. of uniquely mapped reads)
with this formulae, I made use of counts to make normalizations and got below values

Cond A:
Type Rep1 Rep2 Rep3 mean_of_replicates
Nc 1.72 1.88 1.26 1.62

Cond B:
Type Rep1 Rep2 Rep3 mean_of_replicates
Nc 0.18 0.4 0.26 0.28


CondA (Nc) > CondB (Nc)

Nc values correlates with the counts.

Can I use this approach (Nc calculations) to get the normalized values instead of FPKMs calculated from cufflinks?

Why the direction of fold change is completely reversed by cufflinks fpkms?
priya is offline   Reply With Quote
Old 10-09-2015, 06:42 AM   #2
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Which version of Cufflinks are you using?
Also, what is the total number of uniquely mapped reads for each sample?
Have you tried comparing your results with the output of featureCounts and DESeq2?
blancha is offline   Reply With Quote
Old 10-09-2015, 07:29 AM   #3
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Default

Thanks for your reply!
Quote:
Originally Posted by blancha View Post
Which version of Cufflinks are you using?
Also, what is the total number of uniquely mapped reads for each sample?
Have you tried comparing your results with the output of featureCounts and DESeq2?
Which version of Cufflinks are you using?
cufflinks/2.1.1 is used on bam files to calculate the fpkms

Also, what is the total number of uniquely mapped reads for each sample?
This number i got from this command
samtools view -F 4 accepted_hits_sorted_dupRemoved_P1827_101.bam | wc -l
31837064 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
31837064 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Have you tried comparing your results with the output of featureCounts and DESeq2?
No, But the counts I used for comparision is from htSeqCount package
priya is offline   Reply With Quote
Old 10-09-2015, 10:30 AM   #4
yzzhang
Member
 
Location: florida

Join Date: Jan 2013
Posts: 67
Default

So you used different bam file for cufflinks and htseqcount? accepted_hits.bam for cufflinks and accepted_hits_sorted_dupRemoved_P1827_101.bam for htseqcount?
yzzhang is offline   Reply With Quote
Old 10-09-2015, 06:32 PM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I would just start by updating Cufflinks.
I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
I'd be curious to know if your problem disappears with the latest version of Cufflinks.
blancha is offline   Reply With Quote
Old 10-12-2015, 01:33 AM   #6
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Default

Quote:
Originally Posted by blancha View Post
I would just start by updating Cufflinks.
I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
I'd be curious to know if your problem disappears with the latest version of Cufflinks.
Thanks for the suggestion! As the final output tables for htseq and cufflinks are delivered for us by sequencing facility, I am not sure which files they used as input for htseqcount and cufflinks (with duplicates or without), now I am rerunning all the results of Cufflinks and htSeq-Count with duplicated_removed_bam files and update you if I overcome the issue.
priya is offline   Reply With Quote
Old 10-12-2015, 08:29 AM   #7
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Yes, please let me know if the issue is resolved.
I would not remove the duplicates, though.
blancha is offline   Reply With Quote
Old 10-13-2015, 07:32 AM   #8
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Default

Quote:
Originally Posted by blancha View Post
Yes, please let me know if the issue is resolved.
I would not remove the duplicates, though.
Dear blancha,
I rerun htseq-count (0.6.1) and cufflinks(version 2.2.1) on my RNA-Seq samples without removing the duplicates. But I see the same results now, without much changes

First 3 values from Condition A with 3 replicates (red) and last 3 from Condition B with 3 replicates (Green)
Counts
ENSMUSG00000033450 Tagap 169 171 124 17 38 21
FPKMS
ENSMUSG00000033450 Tagap 4.43869 3.79506 3.20408 4.99311 4.66142 4.2739


Counts = Cond A (mean) > Cond B (mean)
FPKMs = Cond A (mean) < Cond B (mean)

Actually this is not for all the genes, for fewer genes (say 15 genes as far I noticed)


Commands used for htseq and cufflinks:

samtools sort $raw_dir/$sam $bam_dir/$nnm'.sorted'

cufflinks --library-type fr-firststrand -p 8 -G $main_dir/data/Mus_musculus.GRCm38.82.gtf -M $main_dir/data/mask_MR.gtf $bam_dir/$nnm'.sorted.bam' -o $cuff_dir

samtools view $bam_dir/$nnm'.sorted.bam' | htseq-count -s reverse - $main_dir/data/Mus_musculus.GRCm38.82.gtf> $ht_dir/$nnm"_count.txt"
priya is offline   Reply With Quote
Old 10-13-2015, 01:06 PM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Have a look at the BAM files in IGV. It's likely that the files with low raw counts have a bunch of multimappers that htseq-count is (correctly, given its purpose) excluding while cufflinks (possibly correctly, given its purpose) is including. You should be able to tell by eye whether cufflinks is being reasonable or not.
dpryan is offline   Reply With Quote
Old 10-13-2015, 01:39 PM   #10
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Yes, @dpryan has a point.

@priya, I just want to point out that Cufflinks doesn't provide you only with the FPKM values, but also with the raw counts.
I initially thought you were giving me the raw counts for Cufflinks, and that there was a discrepancy between the raw counts given by Cufflinks and the FPKM given by Cufflinks.

If there is a difference between the counts for htseq-count and the counts for Cufflinks, this is a different matter altogether, and not necessarily indicative of a bug, since they may count reads differently, as pointed out by @dpryan.

The file "genes.count_tracking" contains the counts for Cuffdiff.

As pointed out by @dpryan, the simplest solution is always to check the reads in the BAM file yourself with IGV, if you are getting different counts with different software programs.

I hope I have not given you any misleading information

Last edited by blancha; 10-13-2015 at 01:58 PM.
blancha is offline   Reply With Quote
Old 10-15-2015, 02:25 AM   #11
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Default

Thanks dpryan and blancha for your suggestions!

@dpryan is correct, I noticed it when I viewed my bam files in IGV, actually its issue with multiple mapped reads for low raw counts.

@blancha: My aim of the analysis is to identify differentially expressed genes between wild-type and knockout lines, after going through this old posts from seqanswers
http://seqanswers.com/forums/showthread.php?t=9129
http://seqanswers.com/forums/showthread.php?t=5793
Two ways - htseqcount -> DE-tests (DESEq/EdgeR/Voom)
Cufflinks -> Cuffdiff
I stick to the first approach and also in my case, problem arises with lower counts and genes with higher counts has no issues (even i compare htseq_counts and fpkms) ; so for downstream analysis I am considering to eliminate low counts with considerable cutoff and perform differential expression analysis.

Another question is ;
If we want to compare htseq_counts between two different genes, we need to normalize by gene-lengths ; so can I do by this formulae: (my beginning question of this post)
for gene i; normalized count (Nc)
Nci = (htseq_count)i * 10^9/ (gene length * Total reads in the sample)

so by this way both gene lengths and sample sizes are corrected (If i am not wrong??)
priya is offline   Reply With Quote
Old 10-15-2015, 04:17 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

So that will sort of but not really make things comparable. There are a couple issues:

1) Unless there's only 1 isoform for each of the genes, there's always the question of what the most appropriate gene length is. In an ideal world you'd quantify things at the transcript level and then have an "effective gene length" to use, but that sort of thing takes a bit more work.

2) Typically things like GC content affect what gets sequenced, so you end up needing to correct for that as well.

3) Inevitably there are things not listed here that also need to be accounted for.

The safest method would be to try to do sequencing with some spike-ins that are similar so you can determine how to correct everything, but that'll take a significant time and resource commitment.
dpryan is offline   Reply With Quote
Old 10-15-2015, 05:35 AM   #13
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Default

Good to know about those points! Thanks !
priya is offline   Reply With Quote
Old 10-15-2015, 07:17 AM   #14
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

@priya

The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
You do need to provide the gene lengths yourself.
If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.
blancha is offline   Reply With Quote
Old 10-15-2015, 07:54 AM   #15
priya
Member
 
Location: sweden

Join Date: Apr 2013
Posts: 57
Default

Quote:
Originally Posted by blancha View Post
@priya

The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
You do need to provide the gene lengths yourself.
If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.
@blancha Thanks for bringing into my notice about the latest version, its just updated yesterday .. (https://www.bioconductor.org/package...man/DESeq2.pdf)
I see there are lot of new functions added into DESeq2 , I need to explore more of it esp. fpkms() and normalizeGeneLength() .
priya is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:53 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO