SEQanswers

Old 05-22-2010, 06:03 PM   #1
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 54
Default How do I get one FPKM value per gene?

I have been running Cufflinks on a set of samples and would like to compare gene expression across them. I am using FPKM values as the measure of gene abundance, but the cuffcompare output provides more than one FPKM value per gene (for genes that have isoforms). So how do I go from two or more FPKM values per gene to a single value?

Thanks!
Old 05-22-2010, 06:18 PM   #2
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 54
Default

I should have mentioned in my previous post that I have tried comparing the FPKMs reported by Cufflinks in the *genes.expr files. I was wondering whether cuffcompare is a better way to do that and, if so, how I can summarize expression per gene rather than per transcript.
Thanks
Old 05-23-2010, 09:51 AM   #3
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 104
Default

You should run cuffdiff and look at the tracking files for genes. They contain the summed FPKM values of transcripts from the same gene.
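
As a rough illustration of pulling the gene-level values out of such a tracking file, here is a Python sketch. It assumes genes.fpkm_tracking is tab-delimited with a header row, that the first column identifies the gene, and that each sample's expression column has a name ending in _FPKM. Column names have varied between Cuffdiff releases, so check the header of your own file first. The function name read_gene_fpkm is made up for the example.

```python
import csv
import io


def read_gene_fpkm(handle):
    """Return {gene identifier: {sample column: FPKM}} from a tracking file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Keep only the per-sample expression columns, not bounds or status.
    fpkm_cols = [i for i, name in enumerate(header) if name.endswith("_FPKM")]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = {header[i]: float(row[i]) for i in fpkm_cols}
    return table


if __name__ == "__main__":
    # Made-up two-sample excerpt, for illustration only.
    demo = io.StringIO(
        "tracking_id\tlocus\tq1_FPKM\tq1_status\tq2_FPKM\tq2_status\n"
        "geneA\tchr1:100-900\t12.5\tOK\t30.0\tOK\n"
    )
    print(read_gene_fpkm(demo))
```

For a real run, open the file with open("genes.fpkm_tracking") and pass the handle in place of the demo string.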
Old 05-24-2010, 11:02 AM   #4
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 54
Default

Thanks! I will do that.

Just out of curiosity, why do I sometimes get more than one row for the same gene in the Cufflinks output files *_genes.expr (which report the gene-level coordinates and expression values)? It looks as if in some cases (noncoding exons?) the FPKM values of transcripts from the same gene are not summed, even though the transcripts are assigned to the same gene.

Thanks in advance for your help.
Old 05-25-2010, 05:56 AM   #5
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by PFS View Post
Thanks! I will do that.

Just out of curiosity, why do I sometimes get more than one row for the same gene in the Cufflinks output files *_genes.expr (which report the gene-level coordinates and expression values)? It looks as if in some cases (noncoding exons?) the FPKM values of transcripts from the same gene are not summed, even though the transcripts are assigned to the same gene.

Thanks in advance for your help.
This is a known bug in Cufflinks and will be fixed in the next release.
Old 08-19-2010, 06:23 AM   #6
Kasimir
Junior Member
 
Location: Berlin

Join Date: Aug 2010
Posts: 1
Default

I have been running Cuffdiff on a set of samples using the newest available release (cuffdiff v0.8.3 (1332); 7/2/2010). In some cases the file genes.fpkm_tracking includes additional FPKM result columns, as described by PFS.
I have two questions about it.
Is there a prospective release date for a bug-fixed cuffdiff version?
Does it influence the subsequent differential expression/splicing analysis?
Many thanks in advance,
Kasimir
Old 08-20-2010, 02:21 PM   #7
frankyue50
Member
 
Location: CA

Join Date: Nov 2008
Posts: 34
Default

I ran into the same problem. I wonder if I could just add the values of the two isoforms.

Quote:
Originally Posted by PFS View Post
I have been running Cufflinks on a set of samples and would like to compare gene expression across them. I am using FPKM values as the measure of gene abundance, but the cuffcompare output provides more than one FPKM value per gene (for genes that have isoforms). So how do I go from two or more FPKM values per gene to a single value?

Thanks!
Old 09-13-2010, 08:19 AM   #8
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 190
Default update

Is this supposed to have been fixed in Cufflinks 0.8.3? It doesn't seem fixed to me... I'm still seeing multiple FPKMs for a single gene in the _genes.expr files.
Old 11-04-2010, 03:24 PM   #9
jb2
Member
 
Location: Boston, MA

Join Date: Jun 2010
Posts: 25
Default

I have also been getting some duplicates when examining the genes.expr file. I aligned to hg19 with TopHat and ran Cufflinks 0.9.2 with the -G option and the Ensembl 59 GTF file.

Any ideas?

See some examples here:


Quote:
gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status

ENSG00000143198 33524 chr1 165600097 165631033 127.41 0 298.498 FAIL
ENSG00000143198 33524 chr1 165614897 165617907 0 0 0 OK
ENSG00000162105 36862 chr11 70313960 70963623 9.58183 0 170.285 FAIL
ENSG00000162105 36862 chr11 70753739 70754197 0 0 0 OK
ENSG00000162105 36862 chr11 70798845 70798972 0 0 0 OK
ENSG00000165899 38298 chr12 80633119 80648905 0 0 0 OK
ENSG00000165899 38299 chr12 80655759 80672003 0 0 0 OK
ENSG00000165899 38300 chr12 80707295 80726842 0 0 0 OK
ENSG00000165899 38301 chr12 80730291 80772870 0 0 0 OK
ENSG00000211890 40491 chr14 106050068 106058270 259.752 227.422 292.082 OK
ENSG00000211890 40491 chr14 106055295 106056387 0 0 0 OK
ENSG00000249751 54186 chr5 138784244 138784863 20.4268 11.3876 29.466 OK
ENSG00000249751 54187 chr5 138837129 138842328 22.1737 12.7559 31.5915 OK
ENSG00000131508 54192 chr5 138906015 139008018 35.9963 23.7319 48.2606 OK
ENSG00000131508 54192 chr5 138945438 138946512 0 0 0 OK
Old 11-05-2010, 02:14 AM   #10
middlemale
Member
 
Location: Oxford

Join Date: Feb 2010
Posts: 15
Default duplicate errors

jb2, I was getting duplicates too. In my case, running Cufflinks without the -G option made them go away. You may want to try that.
Old 11-05-2010, 05:18 AM   #11
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 190
Default

I ended up writing a script to sum the FPKMs for a given gene ID, which I think is right...

Here's my (unpolished) code (a perl script and a shell script).

This botches the confidence intervals, by the way.
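
For readers without the attachments, the summing step can be sketched in a few lines of Python. This is not mgogol's script. It assumes a whitespace-delimited genes.expr with the header shown in jb2's post above (gene_id ... FPKM ... status), and the function name sum_fpkm_per_gene is made up for illustration. It does not try to combine the confidence intervals, since those cannot be added up the way the FPKMs can. It only carries a non-OK status forward if any row for the gene had one.

```python
import io


def sum_fpkm_per_gene(handle):
    """Collapse a genes.expr-style table to one entry per gene_id.

    Assumes a header line containing at least gene_id, FPKM and status,
    separated by whitespace. Confidence bounds are ignored.
    """
    header = handle.readline().split()
    col = {name: i for i, name in enumerate(header)}
    totals = {}  # gene_id -> [summed FPKM, status]
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene = fields[col["gene_id"]]
        fpkm = float(fields[col["FPKM"]])
        status = fields[col["status"]]
        entry = totals.setdefault(gene, [0.0, "OK"])
        entry[0] += fpkm
        if status != "OK":
            entry[1] = status  # remember that some row did not pass
    return totals


if __name__ == "__main__":
    # Rows copied from the genes.expr excerpt earlier in the thread.
    demo = io.StringIO(
        "gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status\n"
        "ENSG00000143198 33524 chr1 165600097 165631033 127.41 0 298.498 FAIL\n"
        "ENSG00000143198 33524 chr1 165614897 165617907 0 0 0 OK\n"
        "ENSG00000249751 54186 chr5 138784244 138784863 20.4268 11.3876 29.466 OK\n"
        "ENSG00000249751 54187 chr5 138837129 138842328 22.1737 12.7559 31.5915 OK\n"
    )
    for gene, (fpkm, status) in sum_fpkm_per_gene(demo).items():
        print(gene, round(fpkm, 4), status)
```

Keeping the status next to the sum makes it easy to set aside genes where any contributing row was flagged, rather than treating their total as reliable.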

Last edited by mgogol; 11-05-2010 at 05:52 AM.
Old 11-05-2010, 11:27 AM   #12
jb2
Member
 
Location: Boston, MA

Join Date: Jun 2010
Posts: 25
Default

Quote:
Originally Posted by mgogol View Post
This botches the confidence intervals, by the way.
Yeah, that is what I was worried about, because I was considering taking those into account with my data. I will take a look at your script though since it saves me the time of writing my own.

Hopefully Cole or others can take a look at this and let us know what the problem might be.
Old 11-10-2010, 12:25 PM   #13
yjlui
Junior Member
 
Location: oxford, uk

Join Date: May 2010
Posts: 5
Default Cufflinks

I was wondering if anyone knows what the status column in genes.expr and transcripts.expr (output files of Cufflinks) means. I can't find its meaning in the manual. A possible meaning is "can be one of OK (test successful), NOTEST (not enough alignments for testing), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing", but this is actually the description of "test status", which is a column in the Cuffdiff output files.


What shall I do with genes (or transcripts) whose status is FAIL? Shall I assume that their FPKM is 0 or take the FPKM of these genes regardless of their status?


Cufflinks v0.9.1b was used in my experiments, but the problem of getting multiple FPKM for some genes still exists. Running Cufflinks without a GTF file seems to solve this problem, but then I don't know how to link the FPKM to the corresponding Ensembl ID. If I provide a GTF file when running Cufflinks, I'll get multiple FPKM and FAIL status for some genes.


What shall I do with genes that have multiple FPKM? Shall I add the FPKM together or choose only the FPKM that matches the start and end position of these genes?


Thank you very much for your time.

Last edited by yjlui; 11-11-2010 at 06:53 AM.
Old 11-10-2010, 02:46 PM   #14
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

Does someone have a small example dataset that I can run this on to find the problem?
Old 11-11-2010, 06:14 AM   #15
yjlui
Junior Member
 
Location: oxford, uk

Join Date: May 2010
Posts: 5
Default

Thanks for the prompt reply, Adam! Just emailed you a small dataset built from my SAM file.
Old 11-15-2010, 06:25 AM   #16
jiexiong
Junior Member
 
Location: china

Join Date: May 2010
Posts: 9
Default batch ORFs finder for cufflinks assembled transcripts(mrna)

Hi,
I have used Cufflinks to assemble transcripts (mRNA) from an RNA-Seq experiment.
My aim is to check the likely length of the UTRs of each transcript, so I first need to find the best ORF for each transcript. Is there a tool that can find the best ORF in batch?
Old 11-17-2010, 09:46 AM   #17
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

I would not draw any conclusions about the FPKM of the FAILED genes.
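
To make the cause concrete: if the transcripts of one annotated gene fall into separate clusters that do not overlap each other, each cluster ends up as its own row. The Python sketch below only illustrates that kind of grouping. It is not Cufflinks' actual bundling code, and the transcript coordinates and the function name overlap_clusters are invented for the example.

```python
def overlap_clusters(intervals):
    """Group (start, end) intervals into clusters of overlapping intervals.

    Intervals are treated as closed. Two intervals land in the same
    cluster if they share a position, directly or through a chain of
    overlapping intervals.
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start <= clusters[-1]["end"]:
            # Overlaps the current cluster: extend it.
            clusters[-1]["end"] = max(clusters[-1]["end"], end)
            clusters[-1]["members"].append((start, end))
        else:
            # Gap before this interval: start a new cluster.
            clusters.append({"start": start, "end": end, "members": [(start, end)]})
    return clusters


if __name__ == "__main__":
    # Hypothetical transcripts annotated to a single gene.
    transcripts = [(100, 500), (300, 900), (2000, 2600)]
    for c in overlap_clusters(transcripts):
        print(c["start"], c["end"], len(c["members"]))
```

With the made-up coordinates, the first two transcripts merge into one cluster and the third stands alone, which is the situation where a gene would show up on more than one row.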
Old 11-22-2010, 09:38 AM   #18
ngs
Junior Member
 
Location: US

Join Date: Sep 2009
Posts: 2
Default

Quote:
Originally Posted by adarob View Post
The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

I would not draw any conclusions about the FPKM of the FAILED genes.
Hi Adam,
I ran TopHat (1.1.0) without a mouse GTF file, then ran Cufflinks (0.9.1), also without a mouse GTF file. Next I ran cuffcompare with a mouse GTF file and the two GTF files that Cufflinks generated for my two samples. Finally, I ran cuffdiff with compare.combined.gtf and the two accepted_hits.bam files.

However, when I checked gene_exp.diff, I found that some genes still have multiple FPKM values (see below):

XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes
XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes
XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes
XLOC_000013 Arfgef1 chr1:10053629-10189988 q1 q2 OK 2.7768 40.8059 2.68753 -144.972 0 yes
XLOC_000015 Arfgef1 chr1:10053629-10189988 q1 q2 OK 54.0345 65.0081 0.18489 -12.9339 0 yes
XLOC_000016 Arfgef1 chr1:10053629-10189988 q1 q2 OK 23.4654 43.6672 0.62107 -29.4492 0 yes
XLOC_000031 Tram2 chr1:20986216-20997026 q1 q2 OK 5.8219 2.96147 -0.67594 3.70609 0.000210487 yes
XLOC_000032 Tram2 chr1:20986216-20997026 q1 q2 OK 3.33419 14.9065 1.49757 -29.7646 0 yes
XLOC_000057 Tmem131 chr1:36849038-36996484 q1 q2 OK 37.3723 30.8444 -0.191975 5.03247 4.84195e-07 yes

Did I do something wrong?

I have another question regarding the gene_exp.diff file. As you can see, the first gene, Cspp1, has the same coordinates (chr1:10053629-10189988) as the second gene, Arfgef1. But in my mouse GTF file (from Ensembl), the coordinates for those two genes are:
Cspp1: Chromosome 1: 10,028,299-10,126,849
Arfgef1: Chromosome 1: 10,127,652-10,222,751

Those two genes do not overlap. Why do they have the same coordinates in the gene_exp.diff file?

Thank you very much!
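
A quick way to list the affected genes is to count how many test IDs share each gene name. The Python sketch below assumes whitespace-separated rows with the test ID in the first column and the gene name in the second, as in the excerpt above. The function name genes_with_multiple_ids is made up.

```python
from collections import defaultdict


def genes_with_multiple_ids(lines):
    """Map each gene name to its test IDs, keeping genes with more than one."""
    ids_by_gene = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        test_id, gene = fields[0], fields[1]
        ids_by_gene[gene].append(test_id)
    return {g: ids for g, ids in ids_by_gene.items() if len(ids) > 1}


if __name__ == "__main__":
    # Rows copied from the gene_exp.diff excerpt above.
    rows = [
        "XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes",
        "XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes",
        "XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes",
    ]
    for gene, ids in genes_with_multiple_ids(rows).items():
        print(gene, ids)
```

This only reports which gene names are split across several IDs. Whether their values should be summed is a separate question.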
Old 05-08-2011, 09:26 PM   #19
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default

If one has to sum the FPKMs for a gene, one has to use either the gene FPKM tracking file or the gene expression file from Cuffdiff. mgogol's Perl script uses the FPKM values together with their low and high bounds, which are only in the tracking file. Is it OK to sum the FPKM values for a gene?
Thanks
Old 05-09-2011, 09:26 PM   #20
ngs_agd
Junior Member
 
Location: India

Join Date: Feb 2011
Posts: 7
Default

Quote:
Originally Posted by adarob View Post
The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

I would not draw any conclusions about the FPKM of the FAILED genes.
Does this mean that I will have to download the cuffcompare file, edit it, upload it on galaxy and then run cuffdiff on this gtf file? Thanks for your help!