SEQanswers

How do I get one FPKM value per gene? (http://seqanswers.com/forums/showthread.php?t=5224)

PFS 05-22-2010 06:03 PM

How do I get one FPKM value per gene?
 
I have been running Cufflinks on a set of samples and would like to compare gene expression across them. I am using the FPKM values as a measure of gene abundance, but the cuffcompare output provides more than one FPKM value per gene (for those genes that have isoforms). So, how do I go from 2+ FPKM values per gene to one single value?

Thanks!

PFS 05-22-2010 06:18 PM

I should have mentioned in my previous post that I have tried comparing the FPKMs reported by Cufflinks in the *genes.expr files. I was wondering whether cuffcompare is a better way to do that and, if so, how I can summarize the expression per gene rather than per transcript.
Thanks

Thomas Doktor 05-23-2010 09:51 AM

You should run cuffdiff and look at the tracking files for genes. They contain the summed FPKM values of transcripts from the same gene.

PFS 05-24-2010 11:02 AM

Thanks! I will do that.

Just out of curiosity, why do I sometimes get more than one row for the same gene in the Cufflinks output files *_genes.expr (which report the gene-level coordinates and expression values)? It looks as if in some cases (noncoding exons??) the FPKM values of transcripts belonging to the same gene do not get summed, although the transcripts are assigned to the same gene.

Thanks in advance for your help.

Cole Trapnell 05-25-2010 05:56 AM

Quote:

Originally Posted by PFS (Post 19026)
Thanks! I will do that.

Just out of curiosity, why do I sometimes get more than one row for the same gene in the Cufflinks output files *_genes.expr (which report the gene-level coordinates and expression values)? It looks as if in some cases (noncoding exons??) the FPKM values of transcripts belonging to the same gene do not get summed, although the transcripts are assigned to the same gene.

Thanks in advance for your help.

This is a known bug in Cufflinks and will be fixed in the next release.

Kasimir 08-19-2010 06:23 AM

I have been running Cuffdiff on a set of samples using the newest available release (cuffdiff v0.8.3 (1332); 7/2/2010). In some cases the file genes.fpkm_tracking includes additional FPKM result columns, as described by PFS.
I have two questions about this:
Is there a prospective release date for a bug-fixed cuffdiff version?
Does the problem influence the subsequent differential expression/splicing analysis?
Many thanks in advance,
Kasimir

frankyue50 08-20-2010 02:21 PM

I ran into the same problem. I wonder if I could just add the two isoforms' values.

Quote:

Originally Posted by PFS (Post 18979)
I have been running Cufflinks on a set of samples and would like to compare gene expression across them. I am using the FPKM values as a measure of gene abundance, but the cuffcompare output provides more than one FPKM value per gene (for those genes that have isoforms). So, how do I go from 2+ FPKM values per gene to one single value?

Thanks!


mgogol 09-13-2010 08:19 AM

update
 
Is this supposed to have been fixed in Cufflinks 0.8.3? It doesn't seem fixed to me... I'm still seeing multiple FPKMs for a single gene in the _genes.expr files.

jb2 11-04-2010 03:24 PM

I have also been getting some duplicates when examining the genes.expr file. I aligned to hg19 using TopHat and used the -G option in Cufflinks 0.9.2 with the Ensembl 59 GTF file.

Any ideas?

See some examples here:


Quote:

gene_id          bundle_id  chr    left       right      FPKM     FPKM_conf_lo  FPKM_conf_hi  status
ENSG00000143198  33524      chr1   165600097  165631033  127.41   0             298.498       FAIL
ENSG00000143198  33524      chr1   165614897  165617907  0        0             0             OK
ENSG00000162105  36862      chr11  70313960   70963623   9.58183  0             170.285       FAIL
ENSG00000162105  36862      chr11  70753739   70754197   0        0             0             OK
ENSG00000162105  36862      chr11  70798845   70798972   0        0             0             OK
ENSG00000165899  38298      chr12  80633119   80648905   0        0             0             OK
ENSG00000165899  38299      chr12  80655759   80672003   0        0             0             OK
ENSG00000165899  38300      chr12  80707295   80726842   0        0             0             OK
ENSG00000165899  38301      chr12  80730291   80772870   0        0             0             OK
ENSG00000211890  40491      chr14  106050068  106058270  259.752  227.422       292.082       OK
ENSG00000211890  40491      chr14  106055295  106056387  0        0             0             OK
ENSG00000249751  54186      chr5   138784244  138784863  20.4268  11.3876       29.466        OK
ENSG00000249751  54187      chr5   138837129  138842328  22.1737  12.7559       31.5915       OK
ENSG00000131508  54192      chr5   138906015  139008018  35.9963  23.7319       48.2606       OK
ENSG00000131508  54192      chr5   138945438  138946512  0        0             0             OK

middlemale 11-05-2010 02:14 AM

duplicate errors
 
jb2, I was facing duplicate errors too. In my case, I later ran Cufflinks without the -G option and the output was fine. You may want to give that a try.

mgogol 11-05-2010 05:18 AM

I ended up writing a script to sum the FPKMs for a given gene id, which I think is right...

Here's my (unpolished) code (a perl script and a shell script).

This botches the confidence intervals, by the way.
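
The attached scripts are not reproduced in this view. As a rough illustration of the same idea (this is a Python sketch, not mgogol's Perl script), the code below sums the FPKM column per gene_id. It assumes a whitespace-delimited table with the header shown in jb2's post; the function name sum_fpkm_per_gene is made up for this example. Like the original script, it does not produce meaningful confidence intervals, so it ignores them.

```python
import io
from collections import defaultdict


def sum_fpkm_per_gene(handle):
    """Return {gene_id: summed FPKM} for a genes.expr-style table.

    Assumes the first line is a header containing 'gene_id' and 'FPKM'
    and that fields are separated by whitespace.
    """
    header = handle.readline().split()
    gene_col = header.index("gene_id")
    fpkm_col = header.index("FPKM")
    totals = defaultdict(float)
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        totals[fields[gene_col]] += float(fields[fpkm_col])
    return dict(totals)


# Rows copied from the example posted earlier in this thread.
example = io.StringIO(
    "gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status\n"
    "ENSG00000249751 54186 chr5 138784244 138784863 20.4268 11.3876 29.466 OK\n"
    "ENSG00000249751 54187 chr5 138837129 138842328 22.1737 12.7559 31.5915 OK\n"
    "ENSG00000131508 54192 chr5 138906015 139008018 35.9963 23.7319 48.2606 OK\n"
    "ENSG00000131508 54192 chr5 138945438 138946512 0 0 0 OK\n"
)
totals = sum_fpkm_per_gene(example)
for gene_id, fpkm in totals.items():
    print(gene_id, fpkm)
```

For a real file, replace the StringIO object with open("genes.expr") or whatever your output file is called.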

jb2 11-05-2010 11:27 AM

Quote:

Originally Posted by mgogol (Post 28651)
This botches the confidence intervals, by the way.

Yeah, that is what I was worried about, because I was considering taking those into account with my data. I will take a look at your script though since it saves me the time of writing my own.

Hopefully Cole or others can take a look at this and let us know what the problem might be.

yjlui 11-10-2010 12:25 PM

Cufflinks
 
I was wondering if anyone knows what the status column in genes.expr and transcripts.expr (output files of Cufflinks) means. I can't find the meaning in the manual. A possible meaning is "can be one of OK (test successful), NOTEST (not enough alignments for testing), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing", but this is actually the description of "test status", which is a column in the Cuffdiff output files.


What should I do with genes (or transcripts) whose status is FAIL? Should I assume that their FPKM is 0, or take the reported FPKM of these genes regardless of their status?


Cufflinks v0.9.1b was used in my experiments, but the problem of getting multiple FPKMs for some genes still exists. Running Cufflinks without a GTF file seems to solve this problem, but then I don't know how to link the FPKMs to the corresponding Ensembl IDs. If I provide a GTF file when running Cufflinks, I get multiple FPKMs and a FAIL status for some genes.


What should I do with genes that have multiple FPKMs? Should I add the FPKMs together, or choose only the FPKM whose row matches the start and end position of the gene?


Thank you very much for your time.

adarob 11-10-2010 02:46 PM

Does someone have a small example dataset that I can run this on to find the problem?

yjlui 11-11-2010 06:14 AM

Thanks for the prompt reply, Adam! Just emailed you a small dataset built from my SAM file.

jiexiong 11-15-2010 06:25 AM

Batch ORF finder for Cufflinks-assembled transcripts (mRNA)
 
Hi,
I have used Cufflinks to assemble transcripts (mRNA) from an RNA-Seq experiment.
My purpose is to check the possible length of the UTRs of each transcript, so I first need to find the best ORF for each transcript. Is there any tool for finding the best ORF in batch?
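
No tool is named in this thread. Purely as an illustration of what "find the best ORF, then measure the UTRs" could mean, here is a small Python sketch. It assumes that "best" means the longest ATG-to-stop ORF on the forward strand of the transcript sequence, which is only one possible definition; the function name longest_orf and the toy sequence are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Only the three forward-strand reading frames are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens an ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


transcript = "CCATGAAATTTGGGTAACC"  # toy sequence, not real data
orf = longest_orf(transcript)
if orf is not None:
    five_prime_utr_len = orf[0]
    three_prime_utr_len = len(transcript) - orf[1]
    print(orf, five_prime_utr_len, three_prime_utr_len)
```

To run this in batch, one would loop over the transcript sequences extracted from the assembled GTF and call longest_orf on each.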

adarob 11-17-2010 09:46 AM

The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

I would not draw any conclusions about the FPKM of the FAILED genes.
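
A minimal Python sketch of this advice, assuming the genes.expr column layout shown in jb2's post: rows with the same gene_id are collapsed into one record by summing FPKM and taking the outermost coordinates, and the summed value is withheld (set to None) for any gene that has a FAIL row. The function name collapse_genes and the choice to return None for failed genes are assumptions made for this example, not Cufflinks behavior.

```python
import io


def collapse_genes(handle):
    """Collapse repeated gene_id rows into one record per gene."""
    header = handle.readline().split()
    col = {name: i for i, name in enumerate(header)}
    genes = {}
    for line in handle:
        f = line.split()
        if not f:
            continue  # skip blank lines
        left, right = int(f[col["left"]]), int(f[col["right"]])
        rec = genes.setdefault(
            f[col["gene_id"]],
            {"left": left, "right": right, "fpkm": 0.0, "failed": False},
        )
        rec["left"] = min(rec["left"], left)
        rec["right"] = max(rec["right"], right)
        rec["fpkm"] += float(f[col["FPKM"]])
        if f[col["status"]] == "FAIL":
            rec["failed"] = True
    # Do not report a summed FPKM for genes with any FAIL row.
    for rec in genes.values():
        if rec["failed"]:
            rec["fpkm"] = None
    return genes


# Rows copied from the example posted earlier in this thread.
example = io.StringIO(
    "gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status\n"
    "ENSG00000143198 33524 chr1 165600097 165631033 127.41 0 298.498 FAIL\n"
    "ENSG00000143198 33524 chr1 165614897 165617907 0 0 0 OK\n"
    "ENSG00000211890 40491 chr14 106050068 106058270 259.752 227.422 292.082 OK\n"
    "ENSG00000211890 40491 chr14 106055295 106056387 0 0 0 OK\n"
)
collapsed = collapse_genes(example)
for gene_id, rec in collapsed.items():
    print(gene_id, rec["left"], rec["right"], rec["fpkm"])
```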

ngs 11-22-2010 09:38 AM

Quote:

Originally Posted by adarob (Post 29494)
The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

I would not draw any conclusions about the FPKM of the FAILED genes.

Hi Adam,
I ran TopHat (1.1.0) without a mouse GTF file and then ran Cufflinks (0.9.1), also without a mouse GTF file. Then I ran cuffcompare with a mouse GTF file and the two GTF files generated by Cufflinks for my two samples. Finally, I ran cuffdiff with compare.combined.gtf and the two accepted_hits.bam files.

However, when I checked gene_exp.diff, I found that some genes still have the multiple FPKM problem (see below):

XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes
XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes
XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes
XLOC_000013 Arfgef1 chr1:10053629-10189988 q1 q2 OK 2.7768 40.8059 2.68753 -144.972 0 yes
XLOC_000015 Arfgef1 chr1:10053629-10189988 q1 q2 OK 54.0345 65.0081 0.18489 -12.9339 0 yes
XLOC_000016 Arfgef1 chr1:10053629-10189988 q1 q2 OK 23.4654 43.6672 0.62107 -29.4492 0 yes
XLOC_000031 Tram2 chr1:20986216-20997026 q1 q2 OK 5.8219 2.96147 -0.67594 3.70609 0.000210487 yes
XLOC_000032 Tram2 chr1:20986216-20997026 q1 q2 OK 3.33419 14.9065 1.49757 -29.7646 0 yes
XLOC_000057 Tmem131 chr1:36849038-36996484 q1 q2 OK 37.3723 30.8444 -0.191975 5.03247 4.84195e-07 yes

Did I do something wrong?

I have another question regarding the gene_exp.diff file. As you can see, the first gene, Cspp1, has the same coordinates (chr1:10053629-10189988) as the second gene, Arfgef1. But in my mouse GTF file (from Ensembl), the coordinates for those two genes are:
Cspp1: Chromosome 1: 10,028,299-10,126,849
Arfgef1: Chromosome 1: 10,127,652-10,222,751

Those two genes do not overlap. Why do they have the same coordinates in the gene_exp.diff file?

Thank you very much!
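
A quick way to list the affected genes is sketched below in Python. It assumes, based only on the rows pasted above, that the first whitespace-separated field is the test id (XLOC_...) and the second is the gene name; the function name genes_with_multiple_loci is made up for this example. If the file has a header line, it is treated as one more row and is harmless here because its "gene name" occurs only once.

```python
from collections import defaultdict


def genes_with_multiple_loci(lines):
    """Map each gene name that appears on more than one row to its test ids."""
    loci = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        test_id, gene = fields[0], fields[1]
        loci[gene].append(test_id)
    return {gene: ids for gene, ids in loci.items() if len(ids) > 1}


# A few of the rows pasted above.
rows = [
    "XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes",
    "XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes",
    "XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes",
    "XLOC_000031 Tram2 chr1:20986216-20997026 q1 q2 OK 5.8219 2.96147 -0.67594 3.70609 0.000210487 yes",
    "XLOC_000032 Tram2 chr1:20986216-20997026 q1 q2 OK 3.33419 14.9065 1.49757 -29.7646 0 yes",
]
duplicated = genes_with_multiple_loci(rows)
for gene, ids in duplicated.items():
    print(gene, ids)
```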

honey 05-08-2011 09:26 PM

If one has to sum the FPKMs for a gene, one has to use either the FPKM gene tracking file or the gene expr file of cuffdiff. Mgogol's Perl script uses the FPKM lo, FPKM hi, and FPKM values, which are only in the tracking file. Is it OK to sum the FPKM values for a gene?
Thanks

ngs_agd 05-09-2011 09:26 PM

Quote:

Originally Posted by adarob (Post 29494)
The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

I would not draw any conclusions about the FPKM of the FAILED genes.

Does this mean that I will have to download the cuffcompare file, edit it, upload it to Galaxy, and then run cuffdiff on this GTF file? Thanks for your help!


All times are GMT -8.