SEQanswers

Old 01-12-2011, 03:35 AM   #1
Rachelly
Member
 
Location: Israel

Join Date: Oct 2010
Posts: 37
Default CuffDiff output

Hi all,
I used Cufflinks in the following work-flow:
CuffLinks -> CuffCompare -> CuffDiff

The output file genes.fpkm_tracking didn't include reference genes at all:

Code:
tracking_id     class_code      nearest_ref_id  gene_short_name tss_id  locus   MM_FPKM MM_conf_lo      MM_conf_hi      LOG_FPKM        LOG_conf_lo     LOG_conf_hi     SFT_FPKM        SFT_conf_lo     SFT_conf_hi     NY_FPKM NY_conf_lo      NY_conf_hi
XLOC_000001     -       -       -       -       SL2.30ch00:551338-551631        1.66555 0       4.24667 0.446456        0       1.7828  0.841447        0       2.67606 0       0       0
XLOC_000002     -       -       -       -       SL2.30ch00:4196781-4198207      122.746 100.586 144.907 185.302 158.075 212.529 121.462 99.1515 143.773 1616.46 1469.49 1763.43
This happened even though the combined.gtf created by CuffCompare did contain a lot of overlaps with the known genes. Also, the isoforms.fpkm_tracking output file DID contain reference annotations, but at the level of exons:
Code:
tracking_id     class_code      nearest_ref_id  gene_short_name tss_id  locus   MM_FPKM MM_conf_lo      MM_conf_hi      LOG_FPKM        LOG_conf_lo     LOG_conf_hi     SFT_FPKM        SFT_conf_lo     SFT_conf_hi     NY_FPKM NY_conf_lo      NY_conf_hi
TCONS_00000001  =       exon:Solyc00g005040.1.1.3       -       -       SL2.30ch00:551338-551631        1.66555 0       4.24667 0.446456        0       1.7828  0.841447        0       2.67606 0       0       0
TCONS_00000002  o       exon:Solyc00g006470.1.1.4       -       -       SL2.30ch00:4196781-4198207      62.9947 47.1187 78.8707 95.0768 75.573  114.581 52.5381 37.9501 67.126  972.538 908.856 1036.22
* Of course, when I ran CuffDiff with only the reference GTF, I got gene expression levels for the known genes.

My questions are:
Is there a way to get gene (and not exon) expression levels AND novel transcripts using Cufflinks?
And why doesn't the genes.fpkm_tracking file show the closest reference annotation for each gene?

Thanks!
Rachelly.
Old 01-24-2011, 03:42 PM   #2
honey
Senior Member
 
Location: Pittsburgh

Join Date: Feb 2010
Posts: 151
Default gene level

For gene-level results, run TopHat with an Ensembl or refFlat GTF file.
Old 02-22-2011, 11:59 PM   #3
Rachelly
Member
 
Location: Israel

Join Date: Oct 2010
Posts: 37
Default Cole's answer

I consulted Cole on this matter and this was his reply:

Quote:
Actually, you won't see those id's in the genes.fpkm_tracking (or, IIRC, the tss_group.fpkm_tracking) files, because as far as Cufflinks is concerned, genes and tss groups are *sets* of transcripts. Each transcript in a gene could have a different nearest reference transcript, so we don't put anything in that field.
However, the way we recommend doing what (I think) you want here is to use the gene_name attribute. If you compare to a reference file that has gene_name attributes, they will get propagated to the stdout.combined.gtf file from cuffcompare. Ensembl has the gene_name attributes already built in (and the values are typically the HUGO names in the case of human), but you could add them to your reference if they're not there already.
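
For a reference that lacks gene_name attributes, adding them might look roughly like the Python sketch below. It is only an illustration under assumptions: the function name add_gene_name is made up, and copying the gene_id value into gene_name is a placeholder choice (a real symbol table could be substituted). It assumes a standard 9-column, tab-separated GTF with quoted attribute values.

```python
import re

# Matches the gene_id attribute in the 9th GTF column, e.g. gene_id "Solyc00g005040.1";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'\bgene_name "')


def add_gene_name(gtf_line):
    """Return the GTF line with a gene_name attribute added if it was missing.

    Assumption for illustration: the gene_id value is reused as the gene_name.
    Comment lines, blank lines and lines without 9 columns are returned unchanged.
    """
    line = gtf_line.rstrip("\n")
    if not line or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) != 9:
        return line
    attrs = fields[8].rstrip()
    if GENE_NAME_RE.search(attrs):
        return line  # already has a gene_name, leave it alone
    match = GENE_ID_RE.search(attrs)
    if match is None:
        return line  # nothing to copy from
    if not attrs.endswith(";"):
        attrs += ";"
    fields[8] = f'{attrs} gene_name "{match.group(1)}";'
    return "\t".join(fields)
```

Applying add_gene_name to every line of the reference and writing the result to a new file would give cuffcompare a gene_name to carry through.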
Old 03-08-2011, 03:49 PM   #4
greener
Member
 
Location: Seattle, WA

Join Date: Sep 2010
Posts: 17
Default

Quote:
Originally Posted by Rachelly
I consulted Cole on this matter and this was his reply:
Hi Rachelly, I seem to be having the same problem. My Cuffdiff output does not contain gene names. Could you post an example of a reference file that worked and the commands you ran that worked? I tried rerunning cuffcompare with an Ensembl annotation that contained gene_name attributes, but that did not seem to work. Here is an excerpt from my Ensembl annotation file:

11 pseudogene exon 86649 87586 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000424047"; exon_number "1"; gene_name "OR4F2P"; transcript_name "OR4F2P-001";
11 protein_coding exon 129060 129388 . - . gene_id "ENSG00000230724"; transcript_id "ENST00000382784"; exon_number "1"; gene_name "AC069287.3"; transcript_name "AC069287.3-201";
Old 03-09-2011, 04:50 AM   #5
severin
Genome Informatics Facility
 
Location: Iowa @isugif

Join Date: Sep 2009
Posts: 105
Default Cuffcompare

If you ran Cuffcompare with a reference file you can extract the significant Cuffdiff transcript piles and grep out those lines in your combined gtf file which should contain your gene ids. This will tell you which genes are significant.

Requires unix commands cut, awk, grep, | (pipe) and xargs -I
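
The same lookup can also be sketched in Python instead of the shell tools listed above. This is a minimal sketch under assumptions, not the pipeline from the post: it assumes gene_exp.diff is tab-separated with a header containing test_id and significant columns, and that the combined GTF carries gene_id plus gene_name and/or nearest_ref attributes. The function names are made up and the two rows of input are toy data built from IDs quoted earlier in the thread.

```python
import csv
import io
import re
from collections import defaultdict

# Captures key "value" pairs from the GTF attribute column.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def significant_ids(diff_handle):
    """Collect test_id values from rows flagged significant == 'yes'."""
    reader = csv.DictReader(diff_handle, delimiter="\t")
    return {row["test_id"] for row in reader if row["significant"] == "yes"}


def reference_names(gtf_handle, wanted_ids):
    """Map each wanted gene_id to the gene_name / nearest_ref values seen in the GTF."""
    names = defaultdict(set)
    for line in gtf_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip comments and malformed lines
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id in wanted_ids:
            for key in ("gene_name", "nearest_ref"):
                if key in attrs:
                    names[gene_id].add(attrs[key])
    return dict(names)


# Toy input for illustration only (columns trimmed to the ones used here).
diff_text = "test_id\tsignificant\nXLOC_000001\tno\nXLOC_000002\tyes\n"
gtf_text = (
    "SL2.30ch00\tCufflinks\texon\t4196781\t4198207\t.\t+\t.\t"
    'gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; '
    'nearest_ref "exon:Solyc00g006470.1.1.4";\n'
)

result = reference_names(io.StringIO(gtf_text), significant_ids(io.StringIO(diff_text)))
print(result)
```

With real files, the two io.StringIO objects would be replaced by open file handles for gene_exp.diff and the combined GTF.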
Old 03-14-2011, 10:12 PM   #6
jasonwood
Member
 
Location: RI

Join Date: May 2010
Posts: 10
Default

I found that I had to use the -s switch in cuffcompare in order for it to propagate my gene names (with gene_name attribute in last column of GTF) all the way through to the final cuffdiff files.
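
As a rough illustration of the call described above, the sketch below only builds the cuffcompare argument list with -r (reference annotation), -s (genome sequence) and -o (output prefix). All file names and the helper name cuffcompare_command are hypothetical; check the options against the manual for your Cufflinks version before running anything.

```python
import shlex


def cuffcompare_command(reference_gtf, genome_fasta, out_prefix, transcript_gtfs):
    """Build a cuffcompare argument list that passes both -r and -s."""
    return [
        "cuffcompare",
        "-r", reference_gtf,   # reference annotation with gene_name attributes
        "-s", genome_fasta,    # genome sequence (the -s switch from the post)
        "-o", out_prefix,      # prefix for the output files
        *transcript_gtfs,      # one or more Cufflinks transcripts.gtf files
    ]


# Hypothetical file names, for illustration only.
cmd = cuffcompare_command(
    "reference.gtf", "genome.fa", "cuffcmp", ["sample1/transcripts.gtf"]
)
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) on a machine where cuffcompare is on the PATH.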
Old 03-20-2012, 11:02 PM   #7
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default is genes.gtf the correct annotation file?

Hi all,
I had the same problem, but figured out that I had to run TopHat with the Ensembl "genes.gtf" file, which is what I did.
All works fine until I want to run Cuffmerge:
There I'm getting the following error:

Error: duplicate GFF ID 'ENSMUST00000098282' encountered!
[FAILED]

In another set I was running, I get the same error with a different ENSMUST number.
Any clue on what's wrong here? Obviously there are multiple lines with that ID, but why did it go all right with TopHat then?

Thanks!
K.
Old 03-31-2012, 10:44 PM   #8
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default

OK, I found the issue. Turns out I was being too "efficient".

I am comparing 2 times 2 datasets, and I was already running the cuffmerge on the second set while the run on the first dataset was still ongoing (wanted to be fast...).
However, I forgot to change the directory name, so both runs saved to the same dir... and ran into problems.
It was all solved when I assigned them different directories...

Karel
Old 04-12-2012, 12:18 PM   #9
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Sorry, I know this is a basic question comparatively, but can someone give me a quick take on the gene IDs? I ran cuffdiff to get the significantly differentially expressed genes. I want to view them in DAVID or Ensembl to check out the actual pathways. I saved all of my 300 or so genes in a txt file, with many genes having more than one unique ID (e.g. B1AKN3,NP_001036147,Q9P2R6,uc001aph.1), and uploaded it to DAVID. However, it could only "ambiguously" match 25 of these genes. What kind of gene IDs are these? There appears to be more than one kind. How do you view your pathways?
Old 04-13-2012, 11:54 AM   #10
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

bump

Sorry, I'm just having trouble working with these gene names. Some are UniProt, some are RefSeq, some are UCSC. How do you guys do it? DAVID has no idea what I'm uploading. What do you guys use? And does it recognize all the gene names?
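
One way to separate the mixed ID types before uploading is to sort them by pattern. The sketch below is a heuristic only: the regular expressions approximate the published accession formats (UniProt accessions, RefSeq prefixes such as NP_/NM_, UCSC "uc" known-gene IDs, Ensembl stable IDs), and the function names classify_id and split_by_type are made up for illustration.

```python
import re
from collections import defaultdict

# Approximate patterns; anything that fits none of them is reported as "unknown".
ID_PATTERNS = [
    ("Ensembl", re.compile(r"^ENS[A-Z]*[GTP]\d{11}(?:\.\d+)?$")),
    ("RefSeq", re.compile(r"^(?:NM|NR|NP|XM|XR|XP|NC|NG)_\d+(?:\.\d+)?$")),
    ("UCSC", re.compile(r"^uc\d{3}[a-z]{3}(?:\.\d+)?$")),
    ("UniProt", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    )),
]


def classify_id(identifier):
    """Return the first database whose pattern matches, else 'unknown'."""
    for name, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


def split_by_type(field):
    """Split a comma-separated ID field and group the IDs by guessed database."""
    groups = defaultdict(list)
    for identifier in field.split(","):
        identifier = identifier.strip()
        if identifier:
            groups[classify_id(identifier)].append(identifier)
    return dict(groups)


print(split_by_type("B1AKN3,NP_001036147,Q9P2R6,uc001aph.1"))
```

Each group can then be submitted with the matching identifier type selected, rather than as one mixed list.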
Old 04-15-2012, 11:15 AM   #11
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Please help...

I'm sorry, I'm just so confused about this. Why is more than one gene listed per entry in promoters.diff, or tss_group.diff, or even gene_exp.diff? I just don't get it. It says right there in the Cufflinks manual, and I'm quoting:

"Transcripts with the same gene_id are part of the same gene group, and similarly, those with the same tss_id and p_id are part of the same primary transcript group and CDS group. "

How can one transcription start site be associated with more than one gene?? Likewise with promoters and CDS?

Sincere thanks to anyone that can help me with this!

Last edited by billstevens; 04-15-2012 at 01:12 PM.
Old 04-17-2012, 08:04 PM   #12
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Hey guys,

So I have this plan for analyzing my data using DAVID, and I was hoping someone might say how they do their differential expression gene analysis. From the gene_exp.diff output file, I take the significant genes and then strip the suffix after the dot from each ID (e.g. uc0012w.1 becomes uc0012w), and then I load the result into DAVID. I got rid of the suffixes because DAVID often couldn't find an ID with the suffix but did recognize it without, and I imagine both forms refer to the same gene. I found that DAVID recognizes all genes that have been reviewed. This seems like a nice and straightforward method for obtaining my network.

Am I totally off-base? Anyone?
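
The suffix-stripping step described above could be sketched in Python as follows. This is only an illustration: the function names are made up, and it assumes the part to drop is always a trailing dot followed by digits.

```python
import re

# A trailing ".<digits>" is treated as the suffix to drop (assumption).
SUFFIX_RE = re.compile(r"\.\d+$")


def strip_suffix(identifier):
    """Drop a trailing '.N' from an ID, e.g. 'uc001aph.1' -> 'uc001aph'."""
    return SUFFIX_RE.sub("", identifier)


def unique_stripped(identifiers):
    """Strip suffixes and remove duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for identifier in identifiers:
        stripped = strip_suffix(identifier.strip())
        if stripped and stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result


ids = "B1AKN3,NP_001036147,Q9P2R6,uc001aph.1".split(",")
print(unique_stripped(ids))
```

De-duplicating after stripping matters because two suffixed forms of the same ID collapse to one entry.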
All times are GMT -8.