SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
What is the meaning density in CummeRbund plots? Carlos Borroto Bioinformatics 16 12-29-2016 09:05 PM
cummeRbund get isoforms paolo.kunder Bioinformatics 6 04-23-2014 07:59 AM
Biological replicates with cuffdiff, cummeRbund turnersd Bioinformatics 15 11-19-2012 05:59 AM
CummeRbund error without CDS dvanic Bioinformatics 5 03-12-2012 09:24 AM
gene name in cufdiif output? semna Bioinformatics 3 12-07-2010 03:41 AM

Reply
 
Thread Tools
Old 03-12-2012, 12:50 PM   #1
turnersd
Senior Member
 
Location: Charlottesville, VA

Join Date: May 2011
Posts: 112
Default cummeRbund - how to get gene name in diffData output

I'm going through the tophat/cufflinks/cuffmerge/cuffdiff/cummeRbund protocol just published at Nature Protocols. I'm wondering how to get a different identifier (e.g. the gene symbol instead of XLOC) in the diffData(genes(cuff)) output for printing tables, annotating plots, etc.

Code:
> cuff <- readCufflinks()
> gene_diff_data <- diffData(genes(cuff))
> sig_gene_data <- subset(gene_diff_data, significant=="yes")
> head(sig_gene_data)
        gene_id sample_1 sample_2 status   value_1   value_2 ln_fold_change test_stat     p_value     q_value significant
3   XLOC_000003       C1       C2     OK   48.7791   77.9603       0.676476 -11.81010 0.00000e+00 0.00000e+00         yes
60  XLOC_000060       C1       C2     OK   65.0337  107.4470       0.724360 -39.42460 0.00000e+00 0.00000e+00         yes
81  XLOC_000081       C1       C2     OK   26.8576   23.4374      -0.196521   4.68763 2.76389e-06 1.26818e-04         yes
175 XLOC_000175       C1       C2     OK 2724.4700 2514.0400      -0.115971   7.49516 6.61693e-14 4.70596e-12         yes
237 XLOC_000237       C1       C2     OK   23.3386   19.1596      -0.284646   3.73237 1.89689e-04 6.18840e-03         yes
250 XLOC_000250       C1       C2     OK   24.0661   42.1466       0.808413  -9.23741 0.00000e+00 0.00000e+00         yes
>
turnersd is offline   Reply With Quote
Old 03-13-2012, 10:28 AM   #2
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Hi turnersd,
The workflow for cummeRbund has been simplified a bit since the paper was submitted. The recommended approach to this (for cummeRbund 1.1.3 or greater) is as follows

Code:
> cuff <- readCufflinks()

#Retrive significant gene IDs (XLOC) with a pre-specified alpha
> diffGeneIDs <- getSig(cuff,level="genes",alpha=0.05)

#Use returned identifiers to create a CuffGeneSet object with all relevant info for given genes
> diffGenes<-getGenes(cuff,diffGeneIDs)

#gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
> featureNames(diffGenes)
fpkm(), fpkmMatrix(), features(), and diffData() are all available methods for the CuffGeneSet object as well.

Cheers,
Loyal
lgoff is offline   Reply With Quote
Old 03-25-2012, 10:35 PM   #3
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default weird...

cummeRbund tells me this:

> diffGeneIDs <-getSig(cuff_data_Input,level="genes",alpha=0.05)
Error: could not find function "getSig"

Did I do something wrong?
kareldegendt is offline   Reply With Quote
Old 03-25-2012, 10:38 PM   #4
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default

OK, got it, I got an older version (0.1.3)....
But where can I find the latest version then?

K.

Last edited by kareldegendt; 03-25-2012 at 10:45 PM. Reason: new problem
kareldegendt is offline   Reply With Quote
Old 03-26-2012, 05:03 PM   #5
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Hi,
you can find the freshest cummeRbund at:

http://compbio.mit.edu/cummeRbund/

Also, be sure to sign up for the bowtie-bio-announce mailing list if you would like to be updated to new releases/features.

Cheers,
Loyal
lgoff is offline   Reply With Quote
Old 03-26-2012, 09:40 PM   #6
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Hi Loyal,

I'm facing the same situation as kareldegendt.

Can you provide a set of instructions for upgrading to a newer version when cummerRbund is already on a system (64-bit Linux)? This would be very helpful for R novices like myself. I've tried downloading and unzipping the tarball and using make but that doesn't seem to work.

Thanks,

Shurjo
shurjo is offline   Reply With Quote
Old 03-27-2012, 11:22 AM   #7
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default

OK, here's what I did:

I first downloaded the cummeRbund Mac OS X binary (for version 1.1.5)

Then I did this in R:
> install.packages('/Users/kareldegendt/Downloads/cummeRbund_1.1.5.tgz',repos = NULL)

then I added cummeRbund to the current session:

>library(cummeRbund)

It loaded abunch of things and told me that the package cummeRbund was built under R version 2.15.00
No clue if that's gonna hurt anything. I'll test and if so, I'll probably upgrade R...

best,
Karel
kareldegendt is offline   Reply With Quote
Old 03-27-2012, 11:23 AM   #8
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default

OK, this works BUT:

I still did not get gene symbols (like f.e. Akt or Bact) but the XLOC_000.... and uc00.... names...
Not really a solution :-/

K.

Last edited by kareldegendt; 03-27-2012 at 01:04 PM. Reason: typo
kareldegendt is offline   Reply With Quote
Old 03-27-2012, 03:39 PM   #9
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Hi Karel,

Thanks for the tips. They worked for me as well.

Regards,

Shurjo
shurjo is offline   Reply With Quote
Old 03-28-2012, 01:50 AM   #10
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

You could do it with merge:
Quote:
cuff <- readCufflinks()

#Retrive significant gene IDs (XLOC) with a pre-specified alpha
diffGeneIDs <- getSig(cuff,level="genes",alpha=0.05)

#Use returned identifiers to create a CuffGeneSet object with all relevant info for given genes
diffGenes<-getGenes(cuff,diffGeneIDs)

#gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
names<-featureNames(diffGenes)
row.names(names)=names$tracking_id
diffGenesNames<-as.matrix(names)
diffGenesNames<-diffGenesNames[,-1]

# get the data for the significant genes
diffGenesData<-diffData(diffGenes)
row.names(diffGenesData)=diffGenesData$gene_id
diffGenesData<-diffGenesData[,-1]

# merge the two matrices by row names
diffGenesOutput<-merge(diffGenesNames,diffGenesData,by="row.names")
Thomas Doktor is offline   Reply With Quote
Old 03-28-2012, 06:34 AM   #11
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Hi All,
Sorry I missed the earlier posts in this thread. Sorry for the troubles in getting updates to cummeRbund installed, but it appears that you were both successful.

The reason for this setup is that a 'stable' version of cummeRbund (v1.0.0) has to be maintained with the current 'release' version of Bioconductor. Active development of new features (including getSig, etc) is done on the 'development' version of Bioconductor which is attached to the 'development' release of R (currently v2.15). When the new version of Bioconductor is released (in the next few weeks), all of the development features of cummeRbund will be available using the standard BioC install methods. The benefit of using this is obviously earlier access to newer features but the drawback is of course a moderate amount of instability and growing pains. You can also install the development version of R and the most recent version of cummeRbund will be installed by BiocLite() by default.

The way I am trying to write cummeRbund, the 'development' versions should also be compatible with earlier versions of R (at least 2.13 or greater).

To answer the question of gene names directly, they can always be accessed as part of the 'features' data.frame returned from a call to features() on a CuffData or CuffGeneSet object. The 'featureNames()' function is just a shorthand that in most cases just returns the gene_id and gene_short_names (when present) only.

Another common way to represent the FPKM data is in a 'matrix' format of featuresXconditions. You can use fpkmMatrix(myGeneSet,fullnames=T) to generate this matrix.

The general problem with using gene names is that they are inherently non-unique (despite efforts to enforce this). This causes significant problems for a lot of the behind-the-scenes data wrangling in both cufflinks/cuffdiff and cummeRbund. This is why the XLOC_* and TCONS_* ids are essential to track individual features. Our suggestion, as mentioned above, is to use the 'features()' method to get all annotation associated with features in a CuffData, CuffGeneSet, or CuffGene object. The output of this method should be a standard R data.frame on which you can do any manipulations/merges that you would like. Please let me know if you have specific workflows in which you are having difficulty mapping these ids to gene names and I can help with the syntax.

Cheers!

Loyal
lgoff is offline   Reply With Quote
Old 03-28-2012, 06:35 AM   #12
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Quote:
Originally Posted by Thomas Doktor View Post
You could do it with merge:
Thanks Thomas for posting this solution...

Cheers,
Loyal
lgoff is offline   Reply With Quote
Old 03-30-2012, 06:12 PM   #13
Kittykat22
Junior Member
 
Location: sydney

Join Date: Mar 2012
Posts: 4
Default

Hi everyone,
very helpful responses so far! Retrieving the gene names works great for me, but unfortunately just for differentially expressed genes. If I want to look at splicing or isoforms I cannot get it to work. I just get a list of identifiers or, if I try with the above path and use "Isoforms" or "Splicing" instead of "Genes" the list is empty.
I am very new to all these things so not sure what I am doing wrong or what I have to do to determine the significantly differentially expressed isoforms, splicing, promoters,...and get a list including the gene names.
Will be very greatful for any help!
Cheers,
K
Kittykat22 is offline   Reply With Quote
Old 03-31-2012, 10:35 PM   #14
kareldegendt
Junior Member
 
Location: San Diego

Join Date: Feb 2012
Posts: 9
Default

Hi all,
It finally worked for me too. I re-ran tophat and cufflinks with the refFlat file as an annotation file and that did the trick :-)

Thanks for all your help!!

Karel
kareldegendt is offline   Reply With Quote
Old 04-01-2012, 04:15 AM   #15
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Quote:
Originally Posted by Kittykat22 View Post
Hi everyone,
very helpful responses so far! Retrieving the gene names works great for me, but unfortunately just for differentially expressed genes. If I want to look at splicing or isoforms I cannot get it to work. I just get a list of identifiers or, if I try with the above path and use "Isoforms" or "Splicing" instead of "Genes" the list is empty.
I am very new to all these things so not sure what I am doing wrong or what I have to do to determine the significantly differentially expressed isoforms, splicing, promoters,...and get a list including the gene names.
Will be very greatful for any help!
Cheers,
K
Hi Kittykat22,
This is something that I'm actively working on including in a future release of cummeRbund. Right now, it's very easy to retrieve all information for a particular gene, however, several people have asked for a 'getFeatures()' method (similar to getGenes()) that would retrieve just the information you are looking for. I will try to post to this thread when I have it working, but also, please keep checking the website for an update.

As an alternative, you can get the significantly different isoforms list by using getSig (level='isoforms'). And you can retrieve ALL isoform annotation by using 'features(isoforms(cuff))' and/or 'fpkm(isoforms(cuff))'. You should be able to filter those data.frames using the list generated from the call to getSig().

Cheers,
Loyal
lgoff is offline   Reply With Quote
Old 04-01-2012, 04:18 AM   #16
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default

Kittykat22,

e.g.
Code:
library(cummeRbund)
cuff<-readCufflinks()

allIsoformFPKMs<-fpkm(isoforms(cuff))
allIsoformFeatures<-features(isoforms(cuff))

sigIsoforms<-getSig(cuff,level='isoforms')

sigIsoformFeatures<-allIsoformFeatures[allIsoformFeatures$isoform_id %in% sigIsoforms,]
sigIsoformFPKMs<-allIsoformFPKMs[allIsoformFPKMs$isoform_id %in% sigIsoforms,]
Cheers,
Loyal
lgoff is offline   Reply With Quote
Old 04-01-2012, 06:29 PM   #17
Kittykat22
Junior Member
 
Location: sydney

Join Date: Mar 2012
Posts: 4
Default

Hi Loyal,

thanks so much for your help and your quick response! I can get the gene identifier now and even though I havent figured out how to include the gene_short_name, Im sure Ill get there!

K
Kittykat22 is offline   Reply With Quote
Old 04-03-2012, 05:09 PM   #18
mondongho
Junior Member
 
Location: Glasgow, Scotland

Join Date: Mar 2012
Posts: 6
Default

Hi All,

I know that Cuffdiff does pairwise differential expression testing, however I have more than two groups A, B ... F. I am not interested in all possible group pairings, just: A vs. B, C vs. D and E vs. F.

Follow the protocol published in nature my code would be:

>gene.diff.data <-diffData(genes(cuff_data), 'A','B')
>sig.gene.data <-subset(gene.diff.data, (significant=='yes'))
>nrow(sig.gene.data)
>write.table(sig.gene.data, 'diff_genes_AvsB.txt', sep= '\t', row.names = F, col.names =T, quote = F)

However, I dont get the "gene_short_name" in my final table.

Then follow the discussion in this post, I'd start with:

>diffGeneIDs <- getSig(cuff,level='genes','A','B',alpha=0.05)

I followed the code posted here but after:

>row.names(diffGenesData)=diffGenesData$gene_id

I get:

Quote:
Error in `row.names<-.data.frame`(`*tmp*`, value = c("XLOC_000028", "XLOC_000028", : duplicate 'row.names' are not allowed
In addition: Warning message:non-unique values when setting 'row.names': ‘XLOC_... [... truncated]
any idea what I'm doing wrong.

Thanks
mondongho is offline   Reply With Quote
Old 04-05-2012, 04:39 AM   #19
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

Quote:
Originally Posted by mondongho View Post
Hi All,

I know that Cuffdiff does pairwise differential expression testing, however I have more than two groups A, B ... F. I am not interested in all possible group pairings, just: A vs. B, C vs. D and E vs. F.

Follow the protocol published in nature my code would be:

>gene.diff.data <-diffData(genes(cuff_data), 'A','B')
>sig.gene.data <-subset(gene.diff.data, (significant=='yes'))
>nrow(sig.gene.data)
>write.table(sig.gene.data, 'diff_genes_AvsB.txt', sep= '\t', row.names = F, col.names =T, quote = F)

However, I dont get the "gene_short_name" in my final table.

Then follow the discussion in this post, I'd start with:

>diffGeneIDs <- getSig(cuff,level='genes','A','B',alpha=0.05)

I followed the code posted here but after:

>row.names(diffGenesData)=diffGenesData$gene_id

I get:



any idea what I'm doing wrong.

Thanks
I got the exact same error, and I also have no idea what I'm doing wrong. My goal is to switch out the gene identifier ID with the gene short name in my write table.

Assuming, I can get this fixed and merge the data using Thomas' code, would my next step be:

write.table(diffGenesOutput, 'name.txt', sep= '\t', row.names = F, col.names =T, quote = F)

With diffGenesOutput being the merged file that was created with Thomas' code.

I am also very clearly a newbie
billstevens is offline   Reply With Quote
Old 04-05-2012, 07:05 AM   #20
billstevens
Senior Member
 
Location: Baltimore

Join Date: Mar 2012
Posts: 120
Default

OK, so the error was because of the duplicate IDs, so I guess I may need a unique command or something.

But I just went around this whole thing the long way, and from Loyal's suggestion, I used featureNames. So now I have two tables, one that has all the information like the Xloc ID, fold change, Pvalue, etc., and other that just has the Xloc ID and the tracking ID. So I can merge these together in access, but what are these tracking IDs? How do I go from these to the names of genes?
billstevens is offline   Reply With Quote
Reply

Tags
cuffdiff, cufflinks, cuffmerge, cummerbund, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:10 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO