View Single Post
Old 03-17-2014, 07:09 PM   #33
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Your kegg.gs has no problem, and it is the same for both datasets:
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"


The problem resides in yoru expression data. In your problematic dataset (let’s call it dataset 1), you have only 7434 entries (not exactly genes):
> str(cuff.fc)
Named num [1:7434] 0.6134 0 0.3767 0.0177 -2.1022 ...


In your good dataset 2, you have 344025 entries:
> str(cuff.fc)
Named num [1:344025] 1.279 0 0.155 0.806 -0.958 ...


Pig has 57700 genes in total based on NCBI data. Assume your dataset 2 has a full or decent coverage on pig genes, each pig gene maps to about 6 entries in this dataset. Assume the same level of redundancy, there are only ~ 1200 unique genes in your dataset 1. In other words, this is likely a sublist of selected or significant genes. Pathway or gene set analysis like GAGE requires a comprehensive and balance coverage of genes for your research speices, e.g. the full list of genes measured in a microarray or RNA-seq study, usually in the order of several thousand or tens of thousands of genes.
Therefore, very likely you gage analysis of dataset 1 doesn’t yield any significant pathways due to the biased preslected/short gene list with only ~ 1200 genes. Hence you get NA pathway IDs when you select significant pathways. Your dataset 2 does give you some sensible results because you seem to have a good and balanced coverage of pig genes there.

However, you should really make sure each gene ID has only 1 expression value when doing gene set or pathway analysis because genes are treated as independent variables. In the GAGE/Pathview native workflow or joint workflow with DESeq/DESeq2/edgeR/limma, we explicitly mapped reads to Entrez Genes and summarized expression counts per genes. But your data were mapped and processed by Cufflinks likely to some transcript sequences like RefSeq mRNA’s (with Entrez Gene annotation). You should summarize the expression data for multiple transcripts of the same gene into 1 unified expression level, and then conduct pathway analysis. We describe how to do so in a secondary gage tutorial, “Gene set and data preparation”, please check Section 5-gene or transcript ID conversion:
http://bioconductor.org/packages/rel...c/dataPrep.pdf


Quote:
Originally Posted by shocker8786 View Post
I retried the code and was still not successful with one of my datasets. I am sure that the code I am using is the same for both datasets (I have been copying and pasting the code for both datasets). Below is the code and the results for the commands you asked me to enter for the dataset that did not work properly.

> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
> str(cuff.fc)
Named num [1:7434] 0.6134 0 0.3767 0.0177 -2.1022 ...
- attr(*, "names")= chr [1:7434] "100620481" "100512748" "100513321" "100154202" ...
> head(cuff.fc)
100620481 100512748 100513321 100154202 733660 100153485
0.6133590 0.0000000 0.3766890 0.0177489 -2.1022400 -0.2612780
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"

$`ssc02010 ABC transporters`
[1] "100048963" "100127449" "100144586"

$`ssc03008 Ribosome biogenesis in eukaryotes`
[1] "100151786" "100151831" "100152658"

> head(fc.kegg.p$greater)
p.geomean stat.mean p.val
ssc04064 NF-kappa B signaling pathway 0.02469658 1.990928 0.02469658
ssc04620 Toll-like receptor signaling pathway 0.03526210 1.828538 0.03526210
ssc04066 HIF-1 signaling pathway 0.04658115 1.693048 0.04658115
ssc04640 Hematopoietic cell lineage 0.04929811 1.670237 0.04929811
ssc04670 Leukocyte transendothelial migration 0.04995310 1.657766 0.04995310
ssc04610 Complement and coagulation cascades 0.07305357 1.467898 0.07305357
q.val set.size exp1
ssc04064 NF-kappa B signaling pathway 0.8234987 53 0.02469658
ssc04620 Toll-like receptor signaling pathway 0.8234987 58 0.03526210
ssc04066 HIF-1 signaling pathway 0.8234987 59 0.04658115
ssc04640 Hematopoietic cell lineage 0.8234987 49 0.04929811
ssc04670 Leukocyte transendothelial migration 0.8234987 63 0.04995310
ssc04610 Complement and coagulation cascades 0.8234987 44 0.07305357

Below are the results for the dataset that did work properly:

Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04151.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04010.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04080.cuff.png
> str(cuff.fc)
Named num [1:344025] 1.279 0 0.155 0.806 -0.958 ...
- attr(*, "names")= chr [1:344025] "100620481" "100512748" "100513321" "100154202" ...
> head(cuff.fc)
100620481 100512748 100513321 100154202 100156632 733660
1.278660 0.000000 0.154690 0.806045 -0.958121 -0.103950
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"

$`ssc02010 ABC transporters`
[1] "100048963" "100127449" "100144586"

$`ssc03008 Ribosome biogenesis in eukaryotes`
[1] "100151786" "100151831" "100152658"

> head(fc.kegg.p$greater)
p.geomean stat.mean p.val
ssc00770 Pantothenate and CoA biosynthesis 0.9995680 -3.988628 0.9995680
ssc00630 Glyoxylate and dicarboxylate metabolism 0.9995962 -3.942065 0.9995962
ssc00052 Galactose metabolism 0.9999012 -4.281375 0.9999012
ssc04740 Olfactory transduction 0.9999254 -4.576407 0.9999254
ssc00100 Steroid biosynthesis 0.9999490 -4.852319 0.9999490
ssc00970 Aminoacyl-tRNA biosynthesis 0.9999580 -4.829070 0.9999580
q.val set.size exp1
ssc00770 Pantothenate and CoA biosynthesis 1 10 0.9995680
ssc00630 Glyoxylate and dicarboxylate metabolism 1 11 0.9995962
ssc00052 Galactose metabolism 1 15 0.9999012
ssc04740 Olfactory transduction 1 12 0.9999254
ssc00100 Steroid biosynthesis 1 11 0.9999490
ssc00970 Aminoacyl-tRNA biosynthesis 1 12 0.9999580
bigmw is offline   Reply With Quote