SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Generally Applicable Gene-set Analysis (GAGE) problem wilson90 Bioinformatics 1 08-19-2013 09:28 AM
Gene set enrichment analysis of RNA-Seq data jel4h Bioinformatics 1 06-21-2012 04:25 AM

Reply
 
Thread Tools
Old 03-12-2014, 07:10 AM   #21
shriram
Member
 
Location: UK

Join Date: May 2010
Posts: 13
Default

you have to use cuff.fc instead of exp.fc I guess.
This is R issue as you didn't had this object
shriram is offline   Reply With Quote
Old 03-12-2014, 07:32 AM   #22
shocker8786
Member
 
Location: Urbana Illinois

Join Date: Jan 2013
Posts: 28
Default

changing range(exp.fc) to range(cuff.fc) results in the following result printed by R:
[1] -Inf Inf

So is this a typo in the RNAseq workflow, or am I already supposed to have exp.fc defined? I'm trying to go through these pdfs, but I'm confused as to what this step is doing.
shocker8786 is offline   Reply With Quote
Old 03-12-2014, 05:36 PM   #23
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

This is indeed a typo, it should be cuff.fc instead of exp.fc at this step. And you will assign cuff.fc to exp.fc in a later step, and then work exclusively on exp.fc:
range(cuff.fc)
#remove the -Inf and Inf values, which block the downstream analysis
cuff.fc[cuff.fc>10]=10
cuff.fc[cuff.fc< -10]=-10
exp.fc=cuff.fc
out.suffix="cuff"


BTW, the demo example uses human data, and you have to specify organism to be Sus scrofa (pig) for your data when you map gene sybmols to entrez gene IDs as below:
gnames.eg=pathview::id2eg(gnames, category ="symbol", org=="Ss")

Last edited by bigmw; 03-13-2014 at 12:40 PM.
bigmw is offline   Reply With Quote
Old 03-13-2014, 01:26 AM   #24
shocker8786
Member
 
Location: Urbana Illinois

Join Date: Jan 2013
Posts: 28
Default

Thanks for your reply, and for informing me about the org="Ss" command. Unfortunately I am still unable to make it through the workflow without error. Below is my code and final output. Do you see what the issue could be? Thanks.

> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> data(kegg.gs)
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
shocker8786 is offline   Reply With Quote
Old 03-13-2014, 12:38 PM   #25
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Looks like your downloading had some problem and you didn’t download any pathway data file. Can you confirm on this?
bigmw is offline   Reply With Quote
Old 03-14-2014, 01:29 AM   #26
shocker8786
Member
 
Location: Urbana Illinois

Join Date: Jan 2013
Posts: 28
Default

I think you are right, but I'm not sure how exactly to download the pathway data files. I thought the step you told me to change (below) would download the package I need, but it fails:

gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.ssc.eg.db
Bioconductor version 2.13 (BiocInstaller 1.12.0), ?biocLite for help
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.13 (BiocInstaller 1.12.0), R version 3.0.3.
Installing package(s) 'org.Ss.eg.db'
Loading required package: org.Ss.eg.db
Error in pathview::id2eg(gnames, category = "symbol", org = "Ss") :
Fail to install/load gene annotation package org.Ss.eg.db!
In addition: Warning messages:
1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘org.Ss.eg.db’
2: package ‘org.ssc.eg.db’ is not available (for R version 3.0.3)
3: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘org.Ss.eg.db’

I changed org= "Ss" to "Ssc" and "ssc", but the result is the same, there does not seem to be a database with this name. However on the KEGG website, pig is listed as ssc. Is this the proper way to download the database, or is there another step I am missing? Thanks.
shocker8786 is offline   Reply With Quote
Old 03-14-2014, 05:53 PM   #27
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

The files you downloaded are named sscNA.xml hence your path.ids2 are all NA’s. In other words, no real pathways were selected in your gage analysis step. You gage analysis had this problem because you used kegg.gs which are human gene set data not the pig data.
What you did:
data(kegg.gs)
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)

You should generated pig gene set data first using kegg.gsets function in gage package:
kg. ssc=kegg.gsets("ssc")
kegg.gs=kg. ssc$kg.sets[kg. ssc$sigmet.idx]
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)

And this is the only problem in your original analysis session I quoted below. It will work once you get this right.
Please always pay attention to the species matching of your own data, gene set or pathway data as documented in gage and pathview packages. Actually you will find everything in the pacakge tutorials and documentations for functions you work with like gage or pathview etc:
http://bioconductor.org/packages/rel...html/gage.html
http://bioconductor.org/packages/rel.../pathview.html


Quote:
Originally Posted by shocker8786 View Post
Thanks for your reply, and for informing me about the org="Ss" command. Unfortunately I am still unable to make it through the workflow without error. Below is my code and final output. Do you see what the issue could be? Thanks.

> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> data(kegg.gs)
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
bigmw is offline   Reply With Quote
Old 03-14-2014, 06:07 PM   #28
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Your original code worked well in this step as show in your reponse #24 above.
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

This line above had no problem. Changing org= "Ss" to "Ssc" and "ssc" caused all the errors. There is no way you will get the input of org="Ss" and message with org.ssc.eg.db like below:
>gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.ssc.eg.db


Quote:
Originally Posted by shocker8786 View Post
I think you are right, but I'm not sure how exactly to download the pathway data files. I thought the step you told me to change (below) would download the package I need, but it fails:

gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.ssc.eg.db
Bioconductor version 2.13 (BiocInstaller 1.12.0), ?biocLite for help
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.13 (BiocInstaller 1.12.0), R version 3.0.3.
Installing package(s) 'org.Ss.eg.db'
Loading required package: org.Ss.eg.db
Error in pathview::id2eg(gnames, category = "symbol", org = "Ss") :
Fail to install/load gene annotation package org.Ss.eg.db!
In addition: Warning messages:
1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘org.Ss.eg.db’
2: package ‘org.ssc.eg.db’ is not available (for R version 3.0.3)
3: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘org.Ss.eg.db’

I changed org= "Ss" to "Ssc" and "ssc", but the result is the same, there does not seem to be a database with this name. However on the KEGG website, pig is listed as ssc. Is this the proper way to download the database, or is there another step I am missing? Thanks.
bigmw is offline   Reply With Quote
Old 03-15-2014, 05:17 AM   #29
shocker8786
Member
 
Location: Urbana Illinois

Join Date: Jan 2013
Posts: 28
Default

Thank you for your help. I have 2 data sets that I am trying to perform this analysis on (all pig), and I was indeed able to produce the results files as expected by changing those lines of code for one of the data sets (the one I had not tried this on previously):

> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04151.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04010.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04080.cuff.png

However, with my original data set, I still receive an error after entering the code exactly the same way:

> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
[1] "Downloading xml files for sscNA, 1/1 pathways.."
[1] "Downloading png files for sscNA, 1/1 pathways.."
Download of sscNA xml and png files failed!
Failed to download KEGG xml/png files, sscNA skipped!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Warning message:
In download.file(png.url, png.target, quiet = T, mode = "wb") :
cannot open: HTTP status was '404 Not Found'

It seems that R is remembering that I was using the human kegg database before, and now I cannot get it to use the pig one. I deleted the sscNA.xml files and unlinked the previous workspace using the unlink(".RData") command, but am still unable to produce the result.
shocker8786 is offline   Reply With Quote
Old 03-15-2014, 06:08 PM   #30
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

R does not remember the old kegg.gs, if you assign the name to a new gene set data. If you are sure your code was right, then very likely your data had some problem. To locate the problem, you can restart a new R session and run the same analysis code on your problematic data till the pathview step. If you still get the same error message, run the following code, and post your output here so that we can check what happened to your data:

str(cuff.fc)
head(cuff.fc)
lapply(kegg.gs[1:3], head, 3)
head(fc.kegg.p$greater)
bigmw is offline   Reply With Quote
Old 03-17-2014, 02:11 AM   #31
shocker8786
Member
 
Location: Urbana Illinois

Join Date: Jan 2013
Posts: 28
Default

I retried the code and was still not successful with one of my datasets. I am sure that the code I am using is the same for both datasets (I have been copying and pasting the code for both datasets). Below is the code and the results for the commands you asked me to enter for the dataset that did not work properly.

> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
> str(cuff.fc)
Named num [1:7434] 0.6134 0 0.3767 0.0177 -2.1022 ...
- attr(*, "names")= chr [1:7434] "100620481" "100512748" "100513321" "100154202" ...
> head(cuff.fc)
100620481 100512748 100513321 100154202 733660 100153485
0.6133590 0.0000000 0.3766890 0.0177489 -2.1022400 -0.2612780
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"

$`ssc02010 ABC transporters`
[1] "100048963" "100127449" "100144586"

$`ssc03008 Ribosome biogenesis in eukaryotes`
[1] "100151786" "100151831" "100152658"

> head(fc.kegg.p$greater)
p.geomean stat.mean p.val
ssc04064 NF-kappa B signaling pathway 0.02469658 1.990928 0.02469658
ssc04620 Toll-like receptor signaling pathway 0.03526210 1.828538 0.03526210
ssc04066 HIF-1 signaling pathway 0.04658115 1.693048 0.04658115
ssc04640 Hematopoietic cell lineage 0.04929811 1.670237 0.04929811
ssc04670 Leukocyte transendothelial migration 0.04995310 1.657766 0.04995310
ssc04610 Complement and coagulation cascades 0.07305357 1.467898 0.07305357
q.val set.size exp1
ssc04064 NF-kappa B signaling pathway 0.8234987 53 0.02469658
ssc04620 Toll-like receptor signaling pathway 0.8234987 58 0.03526210
ssc04066 HIF-1 signaling pathway 0.8234987 59 0.04658115
ssc04640 Hematopoietic cell lineage 0.8234987 49 0.04929811
ssc04670 Leukocyte transendothelial migration 0.8234987 63 0.04995310
ssc04610 Complement and coagulation cascades 0.8234987 44 0.07305357

Below are the results for the dataset that did work properly:

Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04151.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04010.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04080.cuff.png
> str(cuff.fc)
Named num [1:344025] 1.279 0 0.155 0.806 -0.958 ...
- attr(*, "names")= chr [1:344025] "100620481" "100512748" "100513321" "100154202" ...
> head(cuff.fc)
100620481 100512748 100513321 100154202 100156632 733660
1.278660 0.000000 0.154690 0.806045 -0.958121 -0.103950
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"

$`ssc02010 ABC transporters`
[1] "100048963" "100127449" "100144586"

$`ssc03008 Ribosome biogenesis in eukaryotes`
[1] "100151786" "100151831" "100152658"

> head(fc.kegg.p$greater)
p.geomean stat.mean p.val
ssc00770 Pantothenate and CoA biosynthesis 0.9995680 -3.988628 0.9995680
ssc00630 Glyoxylate and dicarboxylate metabolism 0.9995962 -3.942065 0.9995962
ssc00052 Galactose metabolism 0.9999012 -4.281375 0.9999012
ssc04740 Olfactory transduction 0.9999254 -4.576407 0.9999254
ssc00100 Steroid biosynthesis 0.9999490 -4.852319 0.9999490
ssc00970 Aminoacyl-tRNA biosynthesis 0.9999580 -4.829070 0.9999580
q.val set.size exp1
ssc00770 Pantothenate and CoA biosynthesis 1 10 0.9995680
ssc00630 Glyoxylate and dicarboxylate metabolism 1 11 0.9995962
ssc00052 Galactose metabolism 1 15 0.9999012
ssc04740 Olfactory transduction 1 12 0.9999254
ssc00100 Steroid biosynthesis 1 11 0.9999490
ssc00970 Aminoacyl-tRNA biosynthesis 1 12 0.9999580
shocker8786 is offline   Reply With Quote
Old 03-17-2014, 03:04 PM   #32
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Couldnt get gnCnt to work, I read in HTseq files instead like this:

fls <- list.files(path="~/RNAseq/02_MG-myocytes/MG_Myocytes_HTSeq/A1A2pT2D",pattern="*.txt", full.names =T)

#tab separated values with a header
datalist = lapply(fls, function(x)read.table(x, header=F, colClasses=c("NULL", "numeric"), nrow = nrow(read.table(x, header = F)) - 5))

#same header/columns for all files
datafr = do.call("cbind", datalist)

#column names
colnames(datafr) <- c("D102-A1","D102-A2","D104-A1","D104-A2","D117-A1","D117-A2","D121-A1","D121-A2","D153-A1","D153-A2","D155-A1","D155-A2","D161-A1","D161-A2","D162-A1","D162-A2","D167-A1","D167-A2","D173-A1","D173-A2","D176-A1","D176-A2","D177-A1","D177-A2","D179-A1","D179-A2")

#row names
x <- read.table(file = "~/RNAseq/02_MG-myocytes/MG_Myocytes_HTSeq/A1A2pT2D/D102-A1.txt", header=F, colClasses=c( "character", "NULL"), nrow = nrow(read.table(file = "~/RNAseq/02_MG-myocytes/MG_Myocytes_HTSeq/A1A2pT2D/D102-A1.txt", header = F)) - 5)

rownames(datafr) <- x$V1

hnrnp.cnts <- datafr

Last edited by sindrle; 03-18-2014 at 05:52 AM. Reason: Did an alternative route
sindrle is offline   Reply With Quote
Old 03-17-2014, 07:09 PM   #33
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Your kegg.gs has no problem, and it is the same for both datasets:
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"


The problem resides in yoru expression data. In your problematic dataset (let’s call it dataset 1), you have only 7434 entries (not exactly genes):
> str(cuff.fc)
Named num [1:7434] 0.6134 0 0.3767 0.0177 -2.1022 ...


In your good dataset 2, you have 344025 entries:
> str(cuff.fc)
Named num [1:344025] 1.279 0 0.155 0.806 -0.958 ...


Pig has 57700 genes in total based on NCBI data. Assume your dataset 2 has a full or decent coverage on pig genes, each pig gene maps to about 6 entries in this dataset. Assume the same level of redundancy, there are only ~ 1200 unique genes in your dataset 1. In other words, this is likely a sublist of selected or significant genes. Pathway or gene set analysis like GAGE requires a comprehensive and balance coverage of genes for your research speices, e.g. the full list of genes measured in a microarray or RNA-seq study, usually in the order of several thousand or tens of thousands of genes.
Therefore, very likely you gage analysis of dataset 1 doesn’t yield any significant pathways due to the biased preslected/short gene list with only ~ 1200 genes. Hence you get NA pathway IDs when you select significant pathways. Your dataset 2 does give you some sensible results because you seem to have a good and balanced coverage of pig genes there.

However, you should really make sure each gene ID has only 1 expression value when doing gene set or pathway analysis because genes are treated as independent variables. In the GAGE/Pathview native workflow or joint workflow with DESeq/DESeq2/edgeR/limma, we explicitly mapped reads to Entrez Genes and summarized expression counts per genes. But your data were mapped and processed by Cufflinks likely to some transcript sequences like RefSeq mRNA’s (with Entrez Gene annotation). You should summarize the expression data for multiple transcripts of the same gene into 1 unified expression level, and then conduct pathway analysis. We describe how to do so in a secondary gage tutorial, “Gene set and data preparation”, please check Section 5-gene or transcript ID conversion:
http://bioconductor.org/packages/rel...c/dataPrep.pdf


Quote:
Originally Posted by shocker8786 View Post
I retried the code and was still not successful with one of my datasets. I am sure that the code I am using is the same for both datasets (I have been copying and pasting the code for both datasets). Below is the code and the results for the commands you asked me to enter for the dataset that did not work properly.

> cuff.res=read.delim(file="gene_exp.diff", sep="\t")
> cuff.fc=cuff.res$log2.fold_change.
> gnames=cuff.res$gene
> sel=gnames!="-"
> gnames=as.character(gnames[sel])
> cuff.fc=cuff.fc[sel]
> names(cuff.fc)=gnames
> gnames.eg=pathview::id2eg(gnames, category ="symbol", org="Ss")
Loading required package: org.Ss.eg.db

> sel2=gnames.eg[,2]>""
> cuff.fc=cuff.fc[sel2]
> names(cuff.fc)=gnames.eg[sel2,2]
> range(cuff.fc)
[1] -Inf Inf
> cuff.fc[cuff.fc>10]=10
> cuff.fc[cuff.fc< -10]=-10
> exp.fc=cuff.fc
> out.suffix="cuff"
> require(gage)
> kg.ssc=kegg.gsets("ssc")
> kegg.gs=kg.ssc$kg.sets[kg.ssc$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 &
+ !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
> require(pathview)
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(
+ gene.data = exp.fc, pathway.id = pid,
+ species = "ssc", out.suffix=out.suffix))
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
Start tag expected, '<' not found
Parsing ./sscNA.xml file failed, please check the file!
> str(cuff.fc)
Named num [1:7434] 0.6134 0 0.3767 0.0177 -2.1022 ...
- attr(*, "names")= chr [1:7434] "100620481" "100512748" "100513321" "100154202" ...
> head(cuff.fc)
100620481 100512748 100513321 100154202 733660 100153485
0.6133590 0.0000000 0.3766890 0.0177489 -2.1022400 -0.2612780
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"

$`ssc02010 ABC transporters`
[1] "100048963" "100127449" "100144586"

$`ssc03008 Ribosome biogenesis in eukaryotes`
[1] "100151786" "100151831" "100152658"

> head(fc.kegg.p$greater)
p.geomean stat.mean p.val
ssc04064 NF-kappa B signaling pathway 0.02469658 1.990928 0.02469658
ssc04620 Toll-like receptor signaling pathway 0.03526210 1.828538 0.03526210
ssc04066 HIF-1 signaling pathway 0.04658115 1.693048 0.04658115
ssc04640 Hematopoietic cell lineage 0.04929811 1.670237 0.04929811
ssc04670 Leukocyte transendothelial migration 0.04995310 1.657766 0.04995310
ssc04610 Complement and coagulation cascades 0.07305357 1.467898 0.07305357
q.val set.size exp1
ssc04064 NF-kappa B signaling pathway 0.8234987 53 0.02469658
ssc04620 Toll-like receptor signaling pathway 0.8234987 58 0.03526210
ssc04066 HIF-1 signaling pathway 0.8234987 59 0.04658115
ssc04640 Hematopoietic cell lineage 0.8234987 49 0.04929811
ssc04670 Leukocyte transendothelial migration 0.8234987 63 0.04995310
ssc04610 Complement and coagulation cascades 0.8234987 44 0.07305357

Below are the results for the dataset that did work properly:

Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04151.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04010.cuff.png
Working in directory /srv/mds01/shared/Epigenetics/Rnaseq/kyle_results/TJT/tophat_expression/cuffdiff
Writing image file ssc04080.cuff.png
> str(cuff.fc)
Named num [1:344025] 1.279 0 0.155 0.806 -0.958 ...
- attr(*, "names")= chr [1:344025] "100620481" "100512748" "100513321" "100154202" ...
> head(cuff.fc)
100620481 100512748 100513321 100154202 100156632 733660
1.278660 0.000000 0.154690 0.806045 -0.958121 -0.103950
> lapply(kegg.gs[1:3], head, 3)
$`ssc00970 Aminoacyl-tRNA biosynthesis`
[1] "100144457" "100153423" "100155115"

$`ssc02010 ABC transporters`
[1] "100048963" "100127449" "100144586"

$`ssc03008 Ribosome biogenesis in eukaryotes`
[1] "100151786" "100151831" "100152658"

> head(fc.kegg.p$greater)
p.geomean stat.mean p.val
ssc00770 Pantothenate and CoA biosynthesis 0.9995680 -3.988628 0.9995680
ssc00630 Glyoxylate and dicarboxylate metabolism 0.9995962 -3.942065 0.9995962
ssc00052 Galactose metabolism 0.9999012 -4.281375 0.9999012
ssc04740 Olfactory transduction 0.9999254 -4.576407 0.9999254
ssc00100 Steroid biosynthesis 0.9999490 -4.852319 0.9999490
ssc00970 Aminoacyl-tRNA biosynthesis 0.9999580 -4.829070 0.9999580
q.val set.size exp1
ssc00770 Pantothenate and CoA biosynthesis 1 10 0.9995680
ssc00630 Glyoxylate and dicarboxylate metabolism 1 11 0.9995962
ssc00052 Galactose metabolism 1 15 0.9999012
ssc04740 Olfactory transduction 1 12 0.9999254
ssc00100 Steroid biosynthesis 1 11 0.9999490
ssc00970 Aminoacyl-tRNA biosynthesis 1 12 0.9999580
bigmw is offline   Reply With Quote
Old 03-18-2014, 01:27 AM   #34
shocker8786
Member
 
Location: Urbana Illinois

Join Date: Jan 2013
Posts: 28
Default

Thank you very much for that explanation, this is starting to make more sense to me now. I will make the necessary changes and retry. Thanks again!
shocker8786 is offline   Reply With Quote
Old 03-18-2014, 04:14 AM   #35
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Im a retard. Forgot to convert kegg.gs to gene symbols:

data(egSymb)

kegg.gs.sym<-lapply(kegg.gs, eg2sym)
Attached Images
File Type: png Screen Shot 2014-03-18 at 13.12.36.png (29.0 KB, 13 views)

Last edited by sindrle; 03-18-2014 at 05:51 AM. Reason: I have low iq
sindrle is offline   Reply With Quote
Old 03-18-2014, 11:03 AM   #36
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

sindrle,
I couldn’t find your original problem, but if I remember correctly, your summarizeOverlaps step didn’t work. And you did something like:

flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE)
param <- ScanBamParam(flag=flag)
gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, single.end=TRUE, param=param)


I guess you have single end data, so try:
flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=NA)
The following flag line is for paired end data:
flag <- scanBamFlag(isNotPrimaryRead=FALSE, isProperPair=TRUE)

Check help info for details:
?scanBamFlag
bigmw is offline   Reply With Quote
Old 03-18-2014, 11:06 AM   #37
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Ok, thanks!
I was thinking about the same, but I figured since I have already ran HTseq I just used the results from there.

Good to know until next time.
sindrle is offline   Reply With Quote
Old 03-18-2014, 11:09 AM   #38
shriram
Member
 
Location: UK

Join Date: May 2010
Posts: 13
Angry

Hi,
I was looking at the expression data from pathview object and original data supplied to pathview, I found discrepancies in the two values:

Original count data (human data from reference manual):
> cnts.d["120", ]
ERR127302 ERR127303 ERR127304 ERR127305
0.277 0.577 0.441 0.021

output from Pathview object for human data from reference manual
>pv.out.list[1]
ERR127302 ERR127303 ERR127304 ERR127305
0.3952 0.8842 1.628 1.1983

Does pathview transforms the original data before plotting on the kegg pathways?

THanks,
shriram is offline   Reply With Quote
Old 03-18-2014, 06:58 PM   #39
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

If you are look for the gene entry"120" underpv.out.list[[1]]$plot.data.gene, it may or may not be different from the original data, cnts.d, because multiple genes may be mapped to the same nodes in a KEGG pathway. What you found in $plot.data.gene is the node level summary instead of single gene data. Each node there is labeled or named after the most representative member gene.

Please check the pathview tutorial (page 7) and documentations for more details:
http://bioconductor.org/packages/rel.../pathview.html
http://bioconductor.org/packages/rel...c/pathview.pdf
bigmw is offline   Reply With Quote
Old 03-19-2014, 08:00 AM   #40
shriram
Member
 
Location: UK

Join Date: May 2010
Posts: 13
Default Fold change ploting

Thank you for your reply, now I understand why there are differences at node level data and actual fold changes.
However when I want to compare the fold change that I see in the input data and on the KEGG pathway it doesn't correlate well.
e.g.
pv.out.list[[1]]
Gene A:
T1 T2 T3 T4 T5
0.66 4.1079 0.830 3.278 2.71

input fold change values:
Gene A
-1.2 0.7 -0.14 -0.48 -0.78

Color coding for this gene on pathway node:

#EF3030 #FF0000 #FF0000 #FF0000 #FF0000

which doesn't reflect the trend seen in the input data.

When I change node.sum="median" it little bit shows same trend as input fold changes.
pv.out.list[[1]]
-1.15 0.7626 0.000 1.387 1.534
#00FF00 #EF3030 #BEBEBE #FF0000 #FF0000

It is bit confusing for me.


Thanks
shriram is offline   Reply With Quote
Reply

Tags
gene set, pathway analysis, r/bioconductor, rna-seq, visualization

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:15 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO