![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
dbSNP XML file to CSV conversion | tushardave | Bioinformatics | 2 | 05-12-2014 09:59 AM |
How to obtain gene ids from an .xml file | Hel | Bioinformatics | 2 | 04-08-2014 09:09 AM |
Parsing a GenBank file | Rob2012 | Bioinformatics | 3 | 11-12-2013 01:49 PM |
get mate pair info from xml file | c_ro87 | Bioinformatics | 0 | 02-22-2012 08:10 AM |
Help!! need a copy of config.xml file from GAIIx run | liux | Illumina/Solexa | 2 | 11-04-2011 07:46 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
I am trying the pathway analysis from this workflow (1). I recieve error that parsing some file failed! Other question I have is, if it is enough to just change the species="hsa" to species="spo" for s. pombe or I need to change the codes upstream also? (1) RNA-Seq Data Pathway and Gene-set Analysis Workflows - Weijun Luo.
Code:
> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, param=param) > dim(gnCnt) [1] 7017 2 > pfh1.cnts=assay(gnCnt) > cnts=pfh1.cnts > dim(cnts) [1] 7017 2 > sel.rn=rowSums(cnts) !=0 > cnts=cnts[sel.rn,] > dim(cnts) [1] 5968 2 > library(DESeq2) Loading required package: Rcpp Loading required package: RcppArmadillo > grp.idx <- rep(c("cotnrol", "experiment") + > grp.idx <- rep(c("cotnrol", "experiment")) > coldat=DataFrame(grp=factor(grp.idx)) > dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design= ~ grp) > dds <- DESeq(dds) estimating size factors estimating dispersions same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing > deseq2.res <- results(dds) > deseq2.fc=deseq2.res$log2FoldChange > names(deseq2.fc)=rownames(deseq2.res) > exp.fc=deseq2.fc > out.suffix="deseq2" > require(gage) Loading required package: gage > kg.spo=kegg.gsets("spo") > fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL) > sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"]) > path.ids <- rownames(fc.kegg.p$greater)[sel] > sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[,"q.val"]) > path.ids.1 <- rownames(fc.kegg.p$less)[sel.l] > path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8) Error in substr(c(path.ids, path.ids.l), 1, 8) : error in evaluating the argument 'x' in selecting a method for function 'substr': Error: object 'path.ids.l' not found > path.ids2 <- substr(c(path.ids, path.ids.1), 1, 8) > require(pathview) Loading required package: pathview Loading required package: KEGGgraph Loading required package: XML Loading required package: graph Attaching package: ‘graph’ The following object is masked from ‘package:XML’: addNode The following object is masked from ‘package:Biostrings’: complement Loading required package: org.Hs.eg.db Loading required package: DBI ############################################################################## Pathview is an open source software package distributed under GNU General Public License version 3 (GPLv3). Details of GPLv3 is available at http://www.gnu.org/licenses/gpl-3.0.html. The pathview downloads and uses KEGG data. Academic users may freely use the KEGG website at http://www.kegg.jp/ or its mirror site at GenomeNet http://www.genome.jp/kegg/. Academic users may also freely link to the KEGG website. Non-academic users may use the KEGG website as end users for non-commercial purposes, but any other use requires a license agreement (details at http://www.kegg.jp/kegg/legal.html). ############################################################################## > pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id = pid, species= "spo", out.suffix=out.suffix)) Getting gene ID data from KEGG... Done with data retrieval! [1] "Downloading xml files for spoNA, 1/1 pathways.." [1] "Downloading png files for spoNA, 1/1 pathways.." Download of spoNA xml and png files failed! Failed to download KEGG xml/png files, spoNA skipped! Getting gene ID data from KEGG... Done with data retrieval! Start tag expected, '<' not found Parsing ./spoNA.xml file failed, please check the file! Getting gene ID data from KEGG... Done with data retrieval! Start tag expected, '<' not found Parsing ./spoNA.xml file failed, please check the file! Warning message: In download.file(png.url, png.target, quiet = T, mode = "wb") : cannot open: HTTP status was '404 Not Found' |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: France Join Date: Apr 2010
Posts: 143
|
![]()
cross posted: https://www.biostars.org/p/103871/
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
Yes! I posted it in both forums since I need help! Is it against rules or something?! Are seqanswers.com and biostars.org related somehow?!
|
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
The files you downloaded are named spoNA.xml hence your path.ids2 are all NA’s. In other words, no real pathways were selected in your gage analysis step. You gage analysis had this problem because you used kg.spo which is a list containing gene set data but not the gene set to use.
What you did: kg.spo=kegg.gsets("spo") fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL) You process the right gene set data out before calling gage: kg.spo=kegg.gsets("spo") kegg.gs=kg. spo$kg.sets[kg. spo$sigmet.idx] fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL) … You can check the details on your gene set data: names(kg.spo) lapply(kg.spo, head, 3) lapply(kegg.gs[1:3], head, 3) you may also check the function documentation: ?kegg.gsets Please always pay attention to the species and ID type matching of your own data, gene set or pathway data as documented in gage and pathview packages. Actually you will find everything in the pacakge tutorials and documentations for functions you work with like gage or pathview etc: http://bioconductor.org/packages/rel...html/gage.html http://bioconductor.org/packages/rel.../pathview.html A similar problem with gene set data was posted before: http://seqanswers.com/forums/showthr...4679&page=2#24 |
![]() |
![]() |
![]() |
#5 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
Thanks for help, as I understood I should write the script as follow, which again I end up with the same error! I just copied anything in case you see something wrong in middle!
Code:
> require(gage) Loading required package: gage > kg.spo=kegg.gsets("spo") > kegg.gs=kg.spo$kg.sets[kg.spo$sigmet.idx] > fc.kegg.p <- gage(exp.fc, gsets=kegg.gs, ref=NULL, samp=NULL) > sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"]) > path.ids <- rownames(fc.kegg.p$greater)[sel] > sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[, "q.val"]) > path.ids.l <- rownames(fc.kegg.p$less)[sel.l] > path.ids2 <- substr(c(path.ids, path.ids.l),1 ,8) > require(pathview) Loading required package: pathview Loading required package: KEGGgraph Loading required package: XML Loading required package: graph Attaching package: ‘graph’ The following object is masked from ‘package:XML’: addNode The following object is masked from ‘package:Biostrings’: complement Loading required package: org.Hs.eg.db Loading required package: DBI ############################################################################## Pathview is an open source software package distributed under GNU General Public License version 3 (GPLv3). Details of GPLv3 is available at http://www.gnu.org/licenses/gpl-3.0.html. The pathview downloads and uses KEGG data. Academic users may freely use the KEGG website at http://www.kegg.jp/ or its mirror site at GenomeNet http://www.genome.jp/kegg/. Academic users may also freely link to the KEGG website. Non-academic users may use the KEGG website as end users for non-commercial purposes, but any other use requires a license agreement (details at http://www.kegg.jp/kegg/legal.html). ############################################################################## > pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix)) Getting gene ID data from KEGG... Done with data retrieval! [1] "Downloading xml files for spoNA, 1/1 pathways.." [1] "Downloading png files for spoNA, 1/1 pathways.." Download of spoNA xml and png files failed! Failed to download KEGG xml/png files, spoNA skipped! Getting gene ID data from KEGG... Done with data retrieval! Start tag expected, '<' not found Parsing ./spoNA.xml file failed, please check the file! Getting gene ID data from KEGG... Done with data retrieval! Start tag expected, '<' not found Parsing ./spoNA.xml file failed, please check the file! Warning message: In download.file(png.url, png.target, quiet = T, mode = "wb") : cannot open: HTTP status was '404 Not Found' |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
Why don’t you run the following lines before your call pathview function. And then post the output here. We will see what happened.
lapply(kegg.gs[1:3], head, 3) length(kegg.gs) head(exp.fc) str(exp.fc) head(path.ids) head(path.ids2) |
![]() |
![]() |
![]() |
#7 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
Ok, these are the outputs, where am I doing it wrong?!
Code:
> lapply(kegg.gs[1:3], head, 3) $`spo00970 Aminoacyl-tRNA biosynthesis` [1] "SPAC17A5.15c" "SPAC1805.09c" "SPAC23A1.12c" $`spo02010 ABC transporters` [1] "SPAC15A10.01" "SPBC25B2.02c" "SPCC663.03" $`spo03008 Ribosome biogenesis in eukaryotes` [1] "SPAC10F6.10" "SPAC12G12.06c" "SPAC1486.09" > length(kegg.gs) [1] 99 > head(exp.fc) SPAC1002.01 SPAC1002.02 SPAC1002.03c SPAC1002.04c SPAC1002.05c SPAC1002.06c -0.53922325 -0.26937946 -0.15941217 0.09840531 -0.28588415 -0.30728702 > str(exp.fc) Named num [1:5968] -0.5392 -0.2694 -0.1594 0.0984 -0.2859 ... - attr(*, "names")= chr [1:5968] "SPAC1002.01" "SPAC1002.02" "SPAC1002.03c" "SPAC1002.04c" ... > head(path.ids) character(0) > head(path.ids2) character(0) |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
no pathways were selected as path.ids has 0 length. Can you post the output from:
head(fc.kegg.p$greater) |
![]() |
![]() |
![]() |
#9 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]() Code:
> head(fc.kegg.p$greater) p.geomean stat.mean p.val spo03013 RNA transport 0.002445005 2.852610 0.002445005 spo00330 Arginine and proline metabolism 0.007452659 2.521399 0.007452659 spo00300 Lysine biosynthesis 0.009222205 2.592031 0.009222205 spo03020 RNA polymerase 0.011308357 2.356088 0.011308357 spo00240 Pyrimidine metabolism 0.014016117 2.225798 0.014016117 spo00010 Glycolysis / Gluconeogenesis 0.018634680 2.133161 0.018634680 q.val set.size exp1 spo03013 RNA transport 0.1515903 90 0.002445005 spo00330 Arginine and proline metabolism 0.1737999 27 0.007452659 spo00300 Lysine biosynthesis 0.1737999 10 0.009222205 spo03020 RNA polymerase 0.1737999 25 0.011308357 spo00240 Pyrimidine metabolism 0.1737999 57 0.014016117 spo00010 Glycolysis / Gluconeogenesis 0.1925584 32 0.018634680 |
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
Your gage step did work. You output above shows the top (most significant) pathways based on GAGE analysis. (You may want to check whether these pathways make sense biologically.)
However, none of them have q-value less than 0.1, hence you selected no significant pathways based on this criteria. This is common due to the noise level, your sample size etc. You just need to use less stringent criteria to call significant pathways. Use q-value cutoff 0.2 instead of 0.1 as in the lines below: sel <- fc.kegg.p$greater[, "q.val"] < 0.2 & !is.na(fc.kegg.p$greater[, "q.val"]) sel.l <- fc.kegg.p$less[, "q.val"] < 0.2 & !is.na(fc.kegg.p$less[, "q.val"]) BTW, I would suggest you get your self familiar with R syntax and really understand the basics of gage/pathview packages. And you will find your work gets much easier. |
![]() |
![]() |
![]() |
#11 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
Thanks a million bigmw, it worked! At least it downloaded the pathways I mean, but still there are some issues. Regarding R I am learning it, but still it's a very big subject that the pace of learning it is lower than its application. Bioinformatic is not my main field of studies, but I am trying to learn it. However, could you do a favour and see if you know what can be the source of the problem I am having here:
Code:
Getting gene ID data from KEGG... Done with data retrieval! [1] "Downloading xml files for spo03013, 1/1 pathways.." [1] "Downloading png files for spo03013, 1/1 pathways.." No of the genes or compounds mapped to the pathway! Argument gene.idtype or cpd.idtype may be wrong. Working in directory /home/parham/pombe/working_dir/gage03 Writing image file spo03013.deseq2.png Last edited by Parharn; 06-20-2014 at 02:02 AM. |
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
The output message suggests that arguement gene.idtype may be wrong. The gene IDs in both your input data and your gene set data are KEGG gene IDs (KEGG uses Locus tag in this case). So you need to set that as below:
pv.out.list <- sapply(path.ids2, function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix, gene.idtype="KEGG")) Note that you need to do path.ids2 instead of path.ids2[1:3] in the code above to include all selected pathways rather than the top 3 only (copy&paste from the demo example). In pathview function, please look into the argument gene.idtype. Default gene.idtype="entrez",i.e. Entrez Gene, which are the primary KEGG gene ID for many common model organisms. For other species, gene.idtype should be set to "KEGG" as KEGG use other types of gene IDs. For more details check the help info: ?pathview more details are also available in section 7.5 Working with species of the tutorial: http://www.bioconductor.org/packages...c/pathview.pdf |
![]() |
![]() |
![]() |
#13 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
Thanks a million bigmw, it made it work. I had read the pathview tutorial before but I couldn't figure out that part very well! You made it clear to me. And yes about the path.ids2 I could figure that out also, but thanks for reminding it. Without your help these steps could have take much much longer! Brilliant!
|
![]() |
![]() |
![]() |
#14 |
Member
Location: Sweden Join Date: Jul 2013
Posts: 84
|
![]()
I have one more question! In part 6.5 of "RNA-seqworkflow" working on cufflinks data. how should I change the ID conversion line? Should I leave it as it is or KEGG?
Code:
> gnames.eg=pathview::id2eg(gnames, category ="symbol") |
![]() |
![]() |
![]() |
#15 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
Here is a relevant thread you started previously:
http://seqanswers.com/forums/showthread.php?t=41489#5 You need to download gene_info.gz from NCBI and follow the instructions in the old thread. Again, pathview::id2eg only works for major species with annotation package in Bioconductor, not for S. pombe. For details: data(bods, package="pathview"); bods |
![]() |
![]() |
![]() |
#16 |
Senior Member
Location: US Join Date: Aug 2013
Posts: 123
|
![]()
BTW, if the gene IDs look similar to the KEGG gene ID (locus tag), like "SPAC15A10.01" "SPBC25B2.02c" etc, you don’t need to convert the gene IDs. Because they are already the same as those used in the KEGG pathways or gene set data (kegg.gs). you can check the format of the gene ID of cufflinks output:
head(gnames) |
![]() |
![]() |
![]() |
Tags |
kegg, pathway analysis, s.pombe |
Thread Tools | |
|
|