SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
dbSNP XML file to CSV conversion tushardave Bioinformatics 2 05-12-2014 09:59 AM
How to obtain gene ids from an .xml file Hel Bioinformatics 2 04-08-2014 09:09 AM
Parsing a GenBank file Rob2012 Bioinformatics 3 11-12-2013 01:49 PM
get mate pair info from xml file c_ro87 Bioinformatics 0 02-22-2012 08:10 AM
Help!! need a copy of config.xml file from GAIIx run liux Illumina/Solexa 2 11-04-2011 07:46 AM

Reply
 
Thread Tools
Old 06-18-2014, 01:21 AM   #1
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default Parsing ./spoNA.xml file failed, please check the file!

I am trying the pathway analysis from this workflow (1). I recieve error that parsing some file failed! Other question I have is, if it is enough to just change the species="hsa" to species="spo" for s. pombe or I need to change the codes upstream also? (1) RNA-Seq Data Pathway and Gene-set Analysis Workflows - Weijun Luo.

Code:
> gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, param=param)
> dim(gnCnt)
[1] 7017    2
> pfh1.cnts=assay(gnCnt)
> cnts=pfh1.cnts
> dim(cnts)
[1] 7017    2
> sel.rn=rowSums(cnts) !=0
> cnts=cnts[sel.rn,]
> dim(cnts)
[1] 5968    2
> library(DESeq2)
Loading required package: Rcpp
Loading required package: RcppArmadillo
> grp.idx <- rep(c("cotnrol", "experiment")
+ 
> grp.idx <- rep(c("cotnrol", "experiment"))
> coldat=DataFrame(grp=factor(grp.idx))
> dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design= ~ grp)
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> deseq2.res <- results(dds)
> deseq2.fc=deseq2.res$log2FoldChange
> names(deseq2.fc)=rownames(deseq2.res)
> exp.fc=deseq2.fc
> out.suffix="deseq2"
> require(gage)
Loading required package: gage
> kg.spo=kegg.gsets("spo")
> fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[,"q.val"])
> path.ids.1 <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
Error in substr(c(path.ids, path.ids.l), 1, 8) : 
  error in evaluating the argument 'x' in selecting a method for function 'substr': Error: object 'path.ids.l' not found
> path.ids2 <- substr(c(path.ids, path.ids.1), 1, 8)
> require(pathview)
Loading required package: pathview
Loading required package: KEGGgraph
Loading required package: XML
Loading required package: graph

Attaching package: ‘graph’

The following object is masked from ‘package:XML’:

    addNode

The following object is masked from ‘package:Biostrings’:

    complement

Loading required package: org.Hs.eg.db
Loading required package: DBI

##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html.

The pathview downloads and uses KEGG data. Academic users may freely use the
KEGG website at http://www.kegg.jp/ or its mirror site at GenomeNet
http://www.genome.jp/kegg/. Academic users may also freely link to the KEGG
website. Non-academic users may use the KEGG website as end users for
non-commercial purposes, but any other use requires a license agreement
(details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id = pid, species= "spo", out.suffix=out.suffix))
Getting gene ID data from KEGG...
Done with data retrieval!
[1] "Downloading xml files for spoNA, 1/1 pathways.."
[1] "Downloading png files for spoNA, 1/1 pathways.."
Download of spoNA xml and png files failed!
Failed to download KEGG xml/png files, spoNA skipped!
Getting gene ID data from KEGG...
Done with data retrieval!
Start tag expected, '<' not found
Parsing ./spoNA.xml file failed, please check the file!
Getting gene ID data from KEGG...
Done with data retrieval!
Start tag expected, '<' not found
Parsing ./spoNA.xml file failed, please check the file!
Warning message:
In download.file(png.url, png.target, quiet = T, mode = "wb") :
  cannot open: HTTP status was '404 Not Found'
Parharn is offline   Reply With Quote
Old 06-18-2014, 02:13 AM   #2
lindenb
Senior Member
 
Location: France

Join Date: Apr 2010
Posts: 143
Default

cross posted: https://www.biostars.org/p/103871/
lindenb is offline   Reply With Quote
Old 06-18-2014, 02:23 AM   #3
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

Yes! I posted it in both forums since I need help! Is it against rules or something?! Are seqanswers.com and biostars.org related somehow?!
Parharn is offline   Reply With Quote
Old 06-18-2014, 11:30 AM   #4
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

The files you downloaded are named spoNA.xml hence your path.ids2 are all NA’s. In other words, no real pathways were selected in your gage analysis step. You gage analysis had this problem because you used kg.spo which is a list containing gene set data but not the gene set to use.
What you did:
kg.spo=kegg.gsets("spo")
fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL)

You process the right gene set data out before calling gage:
kg.spo=kegg.gsets("spo")
kegg.gs=kg. spo$kg.sets[kg. spo$sigmet.idx]
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)


You can check the details on your gene set data:
names(kg.spo)
lapply(kg.spo, head, 3)
lapply(kegg.gs[1:3], head, 3)

you may also check the function documentation:
?kegg.gsets

Please always pay attention to the species and ID type matching of your own data, gene set or pathway data as documented in gage and pathview packages. Actually you will find everything in the pacakge tutorials and documentations for functions you work with like gage or pathview etc:
http://bioconductor.org/packages/rel...html/gage.html
http://bioconductor.org/packages/rel.../pathview.html

A similar problem with gene set data was posted before:
http://seqanswers.com/forums/showthr...4679&page=2#24
bigmw is offline   Reply With Quote
Old 06-19-2014, 02:18 AM   #5
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

Thanks for help, as I understood I should write the script as follow, which again I end up with the same error! I just copied anything in case you see something wrong in middle!

Code:
> require(gage)
Loading required package: gage
> kg.spo=kegg.gsets("spo")
> kegg.gs=kg.spo$kg.sets[kg.spo$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets=kegg.gs, ref=NULL, samp=NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l),1 ,8)
> require(pathview)
Loading required package: pathview
Loading required package: KEGGgraph
Loading required package: XML
Loading required package: graph

Attaching package: ‘graph’

The following object is masked from ‘package:XML’:

    addNode

The following object is masked from ‘package:Biostrings’:

    complement

Loading required package: org.Hs.eg.db
Loading required package: DBI

##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
http://www.gnu.org/licenses/gpl-3.0.html.

The pathview downloads and uses KEGG data. Academic users may freely use the
KEGG website at http://www.kegg.jp/ or its mirror site at GenomeNet
http://www.genome.jp/kegg/. Academic users may also freely link to the KEGG
website. Non-academic users may use the KEGG website as end users for
non-commercial purposes, but any other use requires a license agreement
(details at http://www.kegg.jp/kegg/legal.html).
##############################################################################
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix))
Getting gene ID data from KEGG...
Done with data retrieval!
[1] "Downloading xml files for spoNA, 1/1 pathways.."
[1] "Downloading png files for spoNA, 1/1 pathways.."
Download of spoNA xml and png files failed!
Failed to download KEGG xml/png files, spoNA skipped!
Getting gene ID data from KEGG...
Done with data retrieval!
Start tag expected, '<' not found
Parsing ./spoNA.xml file failed, please check the file!
Getting gene ID data from KEGG...
Done with data retrieval!
Start tag expected, '<' not found
Parsing ./spoNA.xml file failed, please check the file!
Warning message:
In download.file(png.url, png.target, quiet = T, mode = "wb") :
  cannot open: HTTP status was '404 Not Found'
Parharn is offline   Reply With Quote
Old 06-19-2014, 11:27 AM   #6
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Why don’t you run the following lines before your call pathview function. And then post the output here. We will see what happened.

lapply(kegg.gs[1:3], head, 3)
length(kegg.gs)
head(exp.fc)
str(exp.fc)
head(path.ids)
head(path.ids2)
bigmw is offline   Reply With Quote
Old 06-19-2014, 12:48 PM   #7
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

Ok, these are the outputs, where am I doing it wrong?!

Code:
> lapply(kegg.gs[1:3], head, 3)
$`spo00970 Aminoacyl-tRNA biosynthesis`
[1] "SPAC17A5.15c" "SPAC1805.09c" "SPAC23A1.12c"

$`spo02010 ABC transporters`
[1] "SPAC15A10.01" "SPBC25B2.02c" "SPCC663.03"  

$`spo03008 Ribosome biogenesis in eukaryotes`
[1] "SPAC10F6.10"   "SPAC12G12.06c" "SPAC1486.09"

> length(kegg.gs)
[1] 99
> head(exp.fc)
 SPAC1002.01  SPAC1002.02 SPAC1002.03c SPAC1002.04c SPAC1002.05c SPAC1002.06c 
 -0.53922325  -0.26937946  -0.15941217   0.09840531  -0.28588415  -0.30728702 
> str(exp.fc)
 Named num [1:5968] -0.5392 -0.2694 -0.1594 0.0984 -0.2859 ...
 - attr(*, "names")= chr [1:5968] "SPAC1002.01" "SPAC1002.02" "SPAC1002.03c" "SPAC1002.04c" ...
> head(path.ids)
character(0)
> head(path.ids2)
character(0)
Parharn is offline   Reply With Quote
Old 06-19-2014, 02:36 PM   #8
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

no pathways were selected as path.ids has 0 length. Can you post the output from:
head(fc.kegg.p$greater)
bigmw is offline   Reply With Quote
Old 06-19-2014, 03:14 PM   #9
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

Code:
> head(fc.kegg.p$greater)
                                           p.geomean stat.mean       p.val
spo03013 RNA transport                   0.002445005  2.852610 0.002445005
spo00330 Arginine and proline metabolism 0.007452659  2.521399 0.007452659
spo00300 Lysine biosynthesis             0.009222205  2.592031 0.009222205
spo03020 RNA polymerase                  0.011308357  2.356088 0.011308357
spo00240 Pyrimidine metabolism           0.014016117  2.225798 0.014016117
spo00010 Glycolysis / Gluconeogenesis    0.018634680  2.133161 0.018634680
                                             q.val set.size        exp1
spo03013 RNA transport                   0.1515903       90 0.002445005
spo00330 Arginine and proline metabolism 0.1737999       27 0.007452659
spo00300 Lysine biosynthesis             0.1737999       10 0.009222205
spo03020 RNA polymerase                  0.1737999       25 0.011308357
spo00240 Pyrimidine metabolism           0.1737999       57 0.014016117
spo00010 Glycolysis / Gluconeogenesis    0.1925584       32 0.018634680
Parharn is offline   Reply With Quote
Old 06-19-2014, 07:25 PM   #10
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Your gage step did work. You output above shows the top (most significant) pathways based on GAGE analysis. (You may want to check whether these pathways make sense biologically.)
However, none of them have q-value less than 0.1, hence you selected no significant pathways based on this criteria. This is common due to the noise level, your sample size etc. You just need to use less stringent criteria to call significant pathways. Use q-value cutoff 0.2 instead of 0.1 as in the lines below:
sel <- fc.kegg.p$greater[, "q.val"] < 0.2 & !is.na(fc.kegg.p$greater[, "q.val"])
sel.l <- fc.kegg.p$less[, "q.val"] < 0.2 & !is.na(fc.kegg.p$less[, "q.val"])

BTW, I would suggest you get your self familiar with R syntax and really understand the basics of gage/pathview packages. And you will find your work gets much easier.
bigmw is offline   Reply With Quote
Old 06-20-2014, 01:59 AM   #11
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

Thanks a million bigmw, it worked! At least it downloaded the pathways I mean, but still there are some issues. Regarding R I am learning it, but still it's a very big subject that the pace of learning it is lower than its application. Bioinformatic is not my main field of studies, but I am trying to learn it. However, could you do a favour and see if you know what can be the source of the problem I am having here:

Code:
Getting gene ID data from KEGG...
Done with data retrieval!
[1] "Downloading xml files for spo03013, 1/1 pathways.."
[1] "Downloading png files for spo03013, 1/1 pathways.."
No of the genes or compounds mapped to the pathway!
Argument gene.idtype or cpd.idtype may be wrong.
Working in directory /home/parham/pombe/working_dir/gage03
Writing image file spo03013.deseq2.png
This happens for all pathways! Here is a link to the .png of downloaded pathways. There are just the pathways with no up/down regulation!

Last edited by Parharn; 06-20-2014 at 02:02 AM.
Parharn is offline   Reply With Quote
Old 06-20-2014, 04:41 PM   #12
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

The output message suggests that arguement gene.idtype may be wrong. The gene IDs in both your input data and your gene set data are KEGG gene IDs (KEGG uses Locus tag in this case). So you need to set that as below:
pv.out.list <- sapply(path.ids2, function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix, gene.idtype="KEGG"))

Note that you need to do path.ids2 instead of path.ids2[1:3] in the code above to include all selected pathways rather than the top 3 only (copy&paste from the demo example).

In pathview function, please look into the argument gene.idtype. Default gene.idtype="entrez",i.e. Entrez Gene, which are the primary KEGG gene ID for many common model organisms. For other species, gene.idtype should be set to "KEGG" as KEGG use other types of gene IDs. For more details check the help info:
?pathview
more details are also available in section 7.5 Working with species of the tutorial:
http://www.bioconductor.org/packages...c/pathview.pdf
bigmw is offline   Reply With Quote
Old 06-21-2014, 05:33 AM   #13
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

Thanks a million bigmw, it made it work. I had read the pathview tutorial before but I couldn't figure out that part very well! You made it clear to me. And yes about the path.ids2 I could figure that out also, but thanks for reminding it. Without your help these steps could have take much much longer! Brilliant!
Parharn is offline   Reply With Quote
Old 06-22-2014, 05:55 AM   #14
Parharn
Member
 
Location: Sweden

Join Date: Jul 2013
Posts: 84
Default

I have one more question! In part 6.5 of "RNA-seqworkflow" working on cufflinks data. how should I change the ID conversion line? Should I leave it as it is or KEGG?

Code:
> gnames.eg=pathview::id2eg(gnames, category ="symbol")
Parharn is offline   Reply With Quote
Old 06-23-2014, 10:09 AM   #15
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

Here is a relevant thread you started previously:
http://seqanswers.com/forums/showthread.php?t=41489#5
You need to download gene_info.gz from NCBI and follow the instructions in the old thread.

Again, pathview::id2eg only works for major species with annotation package in Bioconductor, not for S. pombe. For details:
data(bods, package="pathview"); bods
bigmw is offline   Reply With Quote
Old 06-24-2014, 11:19 AM   #16
bigmw
Senior Member
 
Location: US

Join Date: Aug 2013
Posts: 123
Default

BTW, if the gene IDs look similar to the KEGG gene ID (locus tag), like "SPAC15A10.01" "SPBC25B2.02c" etc, you don’t need to convert the gene IDs. Because they are already the same as those used in the KEGG pathways or gene set data (kegg.gs). you can check the format of the gene ID of cufflinks output:
head(gnames)


Quote:
Originally Posted by Parharn View Post
I have one more question! In part 6.5 of "RNA-seqworkflow" working on cufflinks data. how should I change the ID conversion line? Should I leave it as it is or KEGG?

Code:
> gnames.eg=pathview::id2eg(gnames, category ="symbol")
bigmw is offline   Reply With Quote
Reply

Tags
kegg, pathway analysis, s.pombe

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:34 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO