Seqanswers Leaderboard Ad

**lindenb** · 06-18-2014, 01:13 AM

cross posted: https://www.biostars.org/p/103871/

**Parharn** · 06-18-2014, 01:23 AM

Yes! I posted it in both forums since I need help! Is it against rules or something?! Are seqanswers.com and biostars.org related somehow?!

**bigmw** · 06-18-2014, 10:30 AM

The files you downloaded are named spoNA.xml hence your path.ids2 are all NA’s. In other words, no real pathways were selected in your gage analysis step. You gage analysis had this problem because you used kg.spo which is a list containing gene set data but not the gene set to use.
What you did:
kg.spo=kegg.gsets("spo")
fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL)

You process the right gene set data out before calling gage:
kg.spo=kegg.gsets("spo")
kegg.gs=kg. spo$kg.sets[kg. spo$sigmet.idx]
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)
…

You can check the details on your gene set data:
names(kg.spo)
lapply(kg.spo, head, 3)
lapply(kegg.gs[1:3], head, 3)

you may also check the function documentation:
?kegg.gsets

Please always pay attention to the species and ID type matching of your own data, gene set or pathway data as documented in gage and pathview packages. Actually you will find everything in the pacakge tutorials and documentations for functions you work with like gage or pathview etc:

gage

http://bioconductor.org/packages/release/bioc/html/gage.html

GAGE is a published method for gene set (enrichment or GSEA) or pathway analysis. GAGE is generally applicable independent of microarray or RNA-Seq data attributes including sample sizes, experimental designs, assay platforms, and other types of heterogeneity, and consistently achieves superior performance over other frequently used methods. In gage package, we provide functions for basic GAGE analysis, result processing and presentation. We have also built pipeline routines for of multiple GAGE analyses in a batch, comparison between parallel analyses, and combined analysis of heterogeneous data from different sources/studies. In addition, we provide demo microarray data and commonly used gene set data based on KEGG pathways and GO terms. These funtions and data are also useful for gene set analysis using other methods.

pathview

http://bioconductor.org/packages/release/bioc/html/pathview.html

Pathview is a tool set for pathway based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need is to supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and render pathway graph with the mapped data. In addition, Pathview also seamlessly integrates with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.

A similar problem with gene set data was posted before:

RNA-Seq Pathway and Gene-set Analysis Workflows in R/Bioconductor with GAGE/Pathview - SEQanswers

http://seqanswers.com/forums/showthread.php?t=34679&page=2#24

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

**Parharn** · 06-19-2014, 01:18 AM

Thanks for help, as I understood I should write the script as follow, which again I end up with the same error! I just copied anything in case you see something wrong in middle!

Code:

> require(gage)
Loading required package: gage
> kg.spo=kegg.gsets("spo")
> kegg.gs=kg.spo$kg.sets[kg.spo$sigmet.idx]
> fc.kegg.p <- gage(exp.fc, gsets=kegg.gs, ref=NULL, samp=NULL)
> sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"])
> path.ids <- rownames(fc.kegg.p$greater)[sel]
> sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[, "q.val"])
> path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
> path.ids2 <- substr(c(path.ids, path.ids.l),1 ,8)
> require(pathview)
Loading required package: pathview
Loading required package: KEGGgraph
Loading required package: XML
Loading required package: graph

Attaching package: ‘graph’

The following object is masked from ‘package:XML’:

    addNode

The following object is masked from ‘package:Biostrings’:

    complement

Loading required package: org.Hs.eg.db
Loading required package: DBI

##############################################################################
Pathview is an open source software package distributed under GNU General
Public License version 3 (GPLv3). Details of GPLv3 is available at
[url]http://www.gnu.org/licenses/gpl-3.0.html[/url].

The pathview downloads and uses KEGG data. Academic users may freely use the
KEGG website at [url]http://www.kegg.jp/[/url] or its mirror site at GenomeNet
[url]http://www.genome.jp/kegg/[/url]. Academic users may also freely link to the KEGG
website. Non-academic users may use the KEGG website as end users for
non-commercial purposes, but any other use requires a license agreement
(details at [url]http://www.kegg.jp/kegg/legal.html[/url]).
##############################################################################
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix))
Getting gene ID data from KEGG...
Done with data retrieval!
[1] "Downloading xml files for spoNA, 1/1 pathways.."
[1] "Downloading png files for spoNA, 1/1 pathways.."
Download of spoNA xml and png files failed!
Failed to download KEGG xml/png files, spoNA skipped!
Getting gene ID data from KEGG...
Done with data retrieval!
Start tag expected, '<' not found
Parsing ./spoNA.xml file failed, please check the file!
Getting gene ID data from KEGG...
Done with data retrieval!
Start tag expected, '<' not found
Parsing ./spoNA.xml file failed, please check the file!
Warning message:
In download.file(png.url, png.target, quiet = T, mode = "wb") :
  cannot open: HTTP status was '404 Not Found'

**bigmw** · 06-19-2014, 10:27 AM

Why don’t you run the following lines before your call pathview function. And then post the output here. We will see what happened.

lapply(kegg.gs[1:3], head, 3)
length(kegg.gs)
head(exp.fc)
str(exp.fc)
head(path.ids)
head(path.ids2)

**Parharn** · 06-19-2014, 11:48 AM

Ok, these are the outputs, where am I doing it wrong?!

Code:

> lapply(kegg.gs[1:3], head, 3)
$`spo00970 Aminoacyl-tRNA biosynthesis`
[1] "SPAC17A5.15c" "SPAC1805.09c" "SPAC23A1.12c"

$`spo02010 ABC transporters`
[1] "SPAC15A10.01" "SPBC25B2.02c" "SPCC663.03"  

$`spo03008 Ribosome biogenesis in eukaryotes`
[1] "SPAC10F6.10"   "SPAC12G12.06c" "SPAC1486.09"

> length(kegg.gs)
[1] 99
> head(exp.fc)
 SPAC1002.01  SPAC1002.02 SPAC1002.03c SPAC1002.04c SPAC1002.05c SPAC1002.06c 
 -0.53922325  -0.26937946  -0.15941217   0.09840531  -0.28588415  -0.30728702 
> str(exp.fc)
 Named num [1:5968] -0.5392 -0.2694 -0.1594 0.0984 -0.2859 ...
 - attr(*, "names")= chr [1:5968] "SPAC1002.01" "SPAC1002.02" "SPAC1002.03c" "SPAC1002.04c" ...
> head(path.ids)
character(0)
> head(path.ids2)
character(0)

**bigmw** · 06-19-2014, 01:36 PM

no pathways were selected as path.ids has 0 length. Can you post the output from:
head(fc.kegg.p$greater)

**Parharn** · 06-19-2014, 02:14 PM

Code:

> head(fc.kegg.p$greater)
                                           p.geomean stat.mean       p.val
spo03013 RNA transport                   0.002445005  2.852610 0.002445005
spo00330 Arginine and proline metabolism 0.007452659  2.521399 0.007452659
spo00300 Lysine biosynthesis             0.009222205  2.592031 0.009222205
spo03020 RNA polymerase                  0.011308357  2.356088 0.011308357
spo00240 Pyrimidine metabolism           0.014016117  2.225798 0.014016117
spo00010 Glycolysis / Gluconeogenesis    0.018634680  2.133161 0.018634680
                                             q.val set.size        exp1
spo03013 RNA transport                   0.1515903       90 0.002445005
spo00330 Arginine and proline metabolism 0.1737999       27 0.007452659
spo00300 Lysine biosynthesis             0.1737999       10 0.009222205
spo03020 RNA polymerase                  0.1737999       25 0.011308357
spo00240 Pyrimidine metabolism           0.1737999       57 0.014016117
spo00010 Glycolysis / Gluconeogenesis    0.1925584       32 0.018634680

**bigmw** · 06-19-2014, 06:25 PM

Your gage step did work. You output above shows the top (most significant) pathways based on GAGE analysis. (You may want to check whether these pathways make sense biologically.)
However, none of them have q-value less than 0.1, hence you selected no significant pathways based on this criteria. This is common due to the noise level, your sample size etc. You just need to use less stringent criteria to call significant pathways. Use q-value cutoff 0.2 instead of 0.1 as in the lines below:
sel <- fc.kegg.p$greater[, "q.val"] < 0.2 & !is.na(fc.kegg.p$greater[, "q.val"])
sel.l <- fc.kegg.p$less[, "q.val"] < 0.2 & !is.na(fc.kegg.p$less[, "q.val"])

BTW, I would suggest you get your self familiar with R syntax and really understand the basics of gage/pathview packages. And you will find your work gets much easier.

**Parharn** · 06-20-2014, 12:59 AM

Thanks a million bigmw, it worked! At least it downloaded the pathways I mean, but still there are some issues. Regarding R I am learning it, but still it's a very big subject that the pace of learning it is lower than its application. Bioinformatic is not my main field of studies, but I am trying to learn it. However, could you do a favour and see if you know what can be the source of the problem I am having here:

Code:

Getting gene ID data from KEGG...
Done with data retrieval!
[1] "Downloading xml files for spo03013, 1/1 pathways.."
[1] "Downloading png files for spo03013, 1/1 pathways.."
No of the genes or compounds mapped to the pathway!
Argument gene.idtype or cpd.idtype may be wrong.
Working in directory /home/parham/pombe/working_dir/gage03
Writing image file spo03013.deseq2.png

This happens for all pathways! Here is a link to the .png of downloaded pathways. There are just the pathways with no up/down regulation!

**bigmw** · 06-20-2014, 03:41 PM

The output message suggests that arguement gene.idtype may be wrong. The gene IDs in both your input data and your gene set data are KEGG gene IDs (KEGG uses Locus tag in this case). So you need to set that as below:
pv.out.list <- sapply(path.ids2, function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix, gene.idtype="KEGG"))

Note that you need to do path.ids2 instead of path.ids2[1:3] in the code above to include all selected pathways rather than the top 3 only (copy&paste from the demo example).

In pathview function, please look into the argument gene.idtype. Default gene.idtype="entrez",i.e. Entrez Gene, which are the primary KEGG gene ID for many common model organisms. For other species, gene.idtype should be set to "KEGG" as KEGG use other types of gene IDs. For more details check the help info:
?pathview
more details are also available in section 7.5 Working with species of the tutorial:

http://www.bioconductor.org/packages/release/bioc/vignettes/pathview/inst/doc/pathview.pdf

**Parharn** · 06-21-2014, 04:33 AM

Thanks a million bigmw, it made it work. I had read the pathview tutorial before but I couldn't figure out that part very well! You made it clear to me. And yes about the path.ids2 I could figure that out also, but thanks for reminding it. Without your help these steps could have take much much longer! Brilliant!

**Parharn** · 06-22-2014, 04:55 AM

I have one more question! In part 6.5 of "RNA-seqworkflow" working on cufflinks data. how should I change the ID conversion line? Should I leave it as it is or KEGG?

Code:

> gnames.eg=pathview::id2eg(gnames, category ="symbol")

**bigmw** · 06-23-2014, 09:09 AM

Here is a relevant thread you started previously:

Entrez ID for GAGE - SEQanswers

http://seqanswers.com/forums/showthread.php?t=41489#5

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

You need to download gene_info.gz from NCBI and follow the instructions in the old thread.

Again, pathview::id2eg only works for major species with annotation package in Bioconductor, not for S. pombe. For details:
data(bods, package="pathview"); bods

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Parsing ./spoNA.xml file failed, please check the file!

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News