Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Parsing ./spoNA.xml file failed, please check the file!

    I am trying the pathway analysis from this workflow (1). I recieve error that parsing some file failed! Other question I have is, if it is enough to just change the species="hsa" to species="spo" for s. pombe or I need to change the codes upstream also? (1) RNA-Seq Data Pathway and Gene-set Analysis Workflows - Weijun Luo.

    Code:
    > gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand=TRUE, param=param)
    > dim(gnCnt)
    [1] 7017    2
    > pfh1.cnts=assay(gnCnt)
    > cnts=pfh1.cnts
    > dim(cnts)
    [1] 7017    2
    > sel.rn=rowSums(cnts) !=0
    > cnts=cnts[sel.rn,]
    > dim(cnts)
    [1] 5968    2
    > library(DESeq2)
    Loading required package: Rcpp
    Loading required package: RcppArmadillo
    > grp.idx <- rep(c("cotnrol", "experiment")
    + 
    > grp.idx <- rep(c("cotnrol", "experiment"))
    > coldat=DataFrame(grp=factor(grp.idx))
    > dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design= ~ grp)
    > dds <- DESeq(dds)
    estimating size factors
    estimating dispersions
    same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates
    gene-wise dispersion estimates
    mean-dispersion relationship
    final dispersion estimates
    fitting model and testing
    > deseq2.res <- results(dds)
    > deseq2.fc=deseq2.res$log2FoldChange
    > names(deseq2.fc)=rownames(deseq2.res)
    > exp.fc=deseq2.fc
    > out.suffix="deseq2"
    > require(gage)
    Loading required package: gage
    > kg.spo=kegg.gsets("spo")
    > fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL)
    > sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"])
    > path.ids <- rownames(fc.kegg.p$greater)[sel]
    > sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[,"q.val"])
    > path.ids.1 <- rownames(fc.kegg.p$less)[sel.l]
    > path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)
    Error in substr(c(path.ids, path.ids.l), 1, 8) : 
      error in evaluating the argument 'x' in selecting a method for function 'substr': Error: object 'path.ids.l' not found
    > path.ids2 <- substr(c(path.ids, path.ids.1), 1, 8)
    > require(pathview)
    Loading required package: pathview
    Loading required package: KEGGgraph
    Loading required package: XML
    Loading required package: graph
    
    Attaching package: ‘graph’
    
    The following object is masked from ‘package:XML’:
    
        addNode
    
    The following object is masked from ‘package:Biostrings’:
    
        complement
    
    Loading required package: org.Hs.eg.db
    Loading required package: DBI
    
    ##############################################################################
    Pathview is an open source software package distributed under GNU General
    Public License version 3 (GPLv3). Details of GPLv3 is available at
    http://www.gnu.org/licenses/gpl-3.0.html.
    
    The pathview downloads and uses KEGG data. Academic users may freely use the
    KEGG website at http://www.kegg.jp/ or its mirror site at GenomeNet
    http://www.genome.jp/kegg/. Academic users may also freely link to the KEGG
    website. Non-academic users may use the KEGG website as end users for
    non-commercial purposes, but any other use requires a license agreement
    (details at http://www.kegg.jp/kegg/legal.html).
    ##############################################################################
    > pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id = pid, species= "spo", out.suffix=out.suffix))
    Getting gene ID data from KEGG...
    Done with data retrieval!
    [1] "Downloading xml files for spoNA, 1/1 pathways.."
    [1] "Downloading png files for spoNA, 1/1 pathways.."
    Download of spoNA xml and png files failed!
    Failed to download KEGG xml/png files, spoNA skipped!
    Getting gene ID data from KEGG...
    Done with data retrieval!
    Start tag expected, '<' not found
    Parsing ./spoNA.xml file failed, please check the file!
    Getting gene ID data from KEGG...
    Done with data retrieval!
    Start tag expected, '<' not found
    Parsing ./spoNA.xml file failed, please check the file!
    Warning message:
    In download.file(png.url, png.target, quiet = T, mode = "wb") :
      cannot open: HTTP status was '404 Not Found'

  • #2
    cross posted: https://www.biostars.org/p/103871/

    Comment


    • #3
      Yes! I posted it in both forums since I need help! Is it against rules or something?! Are seqanswers.com and biostars.org related somehow?!

      Comment


      • #4
        The files you downloaded are named spoNA.xml hence your path.ids2 are all NA’s. In other words, no real pathways were selected in your gage analysis step. You gage analysis had this problem because you used kg.spo which is a list containing gene set data but not the gene set to use.
        What you did:
        kg.spo=kegg.gsets("spo")
        fc.kegg.p <- gage(exp.fc, gsets=kg.spo, ref=NULL, samp=NULL)

        You process the right gene set data out before calling gage:
        kg.spo=kegg.gsets("spo")
        kegg.gs=kg. spo$kg.sets[kg. spo$sigmet.idx]
        fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)


        You can check the details on your gene set data:
        names(kg.spo)
        lapply(kg.spo, head, 3)
        lapply(kegg.gs[1:3], head, 3)

        you may also check the function documentation:
        ?kegg.gsets

        Please always pay attention to the species and ID type matching of your own data, gene set or pathway data as documented in gage and pathview packages. Actually you will find everything in the pacakge tutorials and documentations for functions you work with like gage or pathview etc:
        GAGE is a published method for gene set (enrichment or GSEA) or pathway analysis. GAGE is generally applicable independent of microarray or RNA-Seq data attributes including sample sizes, experimental designs, assay platforms, and other types of heterogeneity, and consistently achieves superior performance over other frequently used methods. In gage package, we provide functions for basic GAGE analysis, result processing and presentation. We have also built pipeline routines for of multiple GAGE analyses in a batch, comparison between parallel analyses, and combined analysis of heterogeneous data from different sources/studies. In addition, we provide demo microarray data and commonly used gene set data based on KEGG pathways and GO terms. These funtions and data are also useful for gene set analysis using other methods.

        Pathview is a tool set for pathway based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need is to supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and render pathway graph with the mapped data. In addition, Pathview also seamlessly integrates with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.


        A similar problem with gene set data was posted before:

        Comment


        • #5
          Thanks for help, as I understood I should write the script as follow, which again I end up with the same error! I just copied anything in case you see something wrong in middle!

          Code:
          > require(gage)
          Loading required package: gage
          > kg.spo=kegg.gsets("spo")
          > kegg.gs=kg.spo$kg.sets[kg.spo$sigmet.idx]
          > fc.kegg.p <- gage(exp.fc, gsets=kegg.gs, ref=NULL, samp=NULL)
          > sel <- fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[, "q.val"])
          > path.ids <- rownames(fc.kegg.p$greater)[sel]
          > sel.l <- fc.kegg.p$less[, "q.val"] < 0.1 & !is.na(fc.kegg.p$less[, "q.val"])
          > path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
          > path.ids2 <- substr(c(path.ids, path.ids.l),1 ,8)
          > require(pathview)
          Loading required package: pathview
          Loading required package: KEGGgraph
          Loading required package: XML
          Loading required package: graph
          
          Attaching package: ‘graph’
          
          The following object is masked from ‘package:XML’:
          
              addNode
          
          The following object is masked from ‘package:Biostrings’:
          
              complement
          
          Loading required package: org.Hs.eg.db
          Loading required package: DBI
          
          ##############################################################################
          Pathview is an open source software package distributed under GNU General
          Public License version 3 (GPLv3). Details of GPLv3 is available at
          [url]http://www.gnu.org/licenses/gpl-3.0.html[/url].
          
          The pathview downloads and uses KEGG data. Academic users may freely use the
          KEGG website at [url]http://www.kegg.jp/[/url] or its mirror site at GenomeNet
          [url]http://www.genome.jp/kegg/[/url]. Academic users may also freely link to the KEGG
          website. Non-academic users may use the KEGG website as end users for
          non-commercial purposes, but any other use requires a license agreement
          (details at [url]http://www.kegg.jp/kegg/legal.html[/url]).
          ##############################################################################
          > pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix))
          Getting gene ID data from KEGG...
          Done with data retrieval!
          [1] "Downloading xml files for spoNA, 1/1 pathways.."
          [1] "Downloading png files for spoNA, 1/1 pathways.."
          Download of spoNA xml and png files failed!
          Failed to download KEGG xml/png files, spoNA skipped!
          Getting gene ID data from KEGG...
          Done with data retrieval!
          Start tag expected, '<' not found
          Parsing ./spoNA.xml file failed, please check the file!
          Getting gene ID data from KEGG...
          Done with data retrieval!
          Start tag expected, '<' not found
          Parsing ./spoNA.xml file failed, please check the file!
          Warning message:
          In download.file(png.url, png.target, quiet = T, mode = "wb") :
            cannot open: HTTP status was '404 Not Found'

          Comment


          • #6
            Why don’t you run the following lines before your call pathview function. And then post the output here. We will see what happened.

            lapply(kegg.gs[1:3], head, 3)
            length(kegg.gs)
            head(exp.fc)
            str(exp.fc)
            head(path.ids)
            head(path.ids2)

            Comment


            • #7
              Ok, these are the outputs, where am I doing it wrong?!

              Code:
              > lapply(kegg.gs[1:3], head, 3)
              $`spo00970 Aminoacyl-tRNA biosynthesis`
              [1] "SPAC17A5.15c" "SPAC1805.09c" "SPAC23A1.12c"
              
              $`spo02010 ABC transporters`
              [1] "SPAC15A10.01" "SPBC25B2.02c" "SPCC663.03"  
              
              $`spo03008 Ribosome biogenesis in eukaryotes`
              [1] "SPAC10F6.10"   "SPAC12G12.06c" "SPAC1486.09"
              
              > length(kegg.gs)
              [1] 99
              > head(exp.fc)
               SPAC1002.01  SPAC1002.02 SPAC1002.03c SPAC1002.04c SPAC1002.05c SPAC1002.06c 
               -0.53922325  -0.26937946  -0.15941217   0.09840531  -0.28588415  -0.30728702 
              > str(exp.fc)
               Named num [1:5968] -0.5392 -0.2694 -0.1594 0.0984 -0.2859 ...
               - attr(*, "names")= chr [1:5968] "SPAC1002.01" "SPAC1002.02" "SPAC1002.03c" "SPAC1002.04c" ...
              > head(path.ids)
              character(0)
              > head(path.ids2)
              character(0)

              Comment


              • #8
                no pathways were selected as path.ids has 0 length. Can you post the output from:
                head(fc.kegg.p$greater)

                Comment


                • #9
                  Code:
                  > head(fc.kegg.p$greater)
                                                             p.geomean stat.mean       p.val
                  spo03013 RNA transport                   0.002445005  2.852610 0.002445005
                  spo00330 Arginine and proline metabolism 0.007452659  2.521399 0.007452659
                  spo00300 Lysine biosynthesis             0.009222205  2.592031 0.009222205
                  spo03020 RNA polymerase                  0.011308357  2.356088 0.011308357
                  spo00240 Pyrimidine metabolism           0.014016117  2.225798 0.014016117
                  spo00010 Glycolysis / Gluconeogenesis    0.018634680  2.133161 0.018634680
                                                               q.val set.size        exp1
                  spo03013 RNA transport                   0.1515903       90 0.002445005
                  spo00330 Arginine and proline metabolism 0.1737999       27 0.007452659
                  spo00300 Lysine biosynthesis             0.1737999       10 0.009222205
                  spo03020 RNA polymerase                  0.1737999       25 0.011308357
                  spo00240 Pyrimidine metabolism           0.1737999       57 0.014016117
                  spo00010 Glycolysis / Gluconeogenesis    0.1925584       32 0.018634680

                  Comment


                  • #10
                    Your gage step did work. You output above shows the top (most significant) pathways based on GAGE analysis. (You may want to check whether these pathways make sense biologically.)
                    However, none of them have q-value less than 0.1, hence you selected no significant pathways based on this criteria. This is common due to the noise level, your sample size etc. You just need to use less stringent criteria to call significant pathways. Use q-value cutoff 0.2 instead of 0.1 as in the lines below:
                    sel <- fc.kegg.p$greater[, "q.val"] < 0.2 & !is.na(fc.kegg.p$greater[, "q.val"])
                    sel.l <- fc.kegg.p$less[, "q.val"] < 0.2 & !is.na(fc.kegg.p$less[, "q.val"])

                    BTW, I would suggest you get your self familiar with R syntax and really understand the basics of gage/pathview packages. And you will find your work gets much easier.

                    Comment


                    • #11
                      Thanks a million bigmw, it worked! At least it downloaded the pathways I mean, but still there are some issues. Regarding R I am learning it, but still it's a very big subject that the pace of learning it is lower than its application. Bioinformatic is not my main field of studies, but I am trying to learn it. However, could you do a favour and see if you know what can be the source of the problem I am having here:

                      Code:
                      Getting gene ID data from KEGG...
                      Done with data retrieval!
                      [1] "Downloading xml files for spo03013, 1/1 pathways.."
                      [1] "Downloading png files for spo03013, 1/1 pathways.."
                      No of the genes or compounds mapped to the pathway!
                      Argument gene.idtype or cpd.idtype may be wrong.
                      Working in directory /home/parham/pombe/working_dir/gage03
                      Writing image file spo03013.deseq2.png
                      This happens for all pathways! Here is a link to the .png of downloaded pathways. There are just the pathways with no up/down regulation!
                      Last edited by Parharn; 06-20-2014, 01:02 AM.

                      Comment


                      • #12
                        The output message suggests that arguement gene.idtype may be wrong. The gene IDs in both your input data and your gene set data are KEGG gene IDs (KEGG uses Locus tag in this case). So you need to set that as below:
                        pv.out.list <- sapply(path.ids2, function(pid) pathview(gene.data= exp.fc, pathway.id=pid, species="spo", out.suffix=outsuffix, gene.idtype="KEGG"))

                        Note that you need to do path.ids2 instead of path.ids2[1:3] in the code above to include all selected pathways rather than the top 3 only (copy&paste from the demo example).

                        In pathview function, please look into the argument gene.idtype. Default gene.idtype="entrez",i.e. Entrez Gene, which are the primary KEGG gene ID for many common model organisms. For other species, gene.idtype should be set to "KEGG" as KEGG use other types of gene IDs. For more details check the help info:
                        ?pathview
                        more details are also available in section 7.5 Working with species of the tutorial:

                        Comment


                        • #13
                          Thanks a million bigmw, it made it work. I had read the pathview tutorial before but I couldn't figure out that part very well! You made it clear to me. And yes about the path.ids2 I could figure that out also, but thanks for reminding it. Without your help these steps could have take much much longer! Brilliant!

                          Comment


                          • #14
                            I have one more question! In part 6.5 of "RNA-seqworkflow" working on cufflinks data. how should I change the ID conversion line? Should I leave it as it is or KEGG?

                            Code:
                            > gnames.eg=pathview::id2eg(gnames, category ="symbol")

                            Comment


                            • #15
                              Here is a relevant thread you started previously:
                              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

                              You need to download gene_info.gz from NCBI and follow the instructions in the old thread.

                              Again, pathview::id2eg only works for major species with annotation package in Bioconductor, not for S. pombe. For details:
                              data(bods, package="pathview"); bods

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              31 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X