  • gene sets for metabolism could not be used by gage/pathview!

    Hi bigmw,
    With your kind help, I have finished the pathway analysis of signal transduction and found two disturbed pathways. But I ran into a problem with the metabolism pathway analysis: I could not find any disturbed pathway with the metabolism gene sets from KEGG. Stranger still, when I used the whole collection of honeybee gene sets from KEGG, including the met/sig gene sets, the two disturbed signaling pathways could not be found either. Some people told me that this results from having more gene sets, which increases the p-values of each gene set. Do you have any idea about this strange problem?
    Thanks a lot!!
    Richard
    Last edited by wmseq; 12-07-2013, 06:45 AM.

  • #2
    Just want it to be seen now.


    • #3
      It is normal that you don’t get any significant calls in a pathway analysis (with multiple pathways/tests) because none of the p-values (or q-values) is small enough. Very likely, there is not enough testing power with your data given its sample size, noise level and experiment quality. The adjusted p-values (or q-values) would be different when the total number of tests/pathways changes.
      With your current dataset, you may do 2 things:
      -Loosen the selection criteria (q-value cutoff): the option cutoff = 0.1 in the sigGeneSet function and q.cutoff = 0.1 in the gagePipe function can be set to a bigger value, say 0.2.
      -Change the gene set size filter so that more pathways are actually tested: the argument in the gage function is set.size = c(10, 500). You can set it to set.size = c(10, 2000) or even set.size = c(10, Inf).
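
To make the point about the number of tests concrete, here is a minimal base R sketch. It assumes a Benjamini-Hochberg style adjustment (via p.adjust) and uses invented p-values, not gage output: the same two raw p-values pass a 0.1 cutoff in a small collection of gene sets but not in a large one.

```r
# Toy illustration (base R only, invented numbers): the same raw p-values get
# larger BH-adjusted values when tested alongside more gene sets.
p.signal <- c(0.001, 0.003)                 # two hypothetical signaling pathways
p.other  <- seq(0.05, 1, length.out = 150)  # 150 hypothetical unaffected pathways

# Case 1: small collection (the two pathways plus 20 others, 22 tests)
q.small <- p.adjust(c(p.signal, p.other[1:20]), method = "BH")[1:2]

# Case 2: full collection (the two pathways plus all 150 others, 152 tests)
q.large <- p.adjust(c(p.signal, p.other), method = "BH")[1:2]

print(rbind(q.small, q.large))
```

In this toy case both adjusted values are below 0.1 with 22 tests and above 0.1 with 152 tests, although the raw p-values are unchanged.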

      Some general suggestions that would help new users like you to use gage/pathview smoothly:
      -know basic statistics
      -get familiar with R/computer systems
      -go through the package Reference Manuals/tutorials (and papers), and learn the basics of the gage/pathview methods and packages
      GAGE is a published method for gene set (enrichment or GSEA) or pathway analysis. GAGE is generally applicable independent of microarray or RNA-Seq data attributes including sample sizes, experimental designs, assay platforms, and other types of heterogeneity, and consistently achieves superior performance over other frequently used methods. In the gage package, we provide functions for basic GAGE analysis, result processing and presentation. We have also built pipeline routines for multiple GAGE analyses in a batch, comparison between parallel analyses, and combined analysis of heterogeneous data from different sources/studies. In addition, we provide demo microarray data and commonly used gene set data based on KEGG pathways and GO terms. These functions and data are also useful for gene set analysis using other methods.

      Pathview is a tool set for pathway based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need to do is supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and renders the pathway graph with the mapped data. In addition, Pathview also seamlessly integrates with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.


      • #4
        Hi bigmw,
        I am sorry to ask you two more questions based on your answer.
        I did my pathway analysis according to the protocol in the paper
        “RNA-Seq Data Pathway and Gene-set Analysis Workflow”. The command
        I used, “fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)”,
        calls the gage function, so I can certainly set the set size according to your advice, but the following documentation on gene set size confuses me.

        gage
        Set size:
        gene set size (number of genes) range to be considered for
        enrichment test. Tests for too small or too big gene sets are not
        robust statistically or informative biologically. Default to be
        set.size = c(10, 500).

        As I understand it, a gene set with too few or too many genes
        is not advisable. How many genes in a gene set are
        suitable? Could I divide a big gene set into smaller sets, and then do the pathway
        analysis using each of them to get better results?

        As to setting the cutoff of sigGeneSet, it seems that sigGeneSet is not
        used in the protocol. The only command using “q.val” is “sel <-
        fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[,
        "q.val"])”. Therefore, I just need to change < 0.1 to < 0.2, right?

        Happy holiday!


        • #5
          Richard, thoughtful questions are always welcome. However, please be careful when making strong yet misleading claims/titles like “gene sets for metabolism could not be used by gage/pathview!”

          For your questions:
          You certainly don’t want to use gene sets that are too small. Big gene sets (like several thousand genes) are fine as long as they are not close to the size of the background (the list of all genes). When a gene set is that close to the background, the test statistics (against the background) become less meaningful. But you won’t get false positives in that case, hence it is not bad to set set.size=c(10, Inf) when needed. I don’t think it is good to split a big gene set into smaller sets as you suggested.
          You don’t have to use the sigGeneSet (or gagePipe) function. You can select significant gene sets using the code line from the RNA-Seq Workflows, e.g.
          fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[,"q.val"])
          just change 0.1 to 0.2 or other proper cutoff value.
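
As a small self-contained sketch of that selection step, the base R snippet below applies the same q-value filter to a made-up matrix standing in for fc.kegg.p$greater. The matrix values and the helper select_sets are hypothetical and not part of gage.

```r
# Hypothetical stand-in for fc.kegg.p$greater: one row per gene set with a
# "q.val" column (all values invented for illustration).
greater <- cbind(
  p.val = c(0.001, 0.004, 0.030, NA),
  q.val = c(0.080, 0.150, 0.400, NA)
)
rownames(greater) <- c("pathA", "pathB", "pathC", "pathD")

# Helper (not part of gage): names of gene sets passing a q-value cutoff,
# skipping gene sets that were not tested (NA q-value).
select_sets <- function(res, q.cutoff = 0.1) {
  keep <- !is.na(res[, "q.val"]) & res[, "q.val"] < q.cutoff
  rownames(res)[keep]
}

sel.01 <- select_sets(greater, 0.1)  # stricter cutoff
sel.02 <- select_sets(greater, 0.2)  # looser cutoff
print(sel.01)
print(sel.02)
```

With these invented values, only pathA passes at 0.1, while pathA and pathB pass at 0.2. The untested set (NA) is dropped at either cutoff.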


          • #6
            bigmw,
            Sorry about the title. I tried to change it, but the change does not show up in the thread title in the forum listing, and I don't know why.
            Last edited by wmseq; 12-07-2013, 06:53 AM.


            • #7
              No problem. Thanks!


              • #8
                Hi bigmw,
                After I ran the command for pathway analysis, I got some strange messages, as follows:

                > pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data = exp.fc, pathway.id = pid, species ="ame", out.suffix = out.suffix))
                No annotation package for the species ame, gene symbols not mapped!
                Working in directory /home/wenfu/CAseqanalysis
                Writing image file ame04745.edger.png
                No annotation package for the species ame, gene symbols not mapped!
                Working in directory /home/wenfu/CAseqanalysis
                Writing image file ame04391.edger.png
                Start tag expected, '<' not found

                In fact, the title of the ame04391 pathway that I got is "Hippo signaling pathway- fly".


                What do "No annotation package for the species ame" and "Start tag expected, '<' not found" mean? In addition, where is the annotation package from?

                Thanks!

                Richard
                Last edited by wmseq; 12-09-2013, 10:30 AM.


                • #9
                  Very likely, your input data exp.fc has the wrong gene ID type, or you specified the wrong ID type. You may check the pathview function documentation and look into the gene.idtype argument:
                  ?pathview

                  If you are not sure, within your analysis R session, do:
                  head(exp.fc)
                  And post the output here.
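
Beyond eyeballing head(exp.fc), a few basic checks can be scripted. The sketch below is base R only. The helper check_gene_data is hypothetical (not part of pathview), the example values are invented, and the all-digits test is only a heuristic for "looks like Entrez Gene IDs".

```r
# Hypothetical helper (not part of pathview): basic checks on a named numeric
# vector before passing it as gene.data.
check_gene_data <- function(x) {
  ids <- names(x)
  list(
    is.numeric   = is.numeric(x),
    has.names    = !is.null(ids),
    # Heuristic: Entrez Gene IDs are integers, so names should be all digits
    all.digits   = !is.null(ids) && all(grepl("^[0-9]+$", ids)),
    n.duplicated = sum(duplicated(ids)),
    n.missing    = sum(is.na(x))
  )
}

# Invented example values; names are Entrez-style numeric IDs.
exp.fc.demo <- c("409677" = 5.56, "552035" = 4.13, "552471" = -3.61)
str(check_gene_data(exp.fc.demo))

# Made-up non-numeric IDs fail the all.digits check.
bad.demo <- c("GB12345" = 1.2, "GB67890" = -0.8)
print(check_gene_data(bad.demo)$all.digits)
```

If all.digits is FALSE or has.names is FALSE, the ID type is the first thing to look at.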


                  • #10
                    Hi bigmw,
                    As you know, I used beebase gene IDs at the beginning to do the pathway analysis. With your help, I changed those IDs to Entrez Gene IDs and did the analysis in R as follows:

                    > degene_data = read.csv("CDade.genes_and_gene_id.csv", header = TRUE)
                    > test<-subset(degene_data, GeneID!="NA")
                    > edger.fc = test$logFC
                    > names(edger.fc) = test$GeneID
                    > exp.fc=edger.fc
                    > out.suffix="edger"
                    > head(exp.fc,16)
                       409677 100576979 100577819    552035    413550    413908    552471    552829
                     5.557823  4.667221  4.516693  4.127615  3.986429  3.937341 -3.605962 -3.556323
                       406115    409345    552773 100577132 100578863    726617 100577669 100576152
                     3.446378 -3.404650 -3.368612 -3.127761 -3.063663 -2.949202  2.939236  2.877981

                    As you can see, in each pair of rows of the head(exp.fc) output, the first row is the Entrez Gene ID and the second is the fold change (FC). If there is a problem, I am afraid it is that I started from a .csv file produced by the ID conversion instead of from et. Do you think that is possible?
                    One more question: what is the purpose of the out.suffix="edger" command?
                    Thanks a lot!!

                    Richard
                    Last edited by wmseq; 12-10-2013, 08:05 AM.


                    • #11
                      Since I don’t have access to your data, I ran a similar example using simulated honey bee data and your target pathway, as below. Pathview has a function, sim.mol.data, for data simulation. Note that I specified id.type="entrez" and gene.idtype="entrez" explicitly for clarity below, but these are the defaults and hence not really needed. I got a perfect pathview graph. I suspect that this is again a problem SPECIFIC to your system; similar problems have already happened many times on your computer. My suggestions:
                      -Start a new and clean R session and re-run your analysis. If you still have the problem, try running my example below and see if that works.
                      -Please make sure you have updated R/Bioconductor. In the meantime, make sure you have your computer cleaned up completely, as I’ve suggested before.

                      > ame.dat <- sim.mol.data(mol.type="gene",id.type="entrez",species="ame",nmol=5000)
                      > head(ame.dat)
                          409241     408547     413271  100576790     411735     412008
                       0.7390165 -2.1501213  0.8217849  1.6537538 -0.5823098 -0.7743898
                      > pv.out <- pathview(gene.data = ame.dat, gene.idtype="entrez",
                      + pathway.id = "04391", species = "ame", out.suffix = "ame")
                      [1] "Downloading xml files for ame04391, 1/1 pathways.."
                      [1] "Downloading png files for ame04391, 1/1 pathways.."
                      No annotation package for the species ame, gene symbols not mapped!
                      Working in directory /xxxx/xxx/xxx/
                      Writing image file ame04391.ame.png
                      # Note here “No annotation package for the species ame, gene symbols not mapped!” is a warning message for minor species, nothing has been wrong.
                      Last edited by bigmw; 12-11-2013, 05:33 PM.


                      • #12
                        Thank you very much, bigmw!!
                        Maybe you remember my question from 12-09-2013:
                        > pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data = exp.fc, pathway.id = pid, species ="ame", out.suffix = out.suffix))
                        No annotation package for the species ame, gene symbols not mapped!
                        Working in directory /home/wenfu/CAseqanalysis
                        Writing image file ame04745.edger.png
                        No annotation package for the species ame, gene symbols not mapped!
                        Working in directory /home/wenfu/CAseqanalysis
                        Writing image file ame04391.edger.png
                        Start tag expected, '<' not found

                        In fact, I got two significantly perturbed pathways using the gage package. I doubted my results because I got the warning message "No annotation package for the species ame, gene symbols not mapped!"; therefore, I sent you my question about the warning message. The package worked very well for my analysis, and one of the two pathways can be used to support my hypothesis.

                        Happy holiday!!!

                        Richard


                        • #13
                          I am glad you finally made it and got good results. One more note: gage/pathview has been extensively tested by the Bioconductor daily building/checking processes and is widely used around the world. I am sure you have become more confident with them after working through your problems.
                          I see you have an extra message, “Start tag expected, '<' not found”. This was not reproducible. It shouldn’t be pathview output, as pathview worked normally in your last analysis. I still think there is something problematic with your computer or analysis session.


                          • #14
                            Hi bigmw,
                            As to “Start tag expected, '<' not found”, although I am not sure where it came from during my analysis session, I suspect the way exp.fc was created. As you know, in the protocol it is created from et while running edgeR in Linux, whereas I created it in R from a .csv file, because I needed to convert beebase gene IDs into Entrez Gene IDs. Is that possible?
                            Sincerely,
                            Richard


                            • #15
                              Biomedical Informatics Postdoctoral Fellowship in Statistical Genetics and Transcript

                              A postdoctoral training position is currently available in Dr. Gary H. Gibbons’ Cardiovascular Cluster in the Inherited Disease Branch, Cardiovascular Cluster (IDB-CC) of the National Human Genome Research Institute (NHGRI). The successful candidate is expected to join an established Cardiovascular Disease Cluster team, which is currently comprised of biomedical informatics analysts, physicians, nurses, research assistants, computer science and engineer staff. Additionally, the candidate will work closely with our sister lab in the IDB-CC that focuses on population epidemiology with staff consisting of a Principal Investigator and Senior Population Epidemiologist, and five additional population/genetic epidemiologists.
                              The ongoing projects in Dr. Gibbons’ IDB-CC use biomedical informatics and systems biology approaches to integrate data from platforms such as next generation sequencing for the identification of genetic variation (SNPs, indels/CNVs, splice variants, tandem repeats and admixture mapping etc.) and transcriptome variation (gene expression, GWAS, microRNA, and methylation) between ancestral populations with cardiovascular disease or other complex diseases. Our lab uses these high-throughput technologies to identify, categorize and evaluate genomic to phenomic relationships that contribute to prevalence, severity, host natural resistance and treatment responsiveness of minority populations with cardiovascular disease (CVD).
                              Qualified candidates should be highly motivated and have, or be close to obtaining by the job start date, an MD and/or Ph.D. with a focus in computational biology, statistical genetics, mathematics, bioinformatics, epigenetics or a related field. The successful candidate should have experience in analyzing high-throughput genomic data, proficiency in at least one programming language (Perl, Java, R, Ruby, SAS, or C/C++) and be very familiar with omics data dimensionality reduction utilizing statistical applications such as Plink, R GNU, Bioconductor and MATLAB. Good understanding of systems biology and familiarity with gene-gene interaction modeling and clustering with applications such as Ingenuity and GeneGO are desirable. Applicants must possess good communication skills and be fluent in both spoken and written English. Funding is available to support this position for up to five years. Salary is based on the NIH standard. The candidate will have the opportunity to access many high throughput datasets and to interact with the investigators at the National Institutes of Health and other academic and science based institutions.
                              Interested applicants should submit curriculum vitae, a detailed letter of interest, and the names of three potential referees to Dr. Adam R. Davis, at [email protected] or to the address below.

                              Adam R. Davis, Ph.D.
                              Cardiovascular Cluster
                              Inherited Disease Branch
                              National Human Genome Research Institute
                              Building 10, Room 7N321
                              Bethesda, Maryland 20892

                              DHHS and NIH are Equal Opportunity Employers and encourage applications from
                              women and minorities.
