
**wmseq** · 12-04-2013, 09:07 AM

Just want it to be seen now.

**bigmw** · 12-04-2013, 06:40 PM

It is normal that you don’t get any significant calls in a pathway analysis (with multiple pathways/tests) because none of the p-values (or q-values) is small enough. Very likely, there is not enough testing power with your data given its sample size, noise level and experiment quality. The adjusted p-values (or q-values) would be different when the total number of tests/pathways changes.
With your current dataset, you may do 2 things:
-Loosen the selection criteria (q-value cutoff): the option cutoff = 0.1 in the sigGeneSet function and q.cutoff = 0.1 in the gagePipe function can be set to a bigger value, say 0.2.
-Change the gene set size filter so that more pathways are actually tested: the argument in the gage function is set.size = c(10, 500). You can set it to set.size = c(10, 2000) or even set.size = c(10, Inf).
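
To illustrate why the adjusted values shift when the number of tests changes, here is a minimal base R sketch. It assumes a Benjamini-Hochberg style adjustment as done by p.adjust, and all p-values are made up for illustration, not taken from any gage run.

```r
# Made-up raw p-values for five pathways (illustration only).
p.small <- c(0.004, 0.02, 0.03, 0.2, 0.6)

# Adjust them as if only these five pathways had been tested.
q.small <- p.adjust(p.small, method = "BH")

# Same five pathways, now tested together with 45 clearly
# non-significant ones (evenly spaced made-up p-values).
p.large <- c(p.small, seq(0.3, 1, length.out = 45))
q.large <- p.adjust(p.large, method = "BH")[1:5]

# Compare how many of the original five pass a 0.1 cutoff.
print(rbind(q.small = q.small, q.large = q.large))
print(c(pass.small = sum(q.small < 0.1), pass.large = sum(q.large < 0.1)))
```

The raw p-values of the five pathways are identical in both runs; only the number of tests differs, and with it the adjusted values.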

Some general suggestions that would help new users like you to use gage/pathview smoothly:
-know basic statistics
-get familiar with R/computer systems
-go through the package Reference Manuals/tutorials (and papers), know the basics of gage/pathview method and packages

gage

http://bioconductor.org/packages/release/bioc/html/gage.html

GAGE is a published method for gene set (enrichment or GSEA) or pathway analysis. GAGE is generally applicable independent of microarray or RNA-Seq data attributes including sample sizes, experimental designs, assay platforms, and other types of heterogeneity, and consistently achieves superior performance over other frequently used methods. In the gage package, we provide functions for basic GAGE analysis, result processing and presentation. We have also built pipeline routines for multiple GAGE analyses in a batch, comparison between parallel analyses, and combined analysis of heterogeneous data from different sources/studies. In addition, we provide demo microarray data and commonly used gene set data based on KEGG pathways and GO terms. These functions and data are also useful for gene set analysis using other methods.

pathview

http://bioconductor.org/packages/release/bioc/html/pathview.html

Pathview is a tool set for pathway based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need to do is supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and renders the pathway graph with the mapped data. In addition, Pathview also seamlessly integrates with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.

**wmseq** · 12-06-2013, 07:02 AM

Hi bigmw,
I am sorry to ask you two more questions based on your answer.
I did my pathway analysis according to the protocol on the paper,
“RNA-Seq Data Pathway and Gene-set Analysis Workflow”. In the commands
used by me, “ fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL)”
contains the gage function, so I can certainly set the set size according to your advice, but the following information on gene set size confuses me.

gage
Set size:
gene set size (number of genes) range to be considered for
enrichment test. Tests for too small or too big gene sets are not
robust statistically or informative biologically. Default to be
set.size = c(10, 500).

According to my understanding of it, too small or too big number of genes
in a gene set is not advisable. How many genes in a gene set are
suitable? Could I divide a big gene set into smaller sets, and then do the pathway
analysis using each of them to get better results?

As to setting the cutoff of sigGeneSet, it seems that sigGeneSet is not
used in the protocol. The only command using “q.val” is “sel <-
fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[,
"q.val"])”. Therefore, I just need to change <0.1 to <0.2, right?

Happy holiday!

**bigmw** · 12-06-2013, 08:02 PM

Richard, thoughtful questions are always welcome. However, please be careful when making strong yet misleading claims/titles like “gene sets for metabolism could not be used by gage/pathview!”

For your questions:
You definitely don’t want to use gene sets that are too small. Big gene sets (like several thousand genes) are fine as long as their size is not close to the size of the background (the list of all genes). When a set does approach the background size, the test statistics (against background) become less meaningful, but you won’t get false positives, hence it is not bad to set set.size=c(10, Inf) when needed. I don’t think it is good to split a big gene set into smaller sets as you suggested.
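
A rough base R sketch of what a size filter like set.size does. It assumes the size of a set is counted as the number of its genes that are also present in the data, which is an assumption here and not stated in this thread. The gene sets, gene names and the helper in.range are all hypothetical.

```r
# Hypothetical gene sets of different sizes (gene names are made up).
gsets <- list(
  tiny   = paste0("g", 1:5),
  medium = paste0("g", 1:50),
  large  = paste0("g", 1:800)
)

# Hypothetical background: all genes measured in the experiment.
measured <- paste0("g", 1:1000)

# Keep the names of the sets whose measured size falls inside set.size.
in.range <- function(gsets, measured, set.size) {
  n <- vapply(gsets, function(gs) sum(gs %in% measured), integer(1))
  names(n)[n >= set.size[1] & n <= set.size[2]]
}

default.kept <- in.range(gsets, measured, c(10, 500))
relaxed.kept <- in.range(gsets, measured, c(10, Inf))

print(default.kept)
print(relaxed.kept)
```

Raising the upper bound only adds the large set to the tested sets; the tiny set stays excluded either way.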
You don’t have to use the sigGeneSet (or gagePipe) function; you can select significant gene sets using the code line as in the RNA-Seq Workflows, e.g.
fc.kegg.p$greater[, "q.val"] < 0.1 & !is.na(fc.kegg.p$greater[,"q.val"])
Just change 0.1 to 0.2 or another suitable cutoff value.
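
The selection line above can be tried without running gage at all. In this sketch fc.kegg.p is a mock list with a made-up greater matrix, the set names are invented, and select.sets is a hypothetical helper; only the "q.val" column name and the selection expression come from the thread.

```r
# Mock result: a matrix with a q.val column, one row per gene set.
# All numbers and set names are made up for illustration.
greater <- cbind(
  p.val = c(0.001, 0.010, 0.040, NA),
  q.val = c(0.004, 0.150, 0.300, NA)
)
rownames(greater) <- c("setA", "setB", "setC", "setD")
fc.kegg.p <- list(greater = greater)

# Same logic as the workflow line, with the cutoff as a parameter.
select.sets <- function(res, cutoff) {
  sel <- res[, "q.val"] < cutoff & !is.na(res[, "q.val"])
  rownames(res)[sel]
}

sig.01 <- select.sets(fc.kegg.p$greater, 0.1)
sig.02 <- select.sets(fc.kegg.p$greater, 0.2)

print(sig.01)
print(sig.02)
```

The !is.na() part matters: rows with a missing q-value would otherwise produce NA in the logical index.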

**wmseq** · 12-07-2013, 06:46 AM

bigmw,
Sorry about the title. I have tried to change it, but I don't know why I could not change it successfully: the change is not visible in the thread title shown outside the thread.

**bigmw** · 12-08-2013, 07:57 PM

No problem. Thanks!

**wmseq** · 12-09-2013, 08:34 AM

Hi bigmw,
After I ran the command for pathway analysis, I got a weird message as follows:

> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data = exp.fc, pathway.id = pid, species ="ame", out.suffix = out.suffix))
No annotation package for the species ame, gene symbols not mapped!
Working in directory /home/wenfu/CAseqanalysis
Writing image file ame04745.edger.png
No annotation package for the species ame, gene symbols not mapped!
Working in directory /home/wenfu/CAseqanalysis
Writing image file ame04391.edger.png
Start tag expected, '<' not found

In fact, the title of the ame04391 pathway I got is "Hippo signaling pathway- fly".

What do "No annotation package for the species ame" and "Start tag expected, '<' not found" mean? In addition, where is the annotation package from?

Thanks!

Richard

**bigmw** · 12-09-2013, 07:24 PM

Very likely, your input data exp.fc has the wrong gene ID type or you specified the wrong ID type. You may check the pathview function documentation and look into the gene.idtype argument:
?pathview

If you are not sure, within your analysis R session, do:
head(exp.fc)
And post the output here.

**wmseq** · 12-10-2013, 07:46 AM

Hi bigmw,
As you know, I used beebase gene IDs at the beginning to do pathway analysis. With your help, I changed those IDs to Entrez Gene IDs, and did the analysis in R as follows:

> degene_data = read.csv("CDade.genes_and_gene_id.csv", header = TRUE)
> test<-subset(degene_data, GeneID!="NA")
> edger.fc = test$logFC
> names(edger.fc) = test$GeneID
> exp.fc=edger.fc
> out.suffix="edger"
> head(exp.fc,16)
   409677 100576979 100577819    552035    413550    413908    552471    552829
 5.557823  4.667221  4.516693  4.127615  3.986429  3.937341 -3.605962 -3.556323
   406115    409345    552773 100577132 100578863    726617 100577669 100576152
 3.446378 -3.404650 -3.368612 -3.127761 -3.063663 -2.949202  2.939236  2.877981

As you can see, in each pair of rows of the head(exp.fc) output, the first row is the Entrez Gene ID and the second is the FC. If there is a problem, I am afraid it is that I started from a .csv file produced by the ID conversion instead of from et. Do you think that is possible?
One more question: what is the purpose of the out.suffix="edger" command?
Thanks a lot!!

Richard
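
For reference, the step of turning a table into a named fold-change vector can be sketched on its own. The data frame below is made up and stands in for the contents of the .csv file; only the column names GeneID and logFC follow the commands above. Missing IDs are dropped with is.na() rather than by comparing against the string "NA".

```r
# Made-up rows standing in for the .csv contents (illustration only).
degene_data <- data.frame(
  GeneID = c(409677L, NA, 552471L, 100576979L),
  logFC  = c(5.557823, 1.200000, -3.605962, 4.667221)
)

# Drop rows without an Entrez Gene ID.
keep <- !is.na(degene_data$GeneID)

# Named numeric vector: names are gene IDs, values are log fold changes.
exp.fc <- degene_data$logFC[keep]
names(exp.fc) <- as.character(degene_data$GeneID[keep])

# Quick sanity checks before passing the vector on.
print(head(exp.fc))
print(anyDuplicated(names(exp.fc)) == 0)
```

A vector built this way has the same shape whether the table came from a .csv file or from an object in the session.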

**bigmw** · 12-10-2013, 07:37 PM

Since I don’t have access to your data, I ran a similar example using simulated honey bee data and your target pathway as below. Pathview has a function, sim.mol.data, for data simulation. Note that I specified id.type="entrez" and gene.idtype="entrez" explicitly for clarity below, but these are the defaults, hence not really needed. I got a perfect pathview graph. I suspect that this is a problem SPECIFIC to your system again; similar problems have already happened many times on your computer. My suggestion:
-Start a new and clean R session and re-run your analysis. If you still have a problem, try to run my examples below and see if that works.
-Please make sure you have updated R/Bioconductor. In the meantime, make sure you have your computer cleaned up completely as I’ve suggested before.

> ame.dat <- sim.mol.data(mol.type="gene",id.type="entrez",species="ame",nmol=5000)
> head(ame.dat)
    409241     408547     413271  100576790     411735     412008
 0.7390165 -2.1501213  0.8217849  1.6537538 -0.5823098 -0.7743898
> pv.out <- pathview(gene.data = ame.dat, gene.idtype="entrez",
+ pathway.id = "04391", species = "ame", out.suffix = "ame")
[1] "Downloading xml files for ame04391, 1/1 pathways.."
[1] "Downloading png files for ame04391, 1/1 pathways.."
No annotation package for the species ame, gene symbols not mapped!
Working in directory /xxxx/xxx/xxx/
Writing image file ame04391.ame.png
# Note here “No annotation package for the species ame, gene symbols not mapped!” is a warning message for minor species, nothing went wrong.
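
On the out.suffix question: judging only from the file names in the logs in this thread (ame04391.edger.png and ame04391.ame.png), the output image appears to be named species + pathway ID + out.suffix. A tiny sketch of that naming pattern follows; expected.png is a hypothetical helper for illustration, not a pathview function.

```r
# Hypothetical helper: rebuild the output file name pattern seen in the logs.
expected.png <- function(species, pathway.id, out.suffix) {
  paste0(species, pathway.id, ".", out.suffix, ".png")
}

# The two combinations that appear in this thread.
name.edger <- expected.png("ame", "04391", "edger")
name.ame   <- expected.png("ame", "04391", "ame")

print(name.edger)
print(name.ame)
```

So the suffix seems to serve mainly to keep output files from different analyses apart.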

**wmseq** · 12-11-2013, 12:05 PM

Thank you very much, bigmw!!
Maybe you remember my question from 12-09-2013:
> pv.out.list <- sapply(path.ids2[1:3], function(pid) pathview(gene.data = exp.fc, pathway.id = pid, species ="ame", out.suffix = out.suffix))
No annotation package for the species ame, gene symbols not mapped!
Working in directory /home/wenfu/CAseqanalysis
Writing image file ame04745.edger.png
No annotation package for the species ame, gene symbols not mapped!
Working in directory /home/wenfu/CAseqanalysis
Writing image file ame04391.edger.png
Start tag expected, '<' not found

In fact, I got two significantly perturbed pathways using the gage package. I doubted my results because I got the warning message "No annotation package for the species ame, gene symbols not mapped"; therefore, I sent my question about the warning message to you. This package worked very well for my analysis, and one of the two pathways could be used to support my hypothesis.

Happy holiday!!!

Richard

**bigmw** · 12-11-2013, 05:34 PM

I am glad you finally made it and got good results. One more note, gage/pathview has been extensively tested by Bioconductor daily building/checking processes, and widely used by users over the world. I am sure you have become more confident with them after working through your problems.
I see you have an extra message “Start tag expected, '<' not found”. This was not reproducible. This shouldn’t be pathview output as it worked normally in your last analysis. I still think there is something problematic with your computer or analysis session.
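
One way to narrow down which pathway triggers such a message is to catch errors per pathway instead of letting the whole sapply call stop. The sketch below uses a stand-in function, render.pathway, that simulates a failure for a made-up pathway ID; it is hypothetical and does not call pathview, and the IDs in path.ids2 are only examples.

```r
# Stand-in for the real rendering call; fails for one made-up ID.
render.pathway <- function(pid) {
  if (pid == "99999") stop("Start tag expected, '<' not found")
  paste0("ame", pid, ".edger.png")
}

# Example IDs; "99999" is invented to trigger the simulated failure.
path.ids2 <- c("04745", "04391", "99999")

# Run each pathway separately and record success or the error message.
pv.status <- lapply(path.ids2, function(pid) {
  tryCatch(
    list(pid = pid, ok = TRUE, detail = render.pathway(pid)),
    error = function(e) list(pid = pid, ok = FALSE, detail = conditionMessage(e))
  )
})

# IDs whose call raised an error.
failed <- vapply(pv.status, function(x) !x$ok, logical(1))
failed.ids <- path.ids2[failed]
print(failed.ids)
```

With the real call in place of render.pathway, the recorded detail would show which pathway ID the message belongs to.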

**wmseq** · 12-12-2013, 12:13 PM

Hi bigmw,
As to “Start tag expected, '<' not found”, although I am not sure where it came from during my analysis session, I suspect the creation of the exp.fc file. As you know, in the protocol this file is created from the et object while running edgeR in Linux, whereas I created it from a .csv file in R, because I needed to convert beebase gene IDs into Entrez Gene IDs. Is that possible?
Sincerely,
Richard
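
The wording of that message matches what an XML parser reports when its input does not begin with a tag. One unconfirmed possibility is therefore a damaged or empty pathway XML file on disk rather than exp.fc. The sketch below only checks whether a file starts with '<'. looks.like.xml is a hypothetical helper, and the file name ame04391.xml in the last comment is an assumption based on the download messages earlier in the thread.

```r
# Hypothetical helper: TRUE if the first non-empty line starts with "<".
looks.like.xml <- function(path) {
  if (!file.exists(path) || file.size(path) == 0) return(FALSE)
  first <- trimws(readLines(path, n = 5, warn = FALSE))
  first <- first[nzchar(first)]
  length(first) > 0 && startsWith(first[1], "<")
}

# Two temporary files for demonstration: one XML-like, one not.
good <- tempfile(fileext = ".xml")
writeLines(c('<?xml version="1.0"?>', '<pathway name="demo"></pathway>'), good)
bad <- tempfile(fileext = ".xml")
writeLines("Service temporarily unavailable", bad)

print(looks.like.xml(good))
print(looks.like.xml(bad))

# In a real session one might check, e.g., looks.like.xml("ame04391.xml")
# (assumed file name) and delete the file so it is downloaded again.
```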

**Sharon Collins** · 12-12-2013, 12:26 PM

Biomedical Informatics Postdoctoral Fellowship in Statistical Genetics and Transcript

A postdoctoral training position is currently available in Dr. Gary H. Gibbons’ Cardiovascular Cluster in the Inherited Disease Branch, Cardiovascular Cluster (IDB-CC) of the National Human Genome Research Institute (NHGRI). The successful candidate is expected to join an established Cardiovascular Disease Cluster team, which is currently comprised of biomedical informatics analysts, physicians, nurses, research assistants, computer science and engineer staff. Additionally, the candidate will work closely with our sister lab in the IDB-CC that focuses on population epidemiology with staff consisting of a Principal Investigator and Senior Population Epidemiologist, and five additional population/genetic epidemiologists.
The ongoing projects in Dr. Gibbons’ IDB-CC use biomedical informatics and systems biology approaches to integrate data from platforms such as next generation sequencing for the identification of genetic variation (SNPs, indels/CNVs, splice variants, tandem repeats and admixture mapping etc.) and transcriptome variation (gene expression, GWAS, microRNA, and methylation) between ancestral populations with cardiovascular disease or other complex diseases. Our lab uses these high-throughput technologies to identify, categorize and evaluate genomic to phenomic relationships that contribute to prevalence, severity, host natural resistance and treatment responsiveness of minority populations with cardiovascular disease (CVD).
The qualified candidates should be highly motivated and have or be close to obtaining a MD and or Ph.D. with a focus in computational biology, statistical genetics, mathematics, bioinformatics, epigenetics or related field upon the job start date. The successful candidate should have experience in analyzing high-throughput genomic data, proficiency in at least one programming language (Perl, Java, R, Ruby, SAS, or C/C++) and very familiar with omics data dimensionality reduction utilizing statistical applications such as Plink, R GNU, Bioconductor and MATLAB. Good understanding of systems biology and familiarity with gene-gene interaction modeling and clustering with applications such as Ingenuity and GeneGO are desirable. Applicant must possess good communication skills and be fluent in both spoken and written English. Funding is available to support this position for up to five years. Salary is based on NIH standard. The candidate will have the opportunity to access many high throughput datasets and to interact with the investigators at the National Institutes of Health and other academic and science based institutions.
Interested applicants should submit curriculum vitae, a detailed letter of interest, and the names of three potential referees to Dr. Adam R. Davis, at [email protected] or to the address below.

Adam R. Davis, Ph.D.
Cardiovascular Cluster
Inherited Disease Branch
National Human Genome Research Institute
Building 10, Room 7N321
Bethesda, Maryland 20892

DHHS and NIH are Equal Opportunity Employers and encourage applications from
women and minorities.

gene sets for metabolism could not be used by gage/pathview!
