
**Richard Finney** · 07-16-2012, 07:11 PM

Well, that's a pretty broad question. You're going to have to catalog what you have.
How many samples? What are you trying to predict? What are your "gene sets"? Are they pre-defined "pathways"? Are they interesting sets of correlated genes (from clustering)?

There are some sophisticated machine learning tools ("random forests", "support vector machines", etc.), but to start with, you might want to keep it simple. If you're predicting "develops disease" vs. "doesn't get the disease", start with a simple t-test, which tests for a difference in means between two groups. Understand the assumptions of the t-test.
If your prediction is something like "days to relapse", i.e. something continuous, then check out "linear regression". You'll want to use R, Matlab or a programming language with a good statistics library. Wikipedia, DuckDuckGo and a lot of patience are your friends. Please check out "False Discovery Rate" and "Bonferroni correction". Just getting a p-value of 0.05 while testing 20,000 separate probes is not good enough: at that threshold you would expect roughly 1,000 probes to pass by chance alone. Make sure you have enough samples.
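To make the multiple-testing point concrete, here is a minimal sketch in plain Python of the two corrections mentioned above. The function names and the toy p-values are made up for illustration; in practice you would more likely use an existing routine such as `p.adjust` in R rather than roll your own.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg (FDR) adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# The 20,000-probe scenario from the post, with alpha = 0.05.
n_probes = 20000
alpha = 0.05
expected_false_positives = n_probes * alpha  # probes passing by chance if nothing is real
bonferroni_threshold = alpha / n_probes      # raw p-value a single probe must beat

# Made-up p-values, for illustration only.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.2]
bonf = bonferroni(toy_pvalues)
bh = benjamini_hochberg(toy_pvalues)
```

Bonferroni controls the chance of even one false positive and is the stricter of the two; the false discovery rate approach accepts a stated fraction of false positives among the hits, which is usually the more practical choice for a genome-wide screen.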

Verifying your results on an independent data set is critical. It's easy to fool yourself by torturing one dataset and never confirming the result on another. You want to aim for "we got Bonferroni-adjusted statistical significance and verified it with another dataset".

Whether it needs to be "statistical" or "physiological" is a philosophical question. From an engineering perspective, as long as it works, who cares how it works? From a science perspective, it's good to explain what's going on. Try to get something that works, and then try to explain it.

**kopi-o** · 07-16-2012, 11:01 PM

Excellent reply!

**kicka11** · 07-17-2012, 07:09 AM

Thanks, more details

I agree, an excellent and helpful reply.

We have 80 patient samples which were profiled on DASL using frozen and then paraffin-derived RNA. The gene sets are a mixture of disease-specific gene sets for pathways (MYC & targets) and tissue-agnostic sets; 8 were used in total. GSEA gave FDRs <0.25, so we are hopeful we will get better results with more recent platforms.

The next step is to use a training set on Nanostring with a truncated gene set, look at the results and try to develop a classifier for a test set of between 80 - 200 samples in the future. All we are trying to decide is 'high expression' vs 'low expression' and validate it as a clinical test.
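One very simple way to frame a 'high' vs 'low' call is sketched below, purely as an assumption about how such a classifier could be set up and not as the method anyone in this thread actually used. Each sample gets the average z-score of the gene set, and the cutoff is the median score of the training set. The gene names, values, and function names are all hypothetical. The key point is that the per-gene scaling and the cutoff are learned from the training samples only and then applied unchanged to the test samples.

```python
from statistics import mean, median, pstdev


def score_sample(sample, stats):
    """Average z-score across the gene set for one sample."""
    return mean((sample[g] - mu) / sd for g, (mu, sd) in stats.items())


def fit_score_model(training_samples, gene_set):
    """Learn per-gene mean/SD and a median score cutoff from the training set only."""
    stats = {}
    for gene in gene_set:
        values = [s[gene] for s in training_samples]
        sd = pstdev(values)
        stats[gene] = (mean(values), sd if sd > 0 else 1.0)  # guard against zero SD
    cutoff = median(score_sample(s, stats) for s in training_samples)
    return stats, cutoff


def classify(sample, stats, cutoff):
    """Call a sample 'high' if its score exceeds the training-set cutoff."""
    return "high" if score_sample(sample, stats) > cutoff else "low"


# Hypothetical expression values; GENE_A and GENE_B are placeholder names.
training = [
    {"GENE_A": 1.0, "GENE_B": 2.0},
    {"GENE_A": 3.0, "GENE_B": 4.0},
    {"GENE_A": 5.0, "GENE_B": 6.0},
]
stats, cutoff = fit_score_model(training, ["GENE_A", "GENE_B"])

call_1 = classify({"GENE_A": 4.5, "GENE_B": 5.5}, stats, cutoff)
call_2 = classify({"GENE_A": 2.0, "GENE_B": 3.0}, stats, cutoff)
```

Freezing the scaling and the cutoff before looking at the test set is what makes the later 80 - 200 samples a genuine validation rather than a second round of fitting.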


Clinician looks for advice regarding classifiers
