This topic is closed.

  • Clinician looks for advice regarding classifiers

    I'm a trained hematologist but new to research and genomic work. We want to develop a classifier for a gene and its target genes. In a cohort of tumor samples (profiled for about 16,000 genes, with 8 gene sets of 21-130 genes each), we found that many of these genes show coordinate enrichment.

    Now I need to develop a classifier so we can use NanoString and categorise patients as high expressers of the set or low expressers. Ideally we want this to become a clinical test.

    This is where I am floundering. How do I select the best gene set for a classifier? Does it need to be physiological or just statistical in origin?

    Any advice gratefully received. I am improving my stats knowledge but it remains rudimentary so be gentle!

  • #2
    Well, that's a pretty broad question. You're going to have to catalog what you have.
    How many samples? What are you trying to predict? What are your "gene sets"? Are they pre-defined "pathways"? Are they interesting sets of correlated genes (from clustering)?

    There are some sophisticated machine learning tools ("random forests", "support vector machines", etc.), but to start with you might want to keep it simple. If you're predicting "develops disease" vs. "doesn't get the disease", start with a simple t-test, which tests for a difference in means between two groups. Understand the assumptions of the t-test.
    If your prediction is something like "days to relapse", i.e. something continuous, then check out "linear regression". You'll want to use R, Matlab or a programming language with a good statistics library. Wikipedia, DuckDuckGo and a lot of patience are your friends. Please check out "False Discovery Rate" and "Bonferroni correction". Just getting a p-value of 0.05 on something while testing 20,000 separate probes is not good enough: at that threshold you would expect roughly 1,000 probes to pass by chance alone. Make sure you have enough samples.
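
To make the multiple-testing point concrete, here is a minimal sketch of the two corrections named above, written in plain Python. The p-values are made up for illustration and the function names are hypothetical; in practice you would get one p-value per probe from your t-tests and would probably use a statistics library rather than hand-rolled code.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per probe (illustrative only).
pvals = [0.0001, 0.004, 0.03, 0.04, 0.2, 0.6]

bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

raw_hits = [p <= 0.05 for p in pvals]
bonf_hits = [p <= 0.05 for p in bonf]
bh_hits = [p <= 0.05 for p in bh]

print("raw:       ", sum(raw_hits), "probes pass")
print("Bonferroni:", sum(bonf_hits), "probes pass")
print("BH (FDR):  ", sum(bh_hits), "probes pass")
```

With these six made-up values, four probes pass the raw 0.05 threshold but only two survive either correction. With 20,000 probes the gap is far larger.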

    Verifying your results on an independent data set is critical. It's easy to get fooled by torturing one dataset and never getting confirmation on another. You want to aim for "we got Bonferroni-adjusted statistical significance and verified it with another dataset".

    Whether it needs to be "statistical" or "physiological" is a philosophical question. From an engineering perspective, as long as it works, who cares how it works? From a science perspective, it's good to explain what's going on. Try to get something that works, and then try to explain it.
    Last edited by Richard Finney; 07-17-2012, 07:29 AM. Reason: got C's in spelling in grade school


    • #3
      Excellent reply!


      • #4
        Thanks, more details

        I agree, an excellent and helpful reply.

        We have 80 patient samples which were profiled on DASL using frozen and then paraffin-derived RNA. The gene sets are a mixture of disease-specific pathway gene sets (MYC & targets) and tissue-agnostic sets; 8 were used in total and GSEA gave FDRs < 0.25, so we are hopeful we will get better results with more recent platforms.

        The next step is to run a training set on NanoString with a truncated gene set, look at the results and try to develop a classifier for a future test set of 80 to 200 samples. All we are trying to decide is 'high expression' vs 'low expression' and validate it as a clinical test.
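
One possible shape for that training/test workflow, as a rough sketch only: score each sample by its average z-score across the gene set, fix the high/low cutoff on the training set, and then apply the frozen scaling and cutoff to the test set without re-tuning. The gene names, the expression values and the choice of a median cutoff are all assumptions made for illustration. Real NanoString counts would also need normalisation before this step.

```python
from statistics import mean, median, stdev


def fit_scaler(train):
    """Per-gene mean and standard deviation, learned from the training set only."""
    genes = list(train[0].keys())
    return {g: (mean(s[g] for s in train), stdev(s[g] for s in train)) for g in genes}


def signature_score(sample, scaler):
    """Average z-score of the sample across the genes in the set."""
    return mean((sample[g] - mu) / sd for g, (mu, sd) in scaler.items())


def fit_cutoff(train, scaler):
    """Median training score, used here as the high/low boundary (an assumption)."""
    return median(signature_score(s, scaler) for s in train)


def classify(sample, scaler, cutoff):
    """Label a sample using parameters frozen from the training set."""
    return "high" if signature_score(sample, scaler) > cutoff else "low"


# Hypothetical log-scale expression values for two made-up genes.
train = [
    {"GENE_A": 5.0, "GENE_B": 6.0},
    {"GENE_A": 6.0, "GENE_B": 7.0},
    {"GENE_A": 8.0, "GENE_B": 9.0},
    {"GENE_A": 9.0, "GENE_B": 10.0},
]
test = [
    {"GENE_A": 9.5, "GENE_B": 9.0},
    {"GENE_A": 5.5, "GENE_B": 6.5},
]

scaler = fit_scaler(train)
cutoff = fit_cutoff(train, scaler)

# The test samples never influence the scaler or the cutoff.
labels = [classify(s, scaler, cutoff) for s in test]
print(labels)
```

The point of the design is that nothing is estimated from the test samples, so their labels are an honest check of the classifier rather than a second round of fitting.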
