  • SNPs Comparison (Watson vs. YH vs. dbSNP vs. X genome)

    Hi,

    I got a list of SNPs in GFF format from a human genome experiment (as output of SOAPsnp) similar to the following:

    This is the format of YH genome SNPs (Asian genome):


    This is the Watson genome SNPs:


    My end goal is to see how many SNPs they share, both overall and in coding regions, and which SNPs are novel (not in dbSNP). Then I want to represent the data visually using an R package. My questions are:
    • Where can I get dbSNP in GFF format for the human genome? It seems to be in MySQL format on the NCBI FTP site. If it is not available in GFF, how can I prepare one?

    • Could you give me an idea (as pseudocode) of how to compare novel SNPs with no 'rs' ID number between the two genomes? It may be a simple task for many bioinformaticians, but I really don't have much experience writing algorithms.

    • Say I have six lists of SNPs from different human genome experiments. What is the best workflow for comparing them to each other (i.e., one against one, or one against all at the same time)?

    • Is comparing SNPs between genomes (3.2 million each) a CPU-intensive task, or does it need a lot of RAM? Would it need a cluster, or would a desktop do the job?


    Note: I am not a programmer, but I do simple scripting in Python.

    Thank you for your help.
    Last edited by salturki; 05-13-2009, 10:07 PM.

  • #2
    You may also be interested in this Korean genome, which does have GFF for its affy6 data.



    ftp://ftp.kobic.re.kr/pub/PersonalGe..._Q40d4D100.gff

    http://www.ncbi.nlm.nih.gov/books/bv...ion.ch5.ch5-s6 explains how to create a local mirror of dbSNP.

    From there you'll need to run a SELECT statement to pull out the rs#s.

    Comparing non-rs# SNPs is not simple. If both SNPs are described against the same reference assembly it will be less painful, but that's unlikely to be true in the general case.
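
To illustrate the same-assembly case: when both SNP lists use coordinates from the same reference assembly, a SNP without an rs# can be identified by its chromosome and position. Here is a minimal Python sketch, assuming standard 9-column tab-separated GFF in which column 1 is the sequence name and column 4 is the start coordinate. The function name `snp_keys` and the sample records are made up for illustration. A real comparison would probably also want to check the alleles, and where SOAPsnp puts those in the attribute column is not shown here.

```python
def snp_keys(gff_lines):
    """Return a set of (chromosome, start) keys from GFF-formatted lines."""
    keys = set()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comment/header lines
        fields = line.split("\t")
        # GFF column 1 = sequence name, column 4 = start (1-based)
        keys.add((fields[0], int(fields[3])))
    return keys


# Hypothetical records, NOT real YH or Watson data.
yh_lines = [
    "chr1\tSOAPsnp\tSNP\t1000\t1000\t.\t+\t.\tID=example1",
    "chr1\tSOAPsnp\tSNP\t2500\t2500\t.\t+\t.\tID=example2",
    "chr2\tSOAPsnp\tSNP\t300\t300\t.\t+\t.\tID=example3",
]
watson_lines = [
    "chr1\tSOAPsnp\tSNP\t1000\t1000\t.\t+\t.\tID=example4",
    "chr2\tSOAPsnp\tSNP\t999\t999\t.\t+\t.\tID=example5",
]

yh = snp_keys(yh_lines)
watson = snp_keys(watson_lines)

shared = yh & watson    # positions present in both lists
yh_only = yh - watson   # positions seen only in the first list

print(len(shared), "shared;", len(yh_only), "only in the first list")
```

With real files, `snp_keys(open(path))` would work the same way, since a file object yields lines. This version holds every key in memory, so it sits on the memory-heavy side of the trade-off.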

    The best workflow depends on what questions you want to answer. That also greatly affects your next question: CPU or memory? When programming you can usually trade more memory for less CPU and vice versa. In your case I expect the simplest approach is to slurp everything from the GFF into memory and then run queries against your MySQL database. That will be memory intensive. An alternative is to:

    1. Presort the GFF into numeric order by rs#.
    2. Export dbSNP in the same numeric order.
    3. Process the two files sequentially (at each step, advance in either file1 or file2, keeping the rs#s in sync). This has low memory requirements once the two lists are in order, and it simplifies the code by leaving the heavy lifting to well-optimized sorting routines.
    4. This will be doable on a PC, but it'll take a while. If it were me, I'd crunch it in the Amazon cloud on a small machine during development, then switch to one of the beefy machines for the real run. Getting started in the Amazon cloud may be more trouble than it's worth, in which case my slightly neglected, but soon to be resurrected, www.runblast.com might be of some interest.
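
Steps 1 to 3 above amount to a merge of two sorted lists. Here is a rough Python sketch, assuming each input is already sorted ascending on the same key with no duplicates. It uses integer rs numbers, but (chromosome, position) tuples would work the same way provided both files were sorted identically. The name `merge_compare` and the sample numbers are hypothetical.

```python
def merge_compare(a, b):
    """Walk two sorted, duplicate-free sequences in step.

    Returns (only_in_a, shared, only_in_b) as counts.
    """
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    only_a = shared = only_b = 0

    while x is not None and y is not None:
        if x == y:            # same key in both lists
            shared += 1
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:           # a is behind: key exists only in a
            only_a += 1
            x = next(it_a, None)
        else:                 # b is behind: key exists only in b
            only_b += 1
            y = next(it_b, None)

    # One list is exhausted; whatever remains in the other is unique to it.
    while x is not None:
        only_a += 1
        x = next(it_a, None)
    while y is not None:
        only_b += 1
        y = next(it_b, None)

    return only_a, shared, only_b


# Made-up rs numbers, already sorted.
sample_rs = [3, 7, 10, 42]
dbsnp_rs = [7, 8, 42, 99]

result = merge_compare(sample_rs, dbsnp_rs)
print(result)  # (2, 2, 2)
```

Because the inputs are consumed as iterators, they could just as well be generators that read one line at a time from the sorted files, so neither list has to be held in memory.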



    • #3
      cariaso,

      I appreciate your help.

      Thank you
