
  • SNPs Comparison (Watson vs. YH vs. dbSNP vs. X genome)

    Hi,

    I got a list of SNPs in GFF format from a human genome experiment (as output of SOAPsnp) similar to the following:

    This is the format of YH genome SNPs (Asian genome):


    These are the Watson genome SNPs:


    My end goal is to see how many SNPs they share, both overall and in coding regions, and which SNPs are novel (not in dbSNP). I then want to present the data visually using an R package. My questions are:
    • Where can I get dbSNP in GFF format for the human genome? On the NCBI FTP site it seems to be available only in MySQL format. If it is not available in GFF, how do I prepare one?

    • Could you give me an idea (as pseudocode) of how to compare novel SNPs that have no 'rs' ID number between the 2 genomes? It may be a simple task for many bioinformaticians, but I really don't have much experience writing algorithms.

    • Say I have 6 lists of SNPs from different human genome experiments. What is the best workflow for comparing them with each other (i.e. one against one, or one against all at the same time)?

    • Is comparing SNPs between genomes (3.2 million each) a CPU-intensive task, or does it need a lot of RAM? Would it need a cluster, or would a desktop do the job?


    Note: I am not a programmer, but I do simple scripting in Python.

    Thank you for your help.
    Last edited by salturki; 05-13-2009, 10:07 PM.

  • #2
    You may also be interested in this Korean genome, which does have GFF for its Affy6 data.



    ftp://ftp.kobic.re.kr/pub/PersonalGe..._Q40d4D100.gff

    http://www.ncbi.nlm.nih.gov/books/bv...ion.ch5.ch5-s6 explains how to create a local mirror of dbSNP.

    From there you'll need to run a SELECT statement to pull out the rs#s.

    Comparing non-rs# SNPs is not simple. If both SNPs are described against the same reference assembly it will be less painful, but that's unlikely to be true in the general case.
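
    If both SNP lists were called against the same reference assembly, one way to compare SNPs that have no rs# is to key each one on its chromosome and position. What follows is only a minimal Python sketch under that assumption. The function names `snp_keys` and `sharing_counts` are made up for illustration, the input is assumed to be standard tab-separated GFF (sequence name in column 1, start coordinate in column 4), and alleles are ignored because the layout of the SOAPsnp attribute column is not shown in this thread.

```python
from collections import Counter


def snp_keys(gff_lines):
    """Return the set of (chromosome, position) keys found in GFF lines.

    Assumes tab-separated GFF: column 1 is the sequence name and
    column 4 is the start coordinate. Alleles are NOT compared here.
    """
    keys = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[0], int(fields[3])))
    return keys


def sharing_counts(key_sets):
    """Count, for every SNP key, how many of the given genomes carry it."""
    counts = Counter()
    for keys in key_sets:
        counts.update(keys)
    return counts


# Tiny made-up records, only to show the calls (not real SNP data).
genome_a = snp_keys([
    "chr1\tdemo\tSNP\t100\t100\t.\t+\t.\t.",
    "chr1\tdemo\tSNP\t250\t250\t.\t+\t.\t.",
])
genome_b = snp_keys([
    "chr1\tdemo\tSNP\t250\t250\t.\t+\t.\t.",
    "chr2\tdemo\tSNP\t75\t75\t.\t+\t.\t.",
])

shared = genome_a & genome_b   # present in both genomes
only_a = genome_a - genome_b   # present only in the first genome
counts = sharing_counts([genome_a, genome_b])
```

    With a real file you would pass the open file object instead of a list of strings. `sharing_counts` accepts any number of key sets, so several genomes can be tallied in one pass rather than pair by pair.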

    The best workflow depends on what questions you want to answer. That also greatly affects your next question, CPU or memory? When programming you can usually tune for more memory and less CPU, or vice versa. In your case I expect the simplest approach is to slurp everything from the GFF into memory and then run queries against your MySQL database. That will be memory intensive. An alternative is to:

    1. Presort the GFF into numeric order.
    2. Export dbSNP in numeric order.
    3. Process the files sequentially (step forward in either file1 or file2, keeping the rs#s in sync). This has low memory requirements once the two lists are in order, and it simplifies the code by keeping the heavy lifting in well-optimized sorting routines.

    This will be doable on a PC, but it'll take a while. If it were me, I'd crunch it in the Amazon cloud on a small machine during development, then switch to one of the beefy machines for the real run. Getting started in the Amazon cloud may be more trouble than it's worth, in which case my slightly neglected, but soon to be resurrected, www.runblast.com might be of some interest.
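
    Steps 1 to 3 above amount to a merge join over two sorted lists. A minimal Python sketch follows, assuming each input is an ascending sequence of unique integer rs numbers (for example, read lazily from a presorted file). `merge_compare` is a hypothetical helper name, not part of any library.

```python
def merge_compare(sorted_a, sorted_b):
    """Walk two ascending iterables of unique keys in lockstep.

    Yields ("a", key) for keys only in the first input, ("b", key) for
    keys only in the second, and ("both", key) for shared keys. Only one
    key from each input is held in memory at a time.
    """
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            yield ("both", a)
            a = next(it_a, None)
            b = next(it_b, None)
        elif a < b:
            yield ("a", a)          # a is behind, so it cannot be in b
            a = next(it_a, None)
        else:
            yield ("b", b)          # b is behind, so it cannot be in a
            b = next(it_b, None)
    # One input is exhausted; whatever remains is unique to the other.
    while a is not None:
        yield ("a", a)
        a = next(it_a, None)
    while b is not None:
        yield ("b", b)
        b = next(it_b, None)


# Made-up rs numbers, only to show the call.
result = list(merge_compare([1, 3, 5, 7], [3, 4, 7, 9]))
```

    If the first input were your genome's rs#s and the second were dbSNP, the "both" entries would be the known SNPs. SNPs with no rs# at all would need a position-based key instead.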



    • #3
      cariaso,

      I appreciate your help.

      Thank you
