Old 05-13-2009, 10:03 PM   #1
Location: Cambridge, UK

Join Date: May 2008
Posts: 12
SNPs Comparison (Watson vs. YH vs. dbSNPs vs. X genome)


I have a list of SNPs in GFF format from a human genome experiment (output of SOAPsnp), similar to the following:

This is the format of the YH genome SNPs (Asian genome):

These are the Watson genome SNPs:

My end goal is to see how many SNPs they share, both overall and in coding regions, and which SNPs are novel (not in dbSNP). Then I want to represent the data visually using an R package. My questions are:
  • Where can I get dbSNP in GFF format for the human genome? It seems to be in MySQL format on the NCBI FTP site. If it is not available in GFF, how do I prepare one?
  • Could you give me an idea (as pseudocode) of how to compare novel SNPs that have no 'rs' ID number between the two genomes? It may be a simple task for many bioinformaticians, but I really don't have much experience writing algorithms.
  • Say I have six lists of SNPs from different human genome experiments. What is the best workflow for comparing them to each other (i.e. one against one, or one against all at the same time)?
  • Is comparing SNPs between genomes (3.2 million each) a CPU-intensive task, or does it need a lot of RAM? Would it need a cluster, or would a desktop do the job?

Note: I am not a programmer but I do simple scripting in python.

Thank you for your help.

Last edited by salturki; 05-13-2009 at 10:07 PM.
Old 05-14-2009, 05:57 AM   #2
Location: Wageningen, the Netherlands

Join Date: Jan 2008
Posts: 31

You may also be interested in this Korean genome, which does have GFF for its affy6 data. explains how to create a local mirror of dbSNP

From there you'll need to do a SELECT statement to pull out the rs#s.

Comparing non-rs# SNPs is not simple. If both SNPs are described against the same reference assembly it will be less painful, but that's unlikely to be true in the general case.
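For the easier case, where both lists are on the same reference assembly, here is a minimal Python sketch of one way to compare SNPs that have no rs#: key each SNP by chromosome and position and use set operations. It assumes standard tab-separated GFF (sequence name in column 1, start in column 4). The function names `snp_keys` and `compare` are made up for illustration, and it ignores alleles and strand, so treat it as a starting point rather than a finished method.

```python
def snp_keys(gff_lines):
    """Collect (chromosome, position) keys from GFF-style lines.

    Assumes tab-separated GFF: column 1 is the sequence name,
    column 4 is the 1-based start position.
    """
    keys = set()
    for line in gff_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[0], int(fields[3])))
    return keys


def compare(keys_a, keys_b):
    """Split two key sets into shared and list-specific SNPs."""
    return {
        "shared": keys_a & keys_b,
        "only_a": keys_a - keys_b,
        "only_b": keys_b - keys_a,
    }
```

Usage would look like `compare(snp_keys(open("a.gff")), snp_keys(open("b.gff")))`, where the file names are placeholders. Both lists must use the same chromosome naming (e.g. both `chr1` or both `1`) for the keys to match.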

The best workflow depends on what questions you want to answer. It also greatly affects your next question, CPU or memory? When programming you can usually tune for more memory/less CPU and vice versa. In your case I expect the simplest approach is to slurp everything from the GFF into memory, and then run queries against your MySQL database. That will be memory intensive. An alternative is to

1. Presort the GFF into numeric order.
2. Export dbSNP in numeric order.
3. Process the files sequentially (step forward in either file1 or file2, keeping the rs#s in sync). This will have low memory requirements once the two lists are in order, and will simplify the code by keeping the heavy lifting in well-optimized sorting routines.
4. This will be doable on a PC, but it'll take a while. If it were me I'd be crunching it in the Amazon cloud on a small machine during dev, then switching to one of the beefy machines for the real run. Getting started in the Amazon cloud may be more trouble than it's worth, in which case my slightly neglected, but soon to be resurrected might be of some interest.
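Steps 1 to 3 amount to a merge over two sorted lists. A rough Python sketch is below, assuming each input is an iterable of unique keys already sorted in the same order (for example rs numbers as integers, or `(chromosome, position)` tuples). The name `merge_sorted` is hypothetical, and reading the keys out of the GFF or the dbSNP export is left out.

```python
def merge_sorted(a, b):
    """Walk two sorted iterables of unique keys in step.

    Yields (key, in_a, in_b) for every key seen in either input.
    Only one key from each input is held in memory at a time.
    """
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, None)
    y = next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            # Present in both lists: advance both.
            yield x, True, True
            x = next(it_a, None)
            y = next(it_b, None)
        elif x < y:
            # Only in the first list: advance the first.
            yield x, True, False
            x = next(it_a, None)
        else:
            # Only in the second list: advance the second.
            yield y, False, True
            y = next(it_b, None)
    # One input is exhausted; drain whatever remains of the other.
    while x is not None:
        yield x, True, False
        x = next(it_a, None)
    while y is not None:
        yield y, False, True
        y = next(it_b, None)
```

With the first input being your SNP list and the second being dbSNP, the keys flagged `(True, False)` are the candidates for novel SNPs. The sort order has to be identical on both sides, otherwise the walk silently misses matches.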
cariaso
Old 05-17-2009, 09:40 PM   #3
Location: Cambridge, UK

Join Date: May 2008
Posts: 12


I appreciate your help.

Thank you
salturki

All times are GMT -8.