Hi,
I have a list of SNPs in GFF format from a human genome experiment (output of SOAPsnp), similar to the following:
This is the format of the YH genome SNPs (Asian genome):
These are the Watson genome SNPs:
My end goal is to see how many SNPs the two genomes share, both overall and within coding regions, and which SNPs are novel (not in dbSNP). I then want to represent the data visually using an R package. My questions are:
- Where can I get dbSNP in GFF format for the human genome? It seems to be in MySQL format on the NCBI FTP site. If it is not available in GFF, how can I prepare one?
- Could you give me an idea (as pseudocode) of how to compare novel SNPs, those with no 'rs' ID number, between the two genomes? It may be a simple task for many bioinformaticians, but I don't have much experience writing algorithms.
- Say I have six lists of SNPs from different human genome experiments. What is the best workflow for comparing them to each other (i.e. one against one, or one against all at the same time)?
- Is comparing SNPs between genomes (3.2 million each) considered a CPU-intensive task, or does it need a lot of RAM? Would it need a cluster, or would a desktop do the job?
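To make the second and third questions concrete, here is a rough Python sketch of the kind of comparison being asked about. It is only an illustration under several assumptions: the files are tab-separated GFF with the chromosome in column 1 and the position in column 4, two SNPs count as "the same" when they fall at the same chromosome and position (alleles are ignored), and `known_positions` is a hypothetical set of dbSNP positions prepared separately. The function names are made up for this example and are not part of SOAPsnp or any library.

```python
from collections import Counter
from itertools import combinations


def read_snp_positions(lines):
    """Collect (chromosome, position) keys from GFF-formatted lines."""
    positions = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comment/header lines
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete GFF record
        # Assumption: GFF column 1 is the sequence name, column 4 the start.
        positions.add((fields[0], int(fields[3])))
    return positions


def compare_genomes(snps_by_genome, known_positions):
    """Compare any number of SNP sets in one pass.

    snps_by_genome: dict mapping a genome label to its set of positions.
    known_positions: set of positions already catalogued (e.g. from dbSNP).
    """
    # One-against-one: number of shared positions for every pair of genomes.
    shared = {
        (a, b): len(snps_by_genome[a] & snps_by_genome[b])
        for a, b in combinations(sorted(snps_by_genome), 2)
    }
    # Novel SNPs: positions in a genome that are absent from the known set.
    novel = {
        name: snps - known_positions
        for name, snps in snps_by_genome.items()
    }
    # All-at-once: in how many genomes does each position occur?
    carriers = Counter(
        pos for snps in snps_by_genome.values() for pos in snps
    )
    return shared, novel, carriers
```

Each file would be read with something like `read_snp_positions(open(path))`, and the resulting sets collected in a dictionary keyed by genome name before calling `compare_genomes`. Set intersection and difference handle the one-against-one case, while the `Counter` gives the one-against-all view for six lists without running every pairwise comparison separately.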
Note: I am not a programmer, but I do simple scripting in Python.
Thank you for your help.