SEQanswers

Old 01-23-2012, 06:22 PM   #1
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38
Require dbSNP file in VCF format for Mycobacterium tuberculosis

I have to run GATK's Base Quality Score Recalibration for the organism Mycobacterium tuberculosis. One of the input files is a dbSNP file for this bacterium. I am unable to get this file in the VCF format required by GATK's recalibration program. Is there a way I can convert the SNP data from http://www.ncbi.nlm.nih.gov/snp into VCF format for Mycobacterium tuberculosis?

Appreciate the help.
Old 01-23-2012, 06:51 PM   #2
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64

I've written a Perl script for converting from dbSNP format to VCF format. It's not a perfect solution and comes with two pretty large caveats: firstly, it will only convert SNVs and will not convert InDels, and secondly, it has only been tested on the mm9 dbSNP128 (which is the database I needed to convert).

The file can be downloaded from https://sites.google.com/site/peterhickey/home/software and is further explained at http://statsandgenomes.wordpress.com...bsnp-vcf-file/

Please feel free to modify the script as you desire.

Cheers,
Pete

Last edited by PeteH; 01-24-2012 at 02:33 PM.
Old 01-23-2012, 09:45 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

I encountered the same problem when trying to put microbial data through GATK, but I don't think that downloading and converting some file off the internet is going to help.

What do you think the software is going to do with that VCF, and is that operation going to help you answer the question your data is supposed to answer? It seems to me that including a VCF in that command is supposed to help you filter out SNPs known to be found in the population, so that you can concentrate on SNPs novel to your sample. But is that necessarily what you want to do with your experiment? If my sample has a KatG or GyrA mutation, I don't want those variants ignored because they are already described in the literature; I need to know they are there.
Old 01-24-2012, 02:21 PM   #4
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38

Thanks a tonne, Pete. One more question for you: what kind of input did you give this code? I mean, in what format did you download the SNPs from NCBI? It offers a variety of choices (text file, FASTA file, etc.). I am a beginner with Perl, so I'm still catching up with it.
Old 01-24-2012, 02:41 PM   #6
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38

Quote:
Originally Posted by swbarnes2
I encountered the same problem when trying to put microbial data through GATK, but I don't think that downloading and converting some file off the internet is going to help.

What do you think the software is going to do with that VCF, and is that operation going to help you answer the question your data is supposed to answer? It seems to me that including a VCF in that command is supposed to help you filter out SNPs known to be found in the population, so that you can concentrate on SNPs novel to your sample. But is that necessarily what you want to do with your experiment? If my sample has a KatG or GyrA mutation, I don't want those variants ignored because they are already described in the literature; I need to know they are there.
***********************************

swbarnes2 - I think you have misunderstood how this works. Filtering out known SNPs is necessary because the CountCovariates program counts how often bases in my reads mismatch the reference genome, and it treats those mismatches as sequencing errors. A SNP will obviously mismatch, since by definition it is a difference between my organism and the reference at that position, so it is better to ignore known SNP positions.

Ashu
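
To make the argument above concrete, here is a minimal Python sketch of why known variant positions are skipped when mismatches are tallied. It is not GATK's CountCovariates code: the function name, the layout of the input tuples, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites):
    """Estimate an empirical Phred quality for each reported quality score.

    observations: iterable of (position, reported_quality, is_mismatch) tuples,
                  one per aligned base (hypothetical layout).
    known_sites:  set of reference positions that hold known variants.
    """
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [bases, mismatches]
    for position, reported_quality, is_mismatch in observations:
        if position in known_sites:
            # A mismatch here is probably real variation, not a sequencing
            # error, so the base is left out of the error tally entirely.
            continue
        counts[reported_quality][0] += 1
        counts[reported_quality][1] += int(is_mismatch)

    empirical = {}
    for reported_quality, (bases, mismatches) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) when no mismatch is seen.
        error_rate = (mismatches + 1) / (bases + 2)
        empirical[reported_quality] = -10 * math.log10(error_rate)
    return empirical
```

In this sketch the known sites are only masked while error rates are estimated. Nothing here removes a variant at a known position from later variant calls, which is the distinction the two posts above are arguing about.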
Old 01-24-2012, 02:54 PM   #7
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64

Quote:
Originally Posted by ashuchawla
Thanks a tonne, Pete. One more question for you: what kind of input did you give this code? I mean, in what format did you download the SNPs from NCBI? It offers a variety of choices (text file, FASTA file, etc.). I am a beginner with Perl, so I'm still catching up with it.

Unfortunately I'm not familiar with bacterial genomics, and it doesn't appear that the SNPs for Mycobacterium tuberculosis are available from the site I downloaded the mouse data from, http://hgdownload.cse.ucsc.edu/downloads.html (more specifically, http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz).

My script will only be useful if you can find your SNPs in a format similar to that of http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz.
Old 01-24-2012, 03:07 PM   #8
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38

Quote:
Originally Posted by PeteH
Unfortunately I'm not familiar with bacterial genomics, and it doesn't appear that the SNPs for Mycobacterium tuberculosis are available from the site I downloaded the mouse data from, http://hgdownload.cse.ucsc.edu/downloads.html (more specifically, http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz).

My script will only be useful if you can find your SNPs in a format similar to that of http://hgdownload.cse.ucsc.edu/golde.../snp128.txt.gz.
************************************
Yes, but they are available at http://www.ncbi.nlm.nih.gov/snp. If you enter "mycobacterium tuberculosis" in the box next to "for" and click "Go", you will see a long list of SNPs. You can download them by clicking "Send To".
Old 01-24-2012, 03:16 PM   #9
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64

Quote:
Originally Posted by ashuchawla
************************************
Yes, but they are available at http://www.ncbi.nlm.nih.gov/snp. If you enter "mycobacterium tuberculosis" in the box next to "for" and click "Go", you will see a long list of SNPs. You can download them by clicking "Send To".

I realise this, but the format is different to that supported by my script. Having looked at your file format, I think it should be fairly simple to convert it to VCF with a bit of scripting.
Old 01-24-2012, 03:27 PM   #10
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38

Quote:
Originally Posted by PeteH
I realise this, but the format is different to that supported by my script. Having looked at your file format, I think it should be fairly simple to convert it to VCF with a bit of scripting.

Oh great, thanks for saying that. I am a newbie in this field and this is my first project; I hope it works out.
So do you think I should modify your code or write a new one from scratch? What exactly do you mean by scripting here?

Also, I am not able to open the links that you posted. I don't know why; I will keep trying.
Old 01-24-2012, 03:38 PM   #11
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64

Quote:
Originally Posted by ashuchawla
Oh great, thanks for saying that. I am a newbie in this field and this is my first project; I hope it works out.
So do you think I should modify your code or write a new one from scratch? What exactly do you mean by scripting here?

Also, I am not able to open the links that you posted. I don't know why; I will keep trying.

Sorry, I'm not sure why my links aren't working for you.

By scripting I mean writing a program in, for example, the Perl or Python programming language. Your program should read in each line of your SNP file one by one, convert each line to VCF and write each converted line to an output file.
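
The read-convert-write loop described above can be sketched in Python roughly as follows. The input layout (tab-separated chromosome, 1-based position, rs ID, reference allele, alternate allele) is a hypothetical one chosen to keep the example short; a real dbSNP export will need its own parsing, and the function names are illustrative only.

```python
import io
import sys

# Minimal VCF 4.1 header: the file format line plus the eight fixed columns.
VCF_HEADER = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def snp_line_to_vcf(line):
    """Convert one line of the hypothetical five-column input to a VCF record."""
    chrom, pos, rs_id, ref, alt = line.rstrip("\n").split("\t")
    # QUAL, FILTER and INFO are unknown here, so they get the VCF missing value ".".
    return "\t".join([chrom, pos, rs_id, ref, alt, ".", ".", "."])


def convert(in_handle, out_handle):
    """Read SNP lines one by one and write the converted records."""
    out_handle.write(VCF_HEADER)
    for line in in_handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        out_handle.write(snp_line_to_vcf(line) + "\n")


if __name__ == "__main__":
    # Made-up record, only to show the call; real use would pass open file handles.
    demo_input = io.StringIO("chr1\t100\trs0000001\tA\tG\n")
    convert(demo_input, sys.stdout)
```

Taking file handles rather than file names keeps the conversion easy to try on a few lines before running it on the full SNP file.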

Do you have any experience programming in a particular language? This sort of problem is a great way to learn basic text-parsing and text-manipulation. I'd start from scratch if I were you since you'll learn a lot more by doing it this way and also because my code is not going to be a lot of help.

Be sure that you have a good understanding of the subtleties of the dbSNP format and VCF (the VCF format is described in detail at http://www.1000genomes.org/wiki/Anal...mat-version-41). For instance, VCF is a "1-based" format because the first position on a chromosome is called position 1; this is in contrast to "0-based" formats where the first position on a chromosome is called position 0.
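
The off-by-one issue can be sketched as below, assuming the input stores a 0-based start and an end-exclusive end coordinate, as BED-style files do. Whether a particular dbSNP export follows that convention has to be checked against its own documentation, and both function names are made up for this example.

```python
def vcf_pos_from_zero_based_start(chrom_start):
    """Convert a 0-based start coordinate to a 1-based VCF POS."""
    return chrom_start + 1


def is_single_base(chrom_start, chrom_end):
    """Return True if a 0-based, end-exclusive interval covers exactly one base."""
    return chrom_end - chrom_start == 1


# The first base of a chromosome is interval [0, 1) in the 0-based scheme
# and position 1 in VCF.
first_base_pos = vcf_pos_from_zero_based_start(0)
```

A check like `is_single_base` is one way to skip records that are not single-base substitutions before converting them.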

Good luck!
Old 01-24-2012, 03:47 PM   #12
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38

Quote:
Originally Posted by PeteH
Sorry, I'm not sure why my links aren't working for you.

By scripting I mean writing a program in, for example, the Perl or Python programming language. Your program should read in each line of your SNP file one by one, convert each line to VCF and write each converted line to an output file.

Do you have any experience programming in a particular language? This sort of problem is a great way to learn basic text-parsing and text-manipulation. I'd start from scratch if I were you since you'll learn a lot more by doing it this way and also because my code is not going to be a lot of help.

Be sure that you have a good understanding of the subtleties of the dbSNP format and VCF (the VCF format is described in detail at http://www.1000genomes.org/wiki/Anal...mat-version-41). For instance, VCF is a "1-based" format because the first position on a chromosome is called position 1; this is in contrast to "0-based" formats where the first position on a chromosome is called position 0.

Good luck!

Thank you so much, Pete. I appreciate your help. I have some experience in PL/SQL and Java. I should start on my code then. Thanks again.
Old 01-25-2012, 03:45 PM   #13
ashuchawla
Member
 
Location: san diego

Join Date: Jan 2012
Posts: 38

Quote:
Originally Posted by ashuchawla
Thank you so much, Pete. I appreciate your help. I have some experience in PL/SQL and Java. I should start on my code then. Thanks again.

Pete, one more question: did you have the positions of the SNPs in your dbSNP file? I cannot check this, as I am unable to access your files. The "position" field in the VCF file has to be populated from the corresponding position in the dbSNP file, but I do not have those position numbers in the NCBI SNP data.

I just wanted to check how you managed this problem, or did you have the position numbers in your SNP file already? Is there a way of mapping around 30k SNPs to a reference genome and getting their respective position numbers?
Old 01-28-2012, 11:46 PM   #14
PeteH
Member
 
Location: Melbourne

Join Date: Jun 2010
Posts: 64

Apologies for the delay in my response. My data did have positions in the dbSNP file. I'm not sure how to deal with this issue in the Mycobacterium tuberculosis data. Sorry I couldn't be more help.
Tags
dbsnp file, gatk, vcf format

All times are GMT -8.