Hi there,
First of all, I have to say how incredible the work being done at the 1KG project is. This post is not meant to criticize their work, only to ask for some insight into how best to use their dataset.
So, I'm trying to obtain a whole-genome VCF file for one sample at a time, and it is ridiculously difficult and tedious. The variation data in 1KG is provided as one VCF file per chromosome, containing the genotypes of all samples analyzed. These files are big even when compressed: from 11G for chromosome 1 to 1.8G for chromosome 22.
The first approach provided by the 1KG project is the tabix + vcftools command-line tools:
Code:
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | vcf-subset -c NA12842 | bgzip -c > NA12842.1KG.chr1.vcf.gz
This does not work very well as it relies on the network, so my first thought was to download the ALL.chr1... file. OK, but it still takes a long time for tabix to read the file and make a subset, and you have to do this for 24 chromosomes and then merge the results.
The second approach provided by the 1KG project is an online tool: the Data Slicer. It works all right, but only for small regions. As soon as you try to select a whole chromosome from a VCF file it does not seem to finish loading.
So, I downloaded all the VCF files and wrote a script to launch all the needed commands, but it takes days to output a whole-genome VCF file for a single sample. It is not really a good solution.
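To make the per-file step concrete, here is a minimal sketch of a lighter-weight extraction that keeps the nine fixed VCF columns plus one sample column using only awk. It is not the 1KG-recommended method and not my actual script; the function name extract_sample is made up for illustration, and it assumes the per-chromosome file is already downloaded and is fed in uncompressed on stdin. It keeps every site (including those where the sample is homozygous reference) and does not recalculate any INFO values.

```shell
#!/usr/bin/env bash
# Sketch only: extract_sample is a hypothetical helper, not part of tabix or vcftools.
# Reads an uncompressed VCF on stdin and prints columns 1-9 plus one sample column.
extract_sample() {
  awk -F'\t' -v OFS='\t' -v sample="$1" '
    /^##/ { print; next }                  # meta-information lines pass through unchanged
    /^#CHROM/ {                            # header line: locate the requested sample column
      for (i = 10; i <= NF; i++) if ($i == sample) col = i
      if (!col) { print "sample not found: " sample > "/dev/stderr"; exit 1 }
    }
    { print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col }   # header and data rows
  '
}

# Possible usage for one chromosome, assuming the file has been downloaded locally:
#   gzip -dc ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
#     | extract_sample NA12842 | bgzip -c > NA12842.1KG.chr1.vcf.gz
```

This still has to be repeated for every chromosome and the outputs merged afterwards, so it only addresses the cost of each individual subset, not the overall workflow.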
Am I missing something? I guess when the data was published nobody thought about the possibility of needing all the data for a single sample. Has anybody else run into the same problem?
Thanx,
Pablo.