SEQanswers

12-12-2012, 08:00 AM   #1
priesgo
Member
Location: Spain
Join Date: Aug 2012
Posts: 22
Select only one sample in 1000 Genomes whole genome VCF

Hi there,

First of all, I have to say how incredible the work being done by the 1KG project is. This post is not meant to criticize that work, only to ask for some insight into how best to use the dataset.

So, I'm trying to obtain a whole genome VCF file for only one sample at a time and it is ridiculously difficult and tedious. The variation data in 1KG is provided as one VCF file per chromosome containing the genotypes in all samples analyzed. These files are big even compressed: from 11G for chromosome 1 to 1.8G for chromosome 22.

The first approach provided by 1KG project is tabix + vcftools command line tools:
Code:
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | vcf-subset -c NA12842 | bgzip -c > NA12842.1KG.chr1.vcf.gz
This does not work very well as it relies on the network, so my first thought was to download the ALL.chr1... file. OK, but it still takes a long time for tabix to read the file and make a subset, and you have to do this for 24 chromosomes and then merge the results.
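The merge step mentioned above can be sketched in plain shell. This is only a minimal illustration, not the 1KG project's recommended procedure: `concat_vcfs` is a hypothetical helper name, and it assumes every per-chromosome file carries the same header and the same sample column. vcftools also ships a `vcf-concat` script for this job.

```shell
# Concatenate per-chromosome VCFs (gzip- or bgzip-compressed) into one stream.
# Header lines are taken from the first file only; records come from every
# file, in the order the files are given on the command line.
concat_vcfs() {
  gzip -dc "$1" | awk '/^#/'
  for f in "$@"; do
    gzip -dc "$f" | awk '!/^#/'
  done
}
```

A call such as `concat_vcfs NA12842.1KG.chr1.vcf.gz NA12842.1KG.chr2.vcf.gz | bgzip -c > NA12842.1KG.genome.vcf.gz` would then write a single compressed file (the output name is made up for the example).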

The second approach provided by the 1KG project is an online tool: the Data Slicer. It works all right, but only for small regions. As soon as you try to select a whole chromosome from a VCF file it does not seem to finish loading.


So, I downloaded all the VCF files and wrote a script to launch all the needed commands, but it takes days to output a whole genome VCF file for only one sample. It is not really a good solution.


Am I missing something? I guess when the data was published nobody thought of the possibility of needing the whole data by sample. Has anybody else gone through the same problem?



Thanx,
Pablo.

12-12-2012, 09:16 AM   #2
cwhelan
Member
Location: Cambridge, MA
Join Date: Nov 2010
Posts: 23

Try using the -e and -a options for vcf-subset. I believe that without those it still writes a line for every locus that has a variant, even if that variant is not present in the sample that you are subsetting on, causing very large files to be written. Also I'm not sure that you need tabix if you're operating on the whole chromosome file. I just did something similar to what you are attempting, although I was only extracting indels, and it did not take me days. I did something like this:

Quote:
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
    vcf-subset -e -a -c "$SAMPLE" "ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz" | bgzip -c > "ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.$SAMPLE.vcf.gz"
done
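To make the effect of dropping non-variant rows concrete, here is a rough awk sketch of the same idea. It is not how vcf-subset is implemented, and `subset_sample` is a hypothetical helper name. It keeps the nine fixed VCF columns plus one sample column and prints only records where that sample's GT carries a non-reference allele. Unlike `-a`, it does not trim unused ALT alleles, and it assumes GT is the first FORMAT subfield, as the VCF specification requires when GT is present.

```shell
# Read an uncompressed VCF on stdin and keep a single sample column.
# Usage: gzip -dc in.vcf.gz | subset_sample NA12814 | bgzip -c > out.vcf.gz
subset_sample() {
  awk -v s="$1" 'BEGIN { FS = OFS = "\t" }
    /^##/ { print; next }                      # meta lines pass through unchanged
    /^#CHROM/ {
      for (i = 10; i <= NF; i++) if ($i == s) col = i
      if (!col) { print "sample not found: " s > "/dev/stderr"; exit 1 }
      print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
      next
    }
    {
      split($col, f, ":")                      # f[1] is the GT subfield
      if (f[1] ~ /[1-9]/)                      # any non-reference allele index
        print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
    }'
}
```

Because it never parses the other sample columns, a filter like this may run faster than a full VCF parser, but its output should be checked against vcf-subset on a small region before relying on it.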

12-13-2012, 03:31 AM   #3
priesgo
Member
Location: Spain
Join Date: Aug 2012
Posts: 22

Hi,


Thanks for your response!
I made a mistake in the code I posted: I am actually already running vcf-subset with -e and -a, and without tabix. So we are at the same point.

I just launched the following command for chromosome 1 and sample NA12814:
Code:
vcf-subset -e -a -c NA12814 ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | bgzip -c > NA12814/exome_variants/NA12814.chr1.vcf.gz
1 hour 10 minutes and counting... Does it take this long in your case?


Pablo.

12-13-2012, 03:50 AM   #4
gsgs
Senior Member
Location: germany
Join Date: Oct 2009
Posts: 140

I can't handle files > 4G.

Maybe try the 2123 genomes (except _x, _y, _m) in the /omni directory? It's still difficult, tedious and lengthy to get a full genome. So far I only have the 2123 sequences for chromosome 17, SNP positions only (64155 of them), 1st allele, 136MB.

12-13-2012, 04:00 AM   #5
priesgo
Member
Location: Spain
Join Date: Aug 2012
Posts: 22

Hi there,

What is the 2123?

12-13-2012, 04:33 AM   #6
gsgs
Senior Member
Location: germany
Join Date: Oct 2009
Posts: 140

They have 2123 individuals there. It is also described in the paper's supplemental material, page 60, chapter 10.5:

http://www.nature.com/nature/journal...re11632-s1.pdf

It's a sub-project, as I understand it:
~2M SNPs for 2123 individuals

Last edited by gsgs; 12-13-2012 at 04:52 AM.

12-13-2012, 11:35 PM   #7
priesgo
Member
Location: Spain
Join Date: Aug 2012
Posts: 22

Thanks gsgs,
It seems a good alternative, but in my case I need data from NGS.

By the way, yesterday I launched a script like the one cwhelan proposed. It is now on chromosome 15 and has already been running for 21 hours and counting...


Pablo.

12-14-2012, 12:16 AM   #8
gsgs
Senior Member
Location: germany
Join Date: Oct 2009
Posts: 140

searching what NGS is ...

Wang, Y., Lu, J., Yu, J., Gibbs, R. & Yu, F. An integrative variant analysis
pipeline for accurate genotype/haplotype inference in population NGS
data. In revision (2012).

https://www.hgsc.bcm.edu/content/SNPTools

Do they have what you want?

NGS = next generation sequencing?

Last edited by gsgs; 12-14-2012 at 12:19 AM.

12-14-2012, 12:43 AM   #9
priesgo
Member
Location: Spain
Join Date: Aug 2012
Posts: 22

Mmmmm, sorry, I didn't explain myself well.

Yes, I was referring to Next Generation Sequencing. The OMNI data you proposed was obtained via Illumina OMNI microarray genotyping, while the main data comes from NGS machines (Illumina, SOLiD or 454 in this case).
For my purpose I need the data obtained from NGS.

About SNPTools: it seems a good tool for SNP calling, but it does not manage VCF files, so it does not really apply to this problem. In any case, I didn't know it, so I'll have a look at it.


Regards,
Pablo.

12-14-2012, 02:20 AM   #10
gsgs
Senior Member
Location: germany
Join Date: Oct 2009
Posts: 140

We get that giant (positions, sequences) matrix, but we need the transpose (sequences, positions).
Transposing a big matrix needs a lot of memory.

Maybe it is better to split the matrix into smaller parts and transpose them separately?!


"most difficult" says Wikipedia:
http://en.wikipedia.org/wiki/In-plac..._transposition
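One way to read "split into smaller parts" for an indexed VCF is to cut each chromosome into coordinate windows and subset each window on its own. The helper below only generates the window strings. `make_regions` is a hypothetical name, and the chromosome length and window size are values the caller has to supply.

```shell
# Print "chrom:start-end" windows covering positions 1..length.
# Usage: make_regions CHROM LENGTH WINDOW_SIZE
make_regions() {
  awk -v c="$1" -v len="$2" -v step="$3" 'BEGIN {
    for (s = 1; s <= len; s += step) {
      e = s + step - 1
      if (e > len) e = len        # clip the last window to the chromosome end
      printf "%s:%d-%d\n", c, s, e
    }
  }'
}
```

Each printed window can be passed as the region argument of `tabix -h file.vcf.gz REGION`, with the output piped into `vcf-subset -e -a -c SAMPLE` as elsewhere in this thread, so that several windows can be processed independently.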

Last edited by gsgs; 12-14-2012 at 02:25 AM.

12-14-2012, 08:39 AM   #11
cwhelan
Member
Location: Cambridge, MA
Join Date: Nov 2010
Posts: 23

It's strange that it is taking so long for you; I don't think it took as long for me. I ran each file in parallel on different cores, which is one way to speed it up.
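A minimal sketch of running the per-chromosome jobs in parallel, assuming an `xargs` that supports `-P` (GNU and BSD versions do) and assuming the input files are named as in priesgo's command above. `subset_parallel` is a hypothetical helper, and the output naming is only an example.

```shell
# Run one vcf-subset job per chromosome, at most NJOBS at a time.
# Usage: subset_parallel SAMPLE NJOBS CHROM...
subset_parallel() {
  sample=$1
  njobs=$2
  shift 2
  # One chromosome name per line; xargs starts a small sh script for each,
  # passing the chromosome as $1 and the sample name as $2.
  printf '%s\n' "$@" | xargs -P "$njobs" -I{} sh -c '
    in="ALL.chr$1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
    vcf-subset -e -a -c "$2" "$in" | bgzip -c > "$2.chr$1.vcf.gz"
  ' sh {} "$sample"
}
```

For example, `subset_parallel NA12814 4 1 2 3 4 5 6 7 8` would process chromosomes 1 to 8 with four jobs running at once. Each job still reads its whole input file, so disk throughput can become the limit before the core count does.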

12-14-2012, 09:42 PM   #12
priesgo
Member
Location: Spain
Join Date: Aug 2012
Posts: 22

gsgs, I am trying to process this data using existing tools such as vcftools, and I am avoiding getting into the guts of it. Anyway, this may apply to optimizing vcftools.

06-10-2013, 07:36 AM   #13
albireo
Member
Location: Europe
Join Date: Sep 2012
Posts: 39

Hi, I have the same problem as the OP. I understand how to slice the chromosome data using VCF tools, however I would be glad if you could explain how to stream the data directly to VCF tools without downloading the huge .vcf files first. If I understand correctly, tabix is useless here because I want data for the full chromosome, not for an interval.

Any ideas? Thanks in advance.
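One possible way to stream a whole remote file, sketched under the assumption that `curl` is available and that the server still serves the file at the given URL. `stream_subset` is a hypothetical helper name. It relies on the fact that bgzip output is valid multi-member gzip, so plain `gzip -dc` can decompress it as it arrives, and on vcf-subset reading from standard input when no file is given, as in the first post's pipeline.

```shell
# Stream a remote bgzipped VCF through vcf-subset without saving it locally.
# Usage: stream_subset URL SAMPLE > SAMPLE.chrN.vcf.gz
stream_subset() {
  # curl writes the compressed bytes to stdout; gzip decompresses the stream.
  curl -s "$1" | gzip -dc | vcf-subset -e -a -c "$2" | bgzip -c
}
```

Nothing is written to disk except the final output, but the full file is still transferred on every run, and a dropped connection means starting again from the beginning.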

Tags
1000genomes, variations, vcf
