**cwhelan** · 12-12-2012, 10:16 AM

Try using the -e and -a options for vcf-subset. I believe that without those it still writes a line for every locus that has a variant, even if that variant is not present in the sample that you are subsetting on, causing very large files to be written. Also I'm not sure that you need tabix if you're operating on the whole chromosome file. I just did something similar to what you are attempting, although I was only extracting indels, and it did not take me days. I did something like this:

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
    vcf-subset -e -a -c "$SAMPLE" ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | bgzip -c > ALL.chr${i}.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.$SAMPLE.vcf.gz
done

**priesgo** · 12-13-2012, 04:31 AM

Hi,

Thanks for your response!
Actually, the code I posted was wrong: I am already running vcf-subset with -e and -a, and I am not using tabix. So we are at the same point.

I just launched the following command for chromosome 1 and sample NA12814:

Code:

vcf-subset -e -a -c NA12814 ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | bgzip -c > NA12814/exome_variants/NA12814.chr1.vcf.gz

1 hour 10 minutes and counting... Did it take this long in your case?

Pablo.

**gsgs** · 12-13-2012, 04:50 AM

I can't handle files > 4 GB.

Maybe try the 2123 genomes (except _x, _y, _m) in the /omni directory? It's still difficult, tedious and slow to get a full genome. So far I only have the 2123 sequences for chromosome 17, 1st allele, at 64155 SNP positions only (136 MB).

**priesgo** · 12-13-2012, 05:00 AM

Hi there,

What is the 2123?

**gsgs** · 12-13-2012, 05:33 AM

They have 2123 individuals there. It's also described in the paper, on page 60 of the supplementary material, chapter 10.5:

http://www.nature.com/nature/journal/v491/n7422/extref/nature11632-s1.pdf

It's a sub-project, as I understood: ~2M SNPs for 2123 individuals.

**priesgo** · 12-14-2012, 12:35 AM

Thanks gsgs,
It seems a good alternative, but in my case I need data from NGS.

By the way, yesterday I launched a script like the one cwhelan proposed. It is now on chromosome 15 and has already been running for 21 hours and counting...

Pablo.

**gsgs** · 12-14-2012, 01:16 AM

Searching for what NGS is...

Wang, Y., Lu, J., Yu, J., Gibbs, R. & Yu, F. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. In revision (2012).

https://www.hgsc.bcm.edu/content/SNPTools

Do they have what you want?

NGS = next generation sequencing?

**priesgo** · 12-14-2012, 01:43 AM

Mmmmm, sorry, I didn't explain myself well.

Yes I was referring to Next Generation Sequencing. The thing is that the OMNI data you proposed is obtained via Illumina OMNI microarray genotyping, while the main data comes from NGS machines (Illumina, SOLiD or 454 in this case).
For my purpose I need the data obtained from NGS.

As for SNPTools, it looks like a good tool for SNP calling, but it does not handle VCF files, so it does not really apply to this problem. In any case, I didn't know about it, so I'll have a look at it.

Regards,
Pablo.

**gsgs** · 12-14-2012, 03:20 AM

We get that giant (positions, sequences) matrix, but we need the transpose (sequences, positions). Transposing a big matrix needs a lot of memory.

Maybe it is better to split the matrix into smaller parts and transpose them separately?

"most difficult" says wikipedia

In-place matrix transposition - Wikipedia

http://en.wikipedia.org/wiki/In-place_matrix_transposition
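For a single sample, the full transpose may not be needed: picking one sample amounts to streaming one column out of the (positions, sequences) matrix, line by line, in constant memory. Below is a minimal sketch of that idea, assuming a tab-separated VCF on stdin with GT as the first FORMAT subfield. `extract_sample` is a hypothetical helper name, not part of vcftools, and unlike `vcf-subset -a` it does not trim unused ALT alleles or update INFO fields.

```shell
#!/usr/bin/env bash
# Sketch: keep the 9 fixed VCF columns plus ONE sample column, streaming.
# extract_sample is a hypothetical helper, not a vcftools command.
extract_sample() {
    awk -F'\t' -v OFS='\t' -v s="$1" '
        /^##/ { print; next }              # meta-information lines pass through
        /^#CHROM/ {                        # header line: locate the sample column
            for (i = 10; i <= NF; i++) if ($i == s) col = i
            if (!col) { print "sample not found: " s > "/dev/stderr"; exit 1 }
            print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
            next
        }
        {
            split($col, f, ":")            # assumption: GT is the first subfield
            if (f[1] ~ /[1-9]/)            # keep rows with a non-reference allele
                print $1, $2, $3, $4, $5, $6, $7, $8, $9, $col
        }
    '
}
```

It could be used as `zcat in.vcf.gz | extract_sample NA12814 | bgzip -c > out.vcf.gz`, where `in.vcf.gz` and `out.vcf.gz` are placeholder file names.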

**cwhelan** · 12-14-2012, 09:39 AM

It's strange that it is taking so long for you; I don't think it took as long for me. I ran each file in parallel on different cores; that's one way to speed it up.
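Running one job per chromosome in parallel could look roughly like the sketch below. It assumes `xargs` supports `-P` (GNU and BSD versions do) and reuses the file naming from the loop earlier in the thread. `JOBS`, `DRY_RUN` and `subset_all` are illustrative names, not part of vcftools. By default the sketch only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/usr/bin/env bash
# Sketch: one vcf-subset job per chromosome, at most JOBS running at a time.
SAMPLE="${SAMPLE:-NA12814}"   # sample ID used elsewhere in the thread
JOBS="${JOBS:-4}"             # assumed number of cores; adjust to the machine
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = run them
CHROMS="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X"

subset_all() {
    export SAMPLE DRY_RUN     # the child shells read these from the environment
    # xargs starts one child shell per chromosome name, JOBS at a time
    printf '%s\n' $CHROMS | xargs -P "$JOBS" -n 1 sh -c '
        in="ALL.chr$1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
        out="${in%.vcf.gz}.$SAMPLE.vcf.gz"
        if [ "$DRY_RUN" = 1 ]; then
            echo "vcf-subset -e -a -c $SAMPLE $in | bgzip -c > $out"
        else
            vcf-subset -e -a -c "$SAMPLE" "$in" | bgzip -c > "$out"
        fi
    ' _
}
```

Calling `subset_all` prints (or runs) 23 pipelines, one per chromosome file, in whatever order the jobs finish.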

**priesgo** · 12-14-2012, 10:42 PM

gsgs, I am trying to process this data using existing tools such as vcftools, and I am avoiding getting into the guts of it. Anyway, this may apply to optimizing vcftools.

**albireo** · 06-10-2013, 07:36 AM

Hi, I have the same problem as the OP. I understand how to slice the chromosome data using VCF tools, however I would be glad if you could explain how to stream the data directly to VCF tools without downloading the huge .vcf files first. If I understand correctly, tabix is useless here because I want data for the full chromosome, not for an interval.

Any ideas? Thanks in advance.
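One possible way to avoid storing the full file is to pipe the download straight into vcf-subset. The sketch below is untested against the 1000 Genomes servers and rests on two assumptions: that `curl`, `gunzip`, `vcf-subset` and `bgzip` are on the PATH, and that vcf-subset reads uncompressed VCF from stdin when no input file is given. `stream_subset` is a hypothetical helper name, and the URL is whatever location hosts the chromosome file.

```shell
#!/usr/bin/env bash
# Sketch: stream a remote chromosome VCF through vcf-subset without saving it.
# Assumption: vcf-subset reads uncompressed VCF from stdin if no file is named.
stream_subset() {
    url="$1"      # location of the remote .vcf.gz (placeholder, supplied by caller)
    sample="$2"   # sample ID to keep, e.g. NA12814
    out="$3"      # path of the compressed per-sample output
    curl -sSf "$url" | gunzip -c | vcf-subset -e -a -c "$sample" | bgzip -c > "$out"
}
```

A call would look like `stream_subset "$URL" NA12814 NA12814.chr1.vcf.gz`. The whole file still travels over the network; only the local copy of the multi-sample VCF is avoided.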


Select only one sample in 1000 Genomes whole genome VCF
