Dear SEQanswers users,
I am trying to run case-control association tests based on the raw read data rather than on the genotype calls. The idea is to use samtools mpileup to stream indexed 1000 Genomes BAM files from the web and create VCF files that are then combined across cases and controls. We face two difficulties:
1- We are limited by the number of FTP connections, so the work cannot be made fully parallel (we were thinking of one gene per job, but that simply does not work).
2- Independently of the FTP issue, mpileup is very slow. A genome-wide case-control test is extremely time-consuming because of the computation behind mpileup.
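For concreteness, here is a rough Python sketch of the one-gene-per-job set-up described above, with a cap on the number of simultaneous jobs. It is only an illustration under assumptions: the function names, the reference file name `ref.fa`, the `.pileup` output names, and the example URLs and regions are all made up, and only the `-r` (region) and `-f` (reference) options of samtools mpileup are used. The options that produce BCF/VCF output differ between samtools versions, so that step is left out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def mpileup_command(region, bam_urls, reference="ref.fa"):
    """Build a samtools mpileup command restricted to one region.

    `reference` is a placeholder path; `bam_urls` are remote indexed BAMs.
    """
    return ["samtools", "mpileup", "-r", region, "-f", reference, *bam_urls]


def run_region(region, bam_urls, out_path, reference="ref.fa"):
    """Run mpileup for one region, write its output to out_path,
    and return the process exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            mpileup_command(region, bam_urls, reference), stdout=out
        )
    return result.returncode


def run_all(regions, bam_urls, max_jobs=2, runner=run_region):
    """Run one job per named region, with at most max_jobs at a time.

    `regions` maps a label (e.g. a gene name) to a region string.
    `runner` can be swapped out, e.g. for a dry run without samtools.
    """
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {
            name: pool.submit(runner, region, bam_urls, f"{name}.pileup")
            for name, region in regions.items()
        }
        return {name: fut.result() for name, fut in futures.items()}
```

A call would look like `run_all({"GENE_A": "1:1000-2000"}, ["ftp://example.org/sample1.bam"], max_jobs=2)`, where the label, region and URL are placeholders. Note that each job presumably opens one connection per BAM file, so with many samples even a small `max_jobs` may exceed the server's connection limit.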
Have others considered the same problem, and have solutions been found? Is basing the test on the published genotype calls really the only option? We are keen to go back to the BAM files if possible but it seems challenging.
Thank you in advance for your help,
Vincent