I'm trying to phase a specific genome against the 1000 Genomes data.
What I've done so far:
- fastq-dump --split-3: SRA -> FastQ
- FastQC
- trim_galore
- bowtie2: alignment against hg19
- samtools, bcftools: SAM/BAM -> VCF
- vcftools: VCF -> VCF for chr1 with minQ=50
- since the resulting file doesn't have rsIDs in it (the ID column contains dots), I downloaded snp138.txt.gz from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database and used it to fill in the missing rsIDs (I wrote a script for this; a few records in the VCF have no match in snp138, so I dropped them)
- vcf2beagle.jar: mydata.vcf -> mydata.bgl.gz, mydata.int, mydata.markers
- Downloaded the "reference" files for chr1 from this site: http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/ (the bigger one, ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz and its index).
- vcf2beagle.jar: reference.vcf -> reference.bgl.gz, reference.int, reference.markers.
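A minimal sketch of what the rsID fill-in step above could look like (not the exact script). It assumes the snp138 table is tab-separated with columns bin, chrom, chromStart (0-based), chromEnd, name, and that the VCF contains only chr1 records; the function names are made up for illustration, and only single-base entries are matched.

```python
def load_rsids(snp_rows, chrom="chr1"):
    """Map 1-based position -> rsID for single-base entries on one chromosome."""
    rsids = {}
    for row in snp_rows:
        fields = row.rstrip("\n").split("\t")
        # Assumed column order: bin, chrom, chromStart, chromEnd, name, ...
        row_chrom, start, end, name = fields[1], int(fields[2]), int(fields[3]), fields[4]
        # chromStart is 0-based, so a single-base SNP has end - start == 1
        # and its 1-based VCF position equals chromEnd.
        if row_chrom == chrom and end - start == 1:
            rsids.setdefault(end, name)  # keep the first rsID seen at a position
    return rsids


def annotate_vcf(vcf_lines, rsids):
    """Yield VCF lines with the ID column filled in; drop records with no rsID."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[2] == ".":
            if pos not in rsids:
                continue  # no match in snp138: drop the record
            fields[2] = rsids[pos]
        yield "\t".join(fields) + "\n"
```

Usage would be along the lines of opening snp138.txt.gz with gzip.open(..., "rt"), passing the file object to load_rsids, and writing the lines from annotate_vcf to a new VCF. Note that matching on position alone ignores alleles, so an rsID can end up attached to a variant whose alleles differ from the dbSNP entry.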
Now I'm trying to run this with
Code:
java -jar beagle.jar unphased=mydata.bgl.gz phased=reference.bgl.gz markers=reference.markers missing=? out=phased
Note the markers=reference.markers option. I also tried it with markers=mydata.markers, but I'm not sure which markers file I should be passing to Beagle (<-- question #1).
(I also experimented with like=mydata.bgl.gz; it didn't change anything.)
Either way, it fails after about 10 seconds with 'ERROR: Allele in data file but not in marker file: allele "C" for marker rs6603854', or with a different allele and marker, depending on which markers file I'm using. How do I fix this? (<-- question #2)
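To see how widespread the mismatch is, something like the sketch below could list every marker whose alleles in a .bgl file are not covered by a markers file. It assumes the Beagle 3 layouts as I understand them: markers lines of the form "id position allele1 allele2 ..." and genotype lines in the .bgl file starting with "M" followed by the marker id and then the alleles. The function names are hypothetical, and the .gz inputs would need to be opened with gzip.

```python
def load_marker_alleles(marker_lines):
    """Map marker id -> set of alleles listed for it in the markers file."""
    alleles = {}
    for line in marker_lines:
        fields = line.split()
        # Assumed layout: id, position, then one or more alleles.
        if len(fields) >= 3:
            alleles[fields[0]] = set(fields[2:])
    return alleles


def find_allele_mismatches(bgl_lines, marker_alleles, missing="?"):
    """Return (marker, extra_alleles) pairs; extra_alleles is None if the
    marker is absent from the markers file altogether."""
    problems = []
    for line in bgl_lines:
        fields = line.split()
        # Only genotype rows ("M <marker> <alleles...>") are checked.
        if len(fields) < 3 or fields[0] != "M":
            continue
        marker = fields[1]
        observed = set(fields[2:]) - {missing}  # ignore the missing-data code
        if marker not in marker_alleles:
            problems.append((marker, None))
            continue
        extra = observed - marker_alleles[marker]
        if extra:
            problems.append((marker, sorted(extra)))
    return problems
```

Running this once with mydata.bgl.gz against reference.markers and once with reference.bgl.gz against mydata.markers should show whether the two datasets simply disagree on alleles at some rsIDs, which is my guess at what the error is complaining about.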
Sometimes (I couldn't reproduce this just now), it fails with "number of fields in the .gprobs file is not divisible by 3", even though I don't supply any .gprobs file; as far as I can tell, that file is only generated by Beagle itself. What does this even mean? (<-- question #3)
Also, am I even using the correct files (are they all for hg19?) and the correct tools? (<-- question #4)
Any help will be much appreciated!