Will a split alignment covering a large indel report the intervening bases as gaps, or '-' alleles, in an mpileup output from that alignment?
I have two pooled sequencing samples that represent a mix of two genomes. I want to determine the relative allele frequencies in the two pools. The two genomes have extensive structural variation that I would like to avoid ignoring in the analysis.
I generated a mixed sequence by aligning the genomes, and generating a new reference in which any gaps in one sequence in the alignment are replaced by the corresponding allele in the other genome.
I then aligned the reads from the pools to the mixed reference using bwa to allow for split alignments, followed by running mpileup and determining allele frequencies (ended up using a script from popoolation to parse the vcf and get these values). I filtered out anything below Q20 mapping quality prior to determining allele frequencies.
However, whenever I have a site that is in a large indel that has a base in Genome A, but no base in Genome B, the '-' alleles are not being reported.
Does the alignment contain this information? Are the '-' alleles not actually recorded when a read has a split alignment?
Short indels appear to be working correctly, but not long ones. Do I need to be using a different set of tools to analyze these regions separately from the SNPs and short indels?
Any help would be greatly appreciated.
I have two pooled sequencing samples that represent a mix of two genomes. I want to determine the relative allele frequencies in the two pools. The two genomes have extensive structural variation that I would like to avoid ignoring in the analysis.
I generated a mixed sequence by aligning the genomes, and generating a new reference in which any gaps in one sequence in the alignment are replaced by the corresponding allele in the other genome.
I then aligned the reads from the pools to the mixed reference using bwa to allow for split alignments, followed by running mpileup and determining allele frequencies (ended up using a script from popoolation to parse the vcf and get these values). I filtered out anything below Q20 mapping quality prior to determining allele frequencies.
However, whenever I have a site that is in a large indel that has a base in Genome A, but no base in Genome B, the '-' alleles are not being reported.
Does the alignment contain this information? Are the '-' alleles not actually recorded when a read has a split alignment?
Short indels appear to be working correctly, but not long ones. Do I need to be using a different set of tools to analyze these regions separately from the SNPs and short indels?
Any help would be greatly appreciated.
Comment