Hi,
I am working on a project to identify new SNPs in a poorly studied organism. I have a set of next-generation sequencing (NGS) reads that I have aligned against a reference genome, and I want to fine-tune the alignment around potential indels to minimize the number of artifact SNPs detected (i.e. SNPs that appear to be present only because of sequence misalignment around indels). I have run the GATK RealignerTargetCreator and IndelRealigner tools a number of times under different settings. However, I am still seeing alignment problems that will lead to incorrect SNP calls, especially around repetitive regions.
I understand that aligning sequences around indels and repetitive regions is notoriously difficult for computer programs. That said, can anyone advise me on how to get the best targeted realignment results out of GATK, e.g. settings that work particularly well? I am looking at about 50,000 regions, so manually editing alignments is not practical.
Cheers
Gwilymh