Happy New Year 2014, all,
We have been running some tests with the new GRCh38 and our aligner, Novoalign. It seems there are some alternate sequences in there, which I'm assuming are included for completeness' sake.
Are there any recommendations for how we should be aligning to this new build, i.e., do we stick with the methods used in GATK best practices, where we align to the complete chromosome sequences?
Based on some of our own testing, we see that these alternate assembly sequences total 111 Mbp and cover 3.6% of the human genome. The big question: should these be used in the search space?
So we aligned some exome reads against the full GRCh38 and against GRCh38 without the alternate assembly sequences.
Without alt sequences:
# Pairs Aligned: 11,939,446
# Unique Alignment: 22,945,232
With alt sequences:
# Pairs Aligned: 11,942,662
# Unique Alignment: 22,278,528
So including the alt sequences aligned about 0.03% more pairs (3,216); however, there were about 3% fewer unique alignments (666,704 reads that now multi-map), which is pretty high considering the alternate sequences are only 3.6% of the genome. These multi-mappers are going to result in low-quality alignments and seriously impact variant calling for these regions.
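For anyone who wants to check the percentages, here is a quick sketch of the arithmetic using the four counts reported above. The variable and function names are just for illustration, and it assumes the drop in unique alignments corresponds to reads that became multi-mapped.

```python
# Counts copied from the two runs reported above.
pairs_without_alt = 11_939_446
pairs_with_alt = 11_942_662
unique_without_alt = 22_945_232
unique_with_alt = 22_278_528


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Extra pairs aligned when the alt sequences are included.
extra_pairs = pairs_with_alt - pairs_without_alt
extra_pairs_pct = percent_of(extra_pairs, pairs_without_alt)

# Unique alignments lost when the alt sequences are included
# (assumed here to be reads that now multi-map).
lost_unique = unique_without_alt - unique_with_alt
lost_unique_pct = percent_of(lost_unique, unique_without_alt)

print(f"Extra pairs aligned: {extra_pairs:,} ({extra_pairs_pct:.2f}%)")
print(f"Unique alignments lost: {lost_unique:,} ({lost_unique_pct:.1f}%)")
```

So the gain in aligned pairs is roughly two orders of magnitude smaller than the loss in unique alignments.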
So for now we feel that not including the alternate sequences may be a better approach. We would like to get a feel for what the general consensus is among the community. Are there any papers released by the Broad or some of the other centers that address this?
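In case it helps anyone reproduce the "without alt" reference, below is a minimal sketch of dropping the alternate sequences from a FASTA file. It assumes the alt records can be recognised by a `_alt` suffix on the sequence name (as in UCSC-style names such as `chr1_KI270762v1_alt`); that is an assumption about your particular FASTA, so check the headers first. The function name `drop_alt_contigs` is made up for this example.

```python
def drop_alt_contigs(fasta_lines, alt_suffix="_alt"):
    """Yield FASTA lines, skipping records whose name ends with alt_suffix.

    Assumption: alt sequences are identifiable by a name suffix; adjust
    alt_suffix (or the test below) to match your reference's naming.
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = not name.endswith(alt_suffix)
        if keep:
            yield line


# Tiny in-memory demo; for a real file, pass an open file handle instead.
demo = [
    ">chr1\n", "ACGT\n",
    ">chr1_KI270762v1_alt\n", "GGCC\n",
    ">chr2\n", "TTAA\n",
]
filtered = list(drop_alt_contigs(demo))
print("".join(filtered), end="")
```

Because it streams line by line, the same function should work on a full reference without loading it into memory.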