I have been trying to map genomic Illumina data to a divergent reference in order to get the sequence of exons in my sample. The idea is that although the introns and intergenic regions may be too variable to map, the coding sites are more conserved and may be alignable.
It seems like a task that someone else may have tackled, but I can't find any reference to it.
I am expecting about 10% of sites to differ in exons (but ~30% outside of exons). So far, I have used BWA and STAMPY and I can recover about 50% of the exome, but I haven't been able to get the rest. I worry that reads spanning exon-intron boundaries are hard to map because only part of the read will align well, and its mate may not map at all if it's out in the intergenic DNA or in an intron.
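To put rough numbers on the boundary worry, here is a back-of-the-envelope sketch in Python. The 10% and 30% divergence figures are the ones above; the 100 bp read length in the usage lines and the function name are assumptions for illustration only, and the calculation ignores indels and treats sites as independent.

```python
def expected_mismatches(read_len, exonic_bases, exon_div=0.10, noncoding_div=0.30):
    """Expected number of mismatching sites in a read that partly overlaps an exon.

    read_len      -- total read length in bases (hypothetical value, not from my data)
    exonic_bases  -- how many of those bases fall inside the exon
    exon_div      -- expected per-site divergence inside exons (10% as stated above)
    noncoding_div -- expected per-site divergence outside exons (~30% as stated above)
    """
    if not 0 <= exonic_bases <= read_len:
        raise ValueError("exonic_bases must be between 0 and read_len")
    noncoding_bases = read_len - exonic_bases
    return exon_div * exonic_bases + noncoding_div * noncoding_bases


if __name__ == "__main__":
    # Assumed 100 bp reads: fully exonic, half exonic, fully noncoding.
    for exonic in (100, 50, 0):
        print(exonic, round(expected_mismatches(100, exonic), 1))
```

Under these assumptions a read that is only half exonic is expected to carry about twice as many mismatches as a fully exonic one, which is consistent with boundary reads being the ones that fail to map.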
Some things I have been toying with:
(1) Masking noncoding DNA
(2) Converting obvious differences in the reference to reduce divergence and repeating the alignment
(3) de novo assembly of short contigs and then BLASTing them against reference exons.
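For (1), this is a minimal sketch of what I mean by masking, not a tested pipeline. It assumes the reference sequence is already in memory as a string and that exon coordinates are 0-based, half-open (start, end) pairs; the function name and the optional `flank` parameter are my own illustration. The flank is there in case keeping a few bases either side of each exon helps boundary reads, which is a guess rather than something I have verified.

```python
def mask_noncoding(sequence, exons, flank=0, mask_char="N"):
    """Return a copy of `sequence` with everything outside the exons masked.

    sequence  -- reference sequence as a string
    exons     -- iterable of (start, end) pairs, 0-based and half-open (assumed convention)
    flank     -- number of bases to leave unmasked on each side of every exon
    mask_char -- character used for masked positions
    """
    keep = [False] * len(sequence)
    for start, end in exons:
        # Clamp the padded interval to the sequence bounds.
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else mask_char for base, kept in zip(sequence, keep))


if __name__ == "__main__":
    toy_reference = "ACGTACGTAC"   # toy sequence, not real data
    toy_exons = [(2, 5)]           # one hypothetical exon covering positions 2..4
    print(mask_noncoding(toy_reference, toy_exons))
    print(mask_noncoding(toy_reference, toy_exons, flank=1))
```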
If anyone has experience with a similar task, I would be glad to hear what worked and what didn't.