How should one handle alignment and variant discovery when the reference sequence isn't quite the baseline you want?

Specifically, I have sequencing data at several time points from an experimentally evolved population of yeast. The strain is YPH500, which has no published reference genome, so I've been using the standard S288C reference. The two are very close in most places, but there are numerous loci where the strains differ. So when I align the reads to S288C, there are of course many mismatches. Some are due to evolution during our experiment (these are my main interest), and some are just pre-existing differences between YPH500 and S288C (these are not).

Are there any standard strategies for dealing with this situation? Currently I'm thinking of simply filtering out any variants in loci that appear to have major strain differences. This seems like a decent conservative approach, but I could lose interesting variants in the process.
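To make the filtering idea concrete, here is a rough sketch of what I have in mind. It is not an established pipeline. It assumes I already have alt-allele frequencies per variant at each time point (for example, parsed from VCF files), and the function name `filter_by_locus`, the 0.9 "fixed" threshold, and the 50 bp window are all placeholder choices of mine.

```python
from typing import Dict, List, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def filter_by_locus(
    freqs: Dict[Variant, List[float]],
    fixed_threshold: float = 0.9,  # assumed cutoff for "present in the starting strain"
    window: int = 50,              # assumed size (bp) of the masked region around a strain difference
) -> Tuple[List[Variant], List[Variant]]:
    """Split variants into likely strain differences and retained candidates.

    freqs maps each variant to its alt-allele frequency at every time point,
    earliest first (0.0 where the variant was not called). Each list must be
    non-empty.
    """
    # Near-fixed at every time point: most likely a YPH500 vs. S288C difference.
    strain_diffs = [v for v, series in freqs.items() if min(series) >= fixed_threshold]
    strain_set = set(strain_diffs)

    kept = []
    for variant in freqs:
        if variant in strain_set:
            continue
        chrom, pos = variant[0], variant[1]
        # Conservative step: drop anything close to a putative strain difference.
        near_diff = any(
            c == chrom and abs(p - pos) <= window for c, p, _, _ in strain_diffs
        )
        if not near_diff:
            kept.append(variant)
    return sorted(strain_diffs), sorted(kept)


# Made-up example data, three time points per variant.
example = {
    ("chrI", 1000, "A", "G"): [1.0, 0.98, 1.0],   # looks like a strain difference
    ("chrI", 1020, "C", "T"): [0.0, 0.3, 0.8],    # rising, but within 50 bp of the one above
    ("chrII", 5000, "G", "A"): [0.0, 0.1, 0.6],   # rising, in a clean region
}
strain_diffs, kept = filter_by_locus(example)
```

With this made-up data, the chrI:1020 variant is discarded even though its frequency rises over time, which is exactly the kind of loss I'm worried about.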
Thanks in advance for any suggestions!