I am trying to understand TopHat's --prefilter-multihits parameter. The documentation's description of it is quoted below.
I ran TopHat 1.4.1 (the last version before 2) and TopHat 2.0.9 on the same sequences, with only the --GTF option set. TopHat 2.0.9 mapped more reads, but in both versions about 20% of aligned bases were intronic or intergenic. When I add --prefilter-multihits, TopHat 1.4.1 produces very similar results (~1% fewer mapped reads), which seems reasonable to me. With TopHat 2.0.9, however, I lose over half the reads. That seems like a lot, although it is possible that they are all multi-mapped. More importantly, less than 1% of the aligned reads are now intergenic or intronic.
Two questions:
1) Why such a huge difference in behavior between the two versions? As far as I can tell, this option was not altered for version 2.
2) Why does this parameter eliminate essentially all reads outside the transcriptome for TopHat 2.0.9?
For reference, the documentation describes --prefilter-multihits as follows: "When mapping reads on the transcriptome, some repetitive or low complexity reads that would be discarded in the context of the genome may appear to align to the transcript sequences and thus may end up reported as mapped to those genes only. This option directs TopHat to first align the reads to the whole genome in order to determine and exclude such multi-mapped reads (according to the value of the -g/--max-multihits option)."
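As a toy illustration of how that description reads, here is a minimal Python sketch of the prefilter logic. It is an assumption about the behavior based only on the quoted text, not TopHat's actual implementation; the function names and the read IDs are hypothetical.

```python
from collections import Counter


def find_multihit_reads(genome_hit_read_ids, max_multihits):
    """Return the IDs of reads with more than max_multihits genome alignments.

    genome_hit_read_ids holds one entry per genome alignment, so a read
    that aligns three times appears three times.
    """
    hit_counts = Counter(genome_hit_read_ids)
    return {read_id for read_id, n in hit_counts.items() if n > max_multihits}


def drop_excluded(read_ids, excluded):
    """Keep only the reads that were not flagged by the genome prefilter."""
    return [read_id for read_id in read_ids if read_id not in excluded]


# Hypothetical genome alignments: r2 aligns three times, r1 and r3 once each.
genome_hits = ["r1", "r2", "r2", "r2", "r3"]
excluded = find_multihit_reads(genome_hits, max_multihits=2)

# Only the surviving reads would go on to the transcriptome mapping step.
kept = drop_excluded(["r1", "r2", "r3"], excluded)
print(kept)
```

Under this reading, the prefilter should only remove reads that exceed the -g/--max-multihits threshold on the genome; it would not by itself remove uniquely mapped intronic or intergenic reads.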