Several members have noted strange bugs in the TopHat accepted_hits.sam file. One bug mentioned by many people (see here, here and here) is that the MRNM field (7th column) in the SAM file is reported as '=' even when the mate is mapped to a different reference sequence (e.g. a different chromosome). This contradicts the SAM format definition, where '=' means the mate lies on the same reference as the read itself, and it can cause unpredictable behavior in any downstream tool that trusts this field.
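For anyone who wants to check their own accepted_hits.sam for this, here is a minimal Python sketch. It is not part of TopHat or samtools; the function name `find_mrnm_mismatches` is made up for illustration. It assumes tab-separated SAM records and at most one reported alignment per mate (reads with several reported alignments would give false positives). The two test records are the ones from the example further down, with SEQ and QUAL replaced by `*` to keep them short.

```python
from collections import defaultdict


def find_mrnm_mismatches(sam_lines):
    """Return read names whose records say MRNM '=' although the
    records of that read name sit on more than one reference.

    Assumption: at most one reported alignment per mate.
    """
    refs_by_read = defaultdict(set)   # QNAME -> set of RNAME values seen
    claims_same_ref = set()           # QNAMEs with MRNM == '=' on some record

    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        qname, rname, mrnm = fields[0], fields[2], fields[6]
        refs_by_read[qname].add(rname)
        if mrnm == "=":
            claims_same_ref.add(qname)

    return sorted(q for q in claims_same_ref if len(refs_by_read[q]) > 1)


# The two records from this post, SEQ and QUAL abbreviated to '*'.
records = [
    "HWUSI-EAS697:1:73:6760:7084#0\t147\tchr1\t21032\t255\t76M\t=\t21062\t0\t*\t*\tNM:i:0\tNH:i:1",
    "HWUSI-EAS697:1:73:6760:7084#0\t99\tchr9\t21062\t255\t76M\t=\t21032\t0\t*\t*\tNM:i:0\tNH:i:1",
]

for name in find_mrnm_mismatches(records):
    print(name)
```

To run it on a real file, pass an open file handle instead of the list, e.g. `find_mrnm_mismatches(open("accepted_hits.sam"))`.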
I recently tested the newest TopHat (v. 1.0.14) and the error is still present, even though the authors seem to be aware of the issue!
I also noticed some other strange behavior, which I will show with an example:
HWUSI-EAS697:1:73:6760:7084#0 147 chr1 21032 255 76M = 21062 0 GGGGAGAGAGTCTCTCCCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCATGTTTAAGGGGACGGGTTCAAAGCTGG CB>DBD@=:@=D;CACACDBDDDADCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0 NH:i:1
HWUSI-EAS697:1:73:6760:7084#0 99 chr9 21062 255 76M = 21032 0 CAGGAGCTCACCTGCCTGCGTCACTGGGCACAGACGCCAGTGAGGCCAGAGGCCGGGCTGTGCTGGGGCCTGAGCT CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDCBCCCC=@C8?7<?CAA4C?A=A@? NM:i:0 NH:i:1
Here you see the two mates of one pair, each apparently uniquely mapped (NH:i:1), one to chr1 and the other to chr9. Obviously the MRNM problem is still there, but I also got a bit suspicious about the alignments themselves. If I BLAT both sequences, I get these results (only best hits shown):
Blat-result 1st seq:
1 76 76 100.0% 9 + 21145 21220 76
1 76 76 100.0% 19 + 62640 62715 76
1 76 76 100.0% 1 + 21032 21107 76
Blat-result 2nd seq:
1 76 76 100.0% 15 - 102510140 102510215 76
1 76 76 100.0% 9 + 21062 21137 76
As you can see, the positions reported in the SAM output do correspond to BLAT hits. However, what happened to the other hits? In this case the correct pairing should clearly be on chr9 only, where both mates have a perfect hit starting just 83 bp apart, and not between chr1 and chr9! Does anyone else see this issue? It appears to be very prevalent, and it almost looks as if TopHat just picks one of the hits at random for each mate.
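To make the expected pairing explicit, here is a rough Python sketch of how one could pick the consistent combination from the candidate hits. This is only my illustration of the reasoning, not how TopHat works internally. I read the BLAT columns above as chromosome, start and end (columns 5, 7 and 8); the function name `concordant_pairs` and the `max_span` cutoff of 1000 bp are assumptions chosen for the example, and strand orientation is deliberately not checked.

```python
from itertools import product


def concordant_pairs(hits1, hits2, max_span=1000):
    """Return (chrom, start1, start2, span) for every combination of hits
    that lies on the same chromosome within max_span bases.

    hits1, hits2: lists of (chrom, start, end) tuples, one list per mate.
    max_span: hypothetical cutoff on the outer distance of the pair.
    """
    pairs = []
    for (chrom1, start1, end1), (chrom2, start2, end2) in product(hits1, hits2):
        if chrom1 != chrom2:
            continue  # mates on different chromosomes are never concordant
        span = max(end1, end2) - min(start1, start2) + 1
        if span <= max_span:
            pairs.append((chrom1, start1, start2, span))
    return pairs


# Best BLAT hits from this post: (chromosome, start, end)
hits_seq1 = [("9", 21145, 21220), ("19", 62640, 62715), ("1", 21032, 21107)]
hits_seq2 = [("15", 102510140, 102510215), ("9", 21062, 21137)]

for chrom, start1, start2, span in concordant_pairs(hits_seq1, hits_seq2):
    print(f"chr{chrom}: mate starts {start1} and {start2}, outer span {span} bp")
```

With the hits listed above, the only combination that survives is the one on chr9, which is the pairing I would have expected in the SAM output.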