Hello,
I am starting to use TopHat (latest build) to map human RNA-seq reads, and I am trying to understand some of the results. Many reads are reported by TopHat as "unique" (based on both the maximum MAPQ score of 50 and the tag NH:i:1), yet when I simply BLAT the sequences, I see equal or better alignments at many (>20) locations in the human genome (hg19), the same assembly I am aligning to. Here is an example of the two ends of one read pair:
HWI-ST1220:175:C9Q45TEXX:1:1303:19652:99794 163 chr20 25165769 50 51M = 25165847 129 TTTTCTTTAAGAATGTTAAATATTGGCCCCCACTCTCTTCTGGCTTGTAGG CCCFFFFFHHHHHJJHHJJJJIJJIJJJJJJJJJJJJJJJJJJJJJGHGGE AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:23C26A0 YT:Z:UU XS:A:- NH:i:1
HWI-ST1220:175:C9Q45TEXX:1:1303:19652:99794 83 chr20 25165847 50 51M = 25165769 -129 CTGATGGGCTTCCCGTTGTGGGTAACCCGACCTTTCTCTCTGGCTGCCCTT HHHIGJJJIJJJJHHJJJJJJJIIHJJIIHJJJJIJJJHHHHHFFFFFCCB AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:14T31G4 YT:Z:UU XS:A:- NH:i:1
It does not appear that mapping both ends together would produce a clearer unique result. I see many regions with numerous apparently unique aligned reads, yet the mappability of these regions is very low, and when I go back and look at individual sequences, I find that they in fact align (using BLAT) to many locations. I understand that TopHat has to place multiply-mapped reads somewhere, but is there a better method to determine the confidence of correct placement for each read? (I am using the default parameters for alignment.)
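For reference, this is roughly how I am reading the "unique" call off each record: a minimal Python sketch, not anything TopHat itself provides. The helper names `parse_sam_record` and `reported_unique` are my own, the threshold of 50 is simply the maximum MAPQ I see in this output, and the line is split on any whitespace because the records as pasted here are space-separated (a real SAM file is tab-delimited).

```python
def parse_sam_record(line):
    """Split one SAM alignment line into a few core fields plus its optional tags."""
    fields = line.split()  # assumption: no field contains embedded whitespace
    tags = {}
    for item in fields[11:]:
        # Optional fields have the form TAG:TYPE:VALUE, e.g. NH:i:1
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "tags": tags,
    }


def reported_unique(rec, max_mapq=50):
    """True when the record carries both 'unique' signals: top MAPQ and NH == 1."""
    return rec["mapq"] == max_mapq and rec["tags"].get("NH") == 1


# First mate of the example pair above
example = (
    "HWI-ST1220:175:C9Q45TEXX:1:1303:19652:99794 163 chr20 25165769 50 51M = "
    "25165847 129 TTTTCTTTAAGAATGTTAAATATTGGCCCCCACTCTCTTCTGGCTTGTAGG "
    "CCCFFFFFHHHHHJJHHJJJJIJJIJJJJJJJJJJJJJJJJJJJJJGHGGE "
    "AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:23C26A0 YT:Z:UU XS:A:- NH:i:1"
)

rec = parse_sam_record(example)
print(rec["rname"], rec["pos"], rec["mapq"], rec["tags"]["NH"], reported_unique(rec))
```

Both mates pass this check, which is exactly what puzzles me given the BLAT hits.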
Thanks for everyone's help...