I am wondering if anyone has ever encountered similar problem. I used the latest version of Bismark for my data(pair end 100bp). After a closer look at the output sam file, I found there are some reads which is pretty short like this:
MWR-PRG-0014:95: D1HU8ACXX:6:1101:1417:2210_1:N:0:GCCAAT/1 115 chr5 75301330 255 100M = 75301318 -112 CAACCAAACAAAATACAACTATAAACTAAACTCTACCCATAAAACGCCTATTTACAAACTCCGTCCCCTAACCTAAAAAAACTAAAAAACCATTTCAAAC #B?@ACCBCDCB@AAADDC>>:
C@CCDA@:C:?>5BACCC=933CEBEEEEEE?HC=(A6DGGEEGIHEDBHCCFIIIIIGIHDC1EA4FHHDDD?D?;@ NM:i:23 XX:Z:2G14G2G1G1G2GG5G6GG6G3G7T7G4GGGG5GGGG10G2
XM:Z:..x..............x..x.h.h..xh.....x......hh..Z...x...h........X......x....xhhh.....xhhh..........x.. XR:Z:CT XG:Z:G
A
MWR-PRG-0014:95: D1HU8ACXX:6:1101:1417:2210_1:N:0:GCCAAT/2 179 chr5 75301318 255 100M = 75301330 -112 TCCCTCTCTCTACAACCAAACAAAATACAACTATAAACTAAACTCTACCCATAAAACGCCTATTTACAAACTCCGTCCCCTAACCTAAAAAAACTAAAAA @C@FFFDFHHFDHEEGIIGIJJ
JJJ<FGICFGJJIIGJIJJGJ@HIIIJJGGGIJEIBEHHIIIJIHEEEHHDFFFEACDD?CDDDDCD3@BDBDCC>CC NM:i:23 XX:Z:5T8G14G2G1G1G2GG5G6GG6G3G7T7G4GGGG5GGGG1-XM:Z:..............x..............x..x.h.h..xh.....x......hh..Z...x...h........X......x....xhhh.....xhhh. XR:Z:GA XG:Z:GA
which start at 75301318 and end at 75301330 at chromosome 5. (This is for Mmusculus mm9).
But if if get the exact sequence with mm9 by library BSgenome in R:
getSeq(Mmusculus, names= "chr5", start=75301318, end=75301330,as.character=TRUE)
[1] "TCCCTTTCTCTAC"
It seems it doesn't overlap with any bases in the read, then why is that map to the position.
Anyone knows the answer? I am curious to know. Thanks!
MWR-PRG-0014:95: D1HU8ACXX:6:1101:1417:2210_1:N:0:GCCAAT/1 115 chr5 75301330 255 100M = 75301318 -112 CAACCAAACAAAATACAACTATAAACTAAACTCTACCCATAAAACGCCTATTTACAAACTCCGTCCCCTAACCTAAAAAAACTAAAAAACCATTTCAAAC #B?@ACCBCDCB@AAADDC>>:
C@CCDA@:C:?>5BACCC=933CEBEEEEEE?HC=(A6DGGEEGIHEDBHCCFIIIIIGIHDC1EA4FHHDDD?D?;@ NM:i:23 XX:Z:2G14G2G1G1G2GG5G6GG6G3G7T7G4GGGG5GGGG10G2
XM:Z:..x..............x..x.h.h..xh.....x......hh..Z...x...h........X......x....xhhh.....xhhh..........x.. XR:Z:CT XG:Z:G
A
MWR-PRG-0014:95: D1HU8ACXX:6:1101:1417:2210_1:N:0:GCCAAT/2 179 chr5 75301318 255 100M = 75301330 -112 TCCCTCTCTCTACAACCAAACAAAATACAACTATAAACTAAACTCTACCCATAAAACGCCTATTTACAAACTCCGTCCCCTAACCTAAAAAAACTAAAAA @C@FFFDFHHFDHEEGIIGIJJ
JJJ<FGICFGJJIIGJIJJGJ@HIIIJJGGGIJEIBEHHIIIJIHEEEHHDFFFEACDD?CDDDDCD3@BDBDCC>CC NM:i:23 XX:Z:5T8G14G2G1G1G2GG5G6GG6G3G7T7G4GGGG5GGGG1-XM:Z:..............x..............x..x.h.h..xh.....x......hh..Z...x...h........X......x....xhhh.....xhhh. XR:Z:GA XG:Z:GA
which start at 75301318 and end at 75301330 at chromosome 5. (This is for Mmusculus mm9).
But if if get the exact sequence with mm9 by library BSgenome in R:
getSeq(Mmusculus, names= "chr5", start=75301318, end=75301330,as.character=TRUE)
[1] "TCCCTTTCTCTAC"
It seems it doesn't overlap with any bases in the read, then why is that map to the position.
Anyone knows the answer? I am curious to know. Thanks!
Comment