"There are 10 types of people in this world: those who assimilated binary numbers and those who didn't."
I definitely belong to the 10'th type, and hence SAM flags are a chore. They may be a very compact way of communicating a lot of information about an alignment, but how do we humans learn them? I know it is kind of nerdy to actually look through SAM files, but what can I say? Mea culpa.
Anyway, this post is my attempt to understand them like a natural language, i.e. to recognize some idiomatic representations in flags. If you already know these, you are a "binar" and way ahead of us humans on this topic.
You can use this handy little web page for specific flags:
However, to "speak SAM", we must know these flags without having to refer to a web page for each line. So, here are some simple idioms.
Unpaired Reads
For unpaired reads, the flags are very easy to recognize because there are only 3 values:
- 4 - 0000000100 - means "this is an unpaired read and is not mapped".
- 16 - 0000010000 - "this unpaired read is mapped in the reverse orientation".
- 0 - 0000000000 - "this unpaired read is mapped in the forward orientation".
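To see where these three values come from, here is a minimal sketch in Python that splits a flag into its bits. The bit values are the ones defined in the SAM specification; the short names in `FLAG_BITS` and the helper names `decode_flag` and `as_binary` are my own labels, not part of any library.

```python
# Bit values come from the SAM specification; the names are illustrative labels.
FLAG_BITS = {
    1: "paired",
    2: "proper_pair",
    4: "unmapped",
    8: "mate_unmapped",
    16: "reverse",
    32: "mate_reverse",
    64: "first_in_pair",
    128: "second_in_pair",
    256: "secondary",
    512: "qc_fail",
    1024: "duplicate",
    2048: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def as_binary(flag, width=10):
    """Zero-padded binary string, matching the 10-digit style used in this post."""
    return format(flag, f"0{width}b")


# The three unpaired idioms: unmapped, reverse, and forward (no bits set at all).
for flag in (4, 16, 0):
    print(flag, as_binary(flag), decode_flag(flag))
```

An empty list for flag 0 is the point: "mapped in the forward orientation" is simply the absence of every bit.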
Paired Reads
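For paired reads, the 0'th bit (value 1) HAS to be set. Hence all flags for paired reads HAVE to be odd. In other words, an even flag always describes an unpaired read, and nearly all even-numbered flags below 256 other than the above three (0, 4 and 16) are meaningless, because they set mate-related bits on a read that has no mate. The higher bits (256, 512, 1024 and 2048, marking secondary, QC-failed, duplicate and supplementary alignments) can be added on top of any flag and are ignored in this post. (Good progress. We can recognize nonsense words. Writing a Jabberwocky poem with these flags is left as an exercise for the reader.)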
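For paired reads, and ignoring the higher bits (256 and up), odd flags in the interval [65-127] have the 64 bit set and relate to the first read of a pair, while odd flags in [129-191] have the 128 bit set and relate to the second read. Odd flags in [193-255] set both bits at once, which the SAM specification uses for a read that is neither first nor last in a template with more than two segments, so they should not turn up in ordinary paired-end data.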
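Rather than memorizing intervals, you can test the two bits directly. The sketch below is one way to do it; `which_read` is a hypothetical helper name. It reports flags with both the 64 and 128 bits set as "ambiguous", on the assumption that you are looking at ordinary two-read pairs, where the SAM specification does not expect both bits together.

```python
PAIRED = 1    # read is part of a pair
FIRST = 64    # first read in the pair
SECOND = 128  # second read in the pair


def which_read(flag):
    """Classify a SAM flag as 'unpaired', 'first', 'second' or 'ambiguous'."""
    if not flag & PAIRED:
        return "unpaired"
    is_first = bool(flag & FIRST)
    is_second = bool(flag & SECOND)
    if is_first and not is_second:
        return "first"
    if is_second and not is_first:
        return "second"
    # Both bits set, or neither: not something a two-read pair should produce.
    return "ambiguous"


for flag in (0, 65, 99, 129, 147, 193):
    print(flag, which_read(flag))
```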
"All Good"
Some values mean "all good" i.e. that both reads in the pair have aligned:
- 65 - 0001000001 - this is the first read in the pair and both reads aligned to the forward strand.
- 129 - 0010000001 - this is the second read of the pair and both reads aligned to the forward strand.
NOTE: 67 (0001000011) and 131 (0010000011) also mean the same as 65 and 129 with the added assurance that "the pair is properly aligned" meaning that they mapped within a proper distance from each other.
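Sometimes both reads of a pair are flipped (reverse complemented) before mapping. If so, you get 113 or 177:
- 113 - 0001110001 - "this is the first read of a pair, both reads in pair were flipped and both mapped".
- 177 - 0010110001 - "this is the second read of a pair, both reads in pair were flipped and both mapped".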
Other times only one of the reads in a pair is flipped though both of them map:
- 81 - 0001010001 - "this is the first read of pair, both reads mapped, we had to flip this read, but mate is in forward orientation".
- 161 - 0010100001 - "this is second read, this one is forward but we flipped its mate and both reads mapped".
NOTE: 163 (0010100011) and 83 (0001010011) are the same as 161 and 81 except "it is in a proper pair".
- 97 - 0001100001 - "this is first read, its mate is flipped but this is forward. Both mapped".
- 145 - 0010010001 - "this is second read. it is flipped but its mate is not. Both mapped".
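NOTE: 99 (0001100011) and 147 (0010010011) are the same as 97 and 145 except with "proper mapping in pair".
- Exercise: Can you see why the number of reads with flag 113 must be equal to the number of reads with flag 177? Similarly, 81=161 and 97=145. If those numbers don't match, something went wrong with your aligner.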
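The flags above come in mirror-image couples (81/161, 97/145, 113/177), and the mirroring can be written down as a bit swap. Here is a minimal sketch, assuming primary alignments of ordinary two-read pairs; `mate_flag` is a hypothetical helper, not a function from any SAM library.

```python
# Each tuple pairs a "this read" bit with the matching "mate" bit.
SWAPS = [
    (4, 8),     # unmapped        <-> mate unmapped
    (16, 32),   # reverse strand  <-> mate reverse strand
    (64, 128),  # first in pair   <-> second in pair
]


def mate_flag(flag):
    """Flag we would expect on the mate of a paired read carrying `flag`."""
    swapped_bits = sum(a | b for a, b in SWAPS)
    # Bits outside the swap list (paired, proper pair) are shared by both mates.
    result = flag & ~swapped_bits
    for a, b in SWAPS:
        if flag & a:
            result |= b
        if flag & b:
            result |= a
    return result


for flag in (65, 81, 97, 113, 83, 99):
    print(flag, "<->", mate_flag(flag))
```

Since every read with one flag has a mate carrying the mirrored flag, the two counts have to agree, which is the reasoning behind comparing counts for each couple.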
"All Bad"
At the other end of the spectrum we have "all bad" i.e. neither the read nor its mate mapped:
- 77 - 0001001101 - First in pair, both reads in pair unmapped. "All bad".
- 141 - 0010001101 - Second in pair and "all bad".
- Exercise: Just like with 20, AnnoyingAlign puts flags of 93 or 125 on all unmapped pairs. What other flags can AnnoyingAlign use to maximize user annoyance?
- Exercise: Why are 79 and 143 particularly good words for Jabberwocky?
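The "all bad" values are just sums of individual bits, so you can build them, and test for them, without memorizing the totals. This is a rough sketch with constant names of my own choosing; `is_all_bad` is a hypothetical helper.

```python
PAIRED = 1
UNMAPPED = 4
MATE_UNMAPPED = 8
FIRST = 64
SECOND = 128

# Build the two "all bad" idioms from their parts.
all_bad_first = PAIRED | UNMAPPED | MATE_UNMAPPED | FIRST    # 1 + 4 + 8 + 64
all_bad_second = PAIRED | UNMAPPED | MATE_UNMAPPED | SECOND  # 1 + 4 + 8 + 128
print(all_bad_first, all_bad_second)


def is_all_bad(flag):
    """True if the read is paired and neither it nor its mate is mapped."""
    mask = PAIRED | UNMAPPED | MATE_UNMAPPED
    return (flag & mask) == mask


# 93 and 125 carry extra strand bits but are still "all bad" underneath.
for flag in (77, 141, 93, 125, 69, 65):
    print(flag, is_all_bad(flag))
```

Testing against a mask rather than comparing to 77 or 141 directly means the check keeps working when an aligner adds extra bits on top.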
Only one read maps
Next, we have the cases when only one read in a pair is mapped.
- 69 - 0001000101 - First read in pair. This read is unmapped but its mate is mapped.
- 137 - 0010001001 - second in pair. Read is mapped but mate is unmapped.
- 73 - 0001001001 - First read in pair. This read is mapped but its mate is not.
- 133 - 0010000101 - 2nd in pair. Read unmapped but mate is mapped.
Can you again see why the number of reads with flag 69 must be the same as the number of reads with flag 137?
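If you would rather check such counts than eyeball them, a few lines of Python will tally the flags. This is a minimal sketch that relies only on the FLAG being the second tab-separated column of each alignment line and on header lines starting with "@". The function name `count_flags` and the toy records are made up for illustration.

```python
from collections import Counter


def count_flags(sam_lines):
    """Count how often each flag value occurs in an iterable of SAM lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[1])] += 1  # FLAG is the second column
    return counts


# Toy records, invented for this example: two pairs where only one read maps.
toy = [
    "@HD\tVN:1.6",
    "r1\t73\tchr1\t100\t60\t4M\t=\t100\t0\tACGT\tIIII",
    "r1\t133\tchr1\t100\t0\t*\t=\t100\t0\tACGT\tIIII",
    "r2\t69\tchr1\t200\t0\t*\t=\t200\t0\tACGT\tIIII",
    "r2\t137\tchr1\t200\t60\t4M\t=\t200\t0\tACGT\tIIII",
]

counts = count_flags(toy)
print(counts[69] == counts[137], counts[73] == counts[133])
```

Because the function only iterates over lines, an open file handle for a SAM file can be passed in place of the list.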
There are of course many other combinations. The purpose here is not to enumerate them but to simply have some fun with the structure of these flags.
What is your favorite flag? Do you have other ways of remembering what these things mean as you look through SAM files?