I understand fully what it means when a read is mapped at coordinate X to ref sequence NM_0012345 with a certain quality score etc.
I also understand fully if a read is unmapped (sam flag 4 is set). In most cases with these the Refname field of a SAM file is "*". Makes sense since there is no reference sequence.
I recently ran BWA on a set of unpaired reads and came upon lines in the SAM file like this:
R001778658_15_1 4 NM_008409.2 1536 18 100M ...
R002786437_15_1 4 NM_008409.2 1536 25 100M ...
R002993094_15_1 4 NM_008409.2 1535 23 100M ...
I have truncated each line (removing seqs etc) for clarity. With a SAM flag of 4 (meaning the read is unmapped) it makes no sense to have a reference name like NM_008409. So I investigated what this means.
The answer is that each of these alignments hangs off the 3' end of the ref sequence by a little bit: NM_008409 is 1632 long and these alignments begin at 1535 or 1536. Hence the reads (each 100 NTs long) stretch beyond the 3 prime end of the reference by 3 or 4 bases. Other than that the alignment is perfect (yes, I checked).
Since this is RNASeq, I would certainly want to count these reads toward this gene. So, based on the evidence thus far, I am inclined to ignore the SAM flag and consider the reference name for counting the tags. But I am not sure that this is the only type of situation where SAM flag 4 is set and a reference name is mentioned.
How do you deal with nominally "unmapped" reads which are nevertheless assigned to a particular reference sequence? Do you ignore the SAM flag or the ref name?
I also understand fully if a read is unmapped (sam flag 4 is set). In most cases with these the Refname field of a SAM file is "*". Makes sense since there is no reference sequence.
I recently ran BWA on a set of unpaired reads and came upon lines in the SAM file like this:
R001778658_15_1 4 NM_008409.2 1536 18 100M ...
R002786437_15_1 4 NM_008409.2 1536 25 100M ...
R002993094_15_1 4 NM_008409.2 1535 23 100M ...
I have truncated each line (removing seqs etc) for clarity. With a SAM flag of 4 (meaning the read is unmapped) it makes no sense to have a reference name like NM_008409. So I investigated what this means.
The answer is that each of these alignments hangs off the 3' end of the ref sequence by a little bit: NM_008409 is 1632 long and these alignments begin at 1535 or 1536. Hence the reads (each 100 NTs long) stretch beyond the 3 prime end of the reference by 3 or 4 bases. Other than that the alignment is perfect (yes, I checked).
Since this is RNASeq, I would certainly want to count these reads toward this gene. So, based on the evidence thus far, I am inclined to ignore the SAM flag and consider the reference name for counting the tags. But I am not sure that this is the only type of situation where SAM flag 4 is set and a reference name is mentioned.
How do you deal with nominally "unmapped" reads which are nevertheless assigned to a particular reference sequence? Do you ignore the SAM flag or the ref name?
Comment