I have a fasta, most of the sequence is from repeat region, so they can be mapped to reference in sam lots of times. I use bwa mem with "-a" option, so they appear many times in different position of reference. Now I want to calculate the total matched bases / total bases in the sam file. I use samtools stats result.sam, the report is below:
# This file was produced by samtools stats (1.3.1+htslib-1.3.1) and can be plott
ed using plot-bamstats
# This file contains statistics for all reads.
# The command line was: stats result.sam
# CHK, Checksum [2]Read Names [3]Sequences [4]Qualities
# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflo
w)
CHK 8146bdfb 3b00f1ed 091ce229
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN raw total sequences: 961
SN filtered sequences: 0
SN sequences: 961
SN is sorted: 0
SN 1st fragments: 961
SN last fragments: 0
SN reads mapped: 961
SN reads mapped and paired: 0 # paired-end technology bit set
+ both mates mapped
SN reads unmapped: 0
SN reads properly paired: 0 # proper-pair bit set
SN reads paired: 0 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 384 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 24643
SN total length: 4353396 # ignores clipping
SN bases mapped: 4353396 # ignores clipping
SN bases mapped (cigar): 4428969 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 79181 # from NM fields
SN error rate: 1.787798e-02 # mismatches / bases mapped (cigar)
SN average length: 4530
SN maximum length: 14146
SN average quality: 255.0
SN insert size average: 0.0
SN insert size standard deviation: 0.0
SN inward oriented pairs: 0
SN outward oriented pairs: 0
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
The total mapped bases(cigar) 4428969 is longer than the total length 4353396, because the my reads are mostly repeat? What is the total bases mean in the report? So if I want to get the ratio of total matched bases / total bases, do I need to analysis cigar myself ? I try picard, but it always report error though I validate the sam without error.
# This file was produced by samtools stats (1.3.1+htslib-1.3.1) and can be plott
ed using plot-bamstats
# This file contains statistics for all reads.
# The command line was: stats result.sam
# CHK, Checksum [2]Read Names [3]Sequences [4]Qualities
# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflo
w)
CHK 8146bdfb 3b00f1ed 091ce229
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN raw total sequences: 961
SN filtered sequences: 0
SN sequences: 961
SN is sorted: 0
SN 1st fragments: 961
SN last fragments: 0
SN reads mapped: 961
SN reads mapped and paired: 0 # paired-end technology bit set
+ both mates mapped
SN reads unmapped: 0
SN reads properly paired: 0 # proper-pair bit set
SN reads paired: 0 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 384 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 24643
SN total length: 4353396 # ignores clipping
SN bases mapped: 4353396 # ignores clipping
SN bases mapped (cigar): 4428969 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 79181 # from NM fields
SN error rate: 1.787798e-02 # mismatches / bases mapped (cigar)
SN average length: 4530
SN maximum length: 14146
SN average quality: 255.0
SN insert size average: 0.0
SN insert size standard deviation: 0.0
SN inward oriented pairs: 0
SN outward oriented pairs: 0
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
The total mapped bases(cigar) 4428969 is longer than the total length 4353396, because the my reads are mostly repeat? What is the total bases mean in the report? So if I want to get the ratio of total matched bases / total bases, do I need to analysis cigar myself ? I try picard, but it always report error though I validate the sam without error.