I'm having some problems with samtools flagstat (which surprisingly should be straightforward as it has no parameters that can be wrongly defined). Samtools flagstat is used to return summaries of the flags produced within bam files. I was given intensities from exome sequencing, converted them into fastq files and removed reads with low quality (fastq format). I ran an alignment on fastq files using BWA, and got the following weird results from flagstat:
109638024 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
108394681 + 0 mapped (98.87%:nan%)
109638024 + 0 paired in sequencing
54819012 + 0 read1
54819012 + 0 read2
106794832 + 0 properly paired (97.41%:nan%)
108037629 + 0 with itself and mate mapped
357052 + 0 singletons (0.33%:nan%)
1071345 + 0 with mate mapped to a different chr
977974 + 0 with mate mapped to a different chr (mapQ>=5)
Common biological knowledge dictates that every sequencing should produce 10-15% PCR or optical duplicates, which is why I was surprised when I got 0 duplicates (I tried it on several samples in the same run, and they all gave 0 + 0 duplicates, which makes me feel something is wrong with flagstat). It got even weirder after I did samtools rmdup and retrieved the flagstat for that bam file:
59158619 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
57915276 + 0 mapped (97.90%:nan%)
59158619 + 0 paired in sequencing
26999098 + 0 read1
32159521 + 0 read2
56923778 + 0 properly paired (96.22%:nan%)
57745533 + 0 with itself and mate mapped
169743 + 0 singletons (0.29%:nan%)
727165 + 0 with mate mapped to a different chr
672895 + 0 with mate mapped to a different chr (mapQ>=5)
So about half of the original reads are kept, i.e. there are some duplicates, but they were not flagged in bwa. How could this have happened?
109638024 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
108394681 + 0 mapped (98.87%:nan%)
109638024 + 0 paired in sequencing
54819012 + 0 read1
54819012 + 0 read2
106794832 + 0 properly paired (97.41%:nan%)
108037629 + 0 with itself and mate mapped
357052 + 0 singletons (0.33%:nan%)
1071345 + 0 with mate mapped to a different chr
977974 + 0 with mate mapped to a different chr (mapQ>=5)
Common biological knowledge dictates that every sequencing should produce 10-15% PCR or optical duplicates, which is why I was surprised when I got 0 duplicates (I tried it on several samples in the same run, and they all gave 0 + 0 duplicates, which makes me feel something is wrong with flagstat). It got even weirder after I did samtools rmdup and retrieved the flagstat for that bam file:
59158619 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
57915276 + 0 mapped (97.90%:nan%)
59158619 + 0 paired in sequencing
26999098 + 0 read1
32159521 + 0 read2
56923778 + 0 properly paired (96.22%:nan%)
57745533 + 0 with itself and mate mapped
169743 + 0 singletons (0.29%:nan%)
727165 + 0 with mate mapped to a different chr
672895 + 0 with mate mapped to a different chr (mapQ>=5)
So about half of the original reads are kept, i.e. there are some duplicates, but they were not flagged in bwa. How could this have happened?
Comment