Hi Folks,
Our standard process for preparing BAM files involves the following:
samtools sort
samtools rmdup
samtools sort
samtools index
20682963562 Mar 19 20:13 EAP120_R2.fastq.bam
12717567745 Mar 19 21:50 EAP120_R2.fastq.srt.bam
4529263043 Mar 19 22:11 EAP120_R2.fastq.nodup.bam
4529263062 Mar 19 22:38 EAP120_R2.fastq.nodup.srt.bam
So there will be 4 BAM files from this process. To evaluate them, I used samtools flagstat.
What I found with one of our samples is that the number of mapped reads drops from 172.6 million to 52.6 million after duplicate removal. (The flagstat data are below.)
Why would that occur? Does that indicate poor quality data?
It would seem to me to indicate that there are a LOT of duplicates. Is that correct? Should I normally see a reduction of roughly 70 % in the number of mapped reads after duplicate removal?
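To put a number on the drop, here is a minimal Python sketch that works out the fraction of reads removed. The counts are hard-coded from the flagstat output further down, and the helper name `fraction_removed` is only for illustration.

```python
# Counts copied from the flagstat output in this post
# (initial BAM file vs. BAM file after duplicate removal).
total_before, total_after = 177_833_072, 57_792_684
mapped_before, mapped_after = 172_622_167, 52_581_779


def fraction_removed(before: int, after: int) -> float:
    """Return the fraction of reads lost between two flagstat counts."""
    return (before - after) / before


print(f"total reads removed:  {fraction_removed(total_before, total_after):.1%}")
print(f"mapped reads removed: {fraction_removed(mapped_before, mapped_after):.1%}")
```

With these counts the script reports 67.5% of all reads and 69.5% of mapped reads removed.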
So, I guess what I'm asking is: how should I interpret flagstat results?
Thanks for any light you can shed on this.
Joe White
-----------------------
# initial BAM file
samtools flagstat EAP120_R2.fastq.bam
177833072 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
172622167 + 0 mapped (97.07%:-nan%)
177833072 + 0 paired in sequencing
88916536 + 0 read1
88916536 + 0 read2
169943152 + 0 properly paired (95.56%:-nan%)
171556513 + 0 with itself and mate mapped
1065654 + 0 singletons (0.60%:-nan%)
1376256 + 0 with mate mapped to a different chr
1254219 + 0 with mate mapped to a different chr (mapQ>=5)
# after the first sort
samtools flagstat EAP120_R2.fastq.srt.bam
177833072 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
172622167 + 0 mapped (97.07%:-nan%)
177833072 + 0 paired in sequencing
88916536 + 0 read1
88916536 + 0 read2
169943152 + 0 properly paired (95.56%:-nan%)
171556513 + 0 with itself and mate mapped
1065654 + 0 singletons (0.60%:-nan%)
1376256 + 0 with mate mapped to a different chr
1254219 + 0 with mate mapped to a different chr (mapQ>=5)
# after duplicate removal
samtools flagstat EAP120_R2.fastq.nodup.bam &
57792684 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
52581779 + 0 mapped (90.98%:-nan%)
57792684 + 0 paired in sequencing
28898595 + 0 read1
28894089 + 0 read2
50015042 + 0 properly paired (86.54%:-nan%)
51516125 + 0 with itself and mate mapped
1065654 + 0 singletons (1.84%:-nan%)
1376256 + 0 with mate mapped to a different chr
1254219 + 0 with mate mapped to a different chr (mapQ>=5)
# after the second sort
samtools flagstat EAP120_R2.fastq.nodup.srt.bam &
57792684 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
52581779 + 0 mapped (90.98%:-nan%)
57792684 + 0 paired in sequencing
28898595 + 0 read1
28894089 + 0 read2
50015042 + 0 properly paired (86.54%:-nan%)
51516125 + 0 with itself and mate mapped
1065654 + 0 singletons (1.84%:-nan%)
1376256 + 0 with mate mapped to a different chr
1254219 + 0 with mate mapped to a different chr (mapQ>=5)
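One way to read these reports side by side is to parse them and compare each category before and after duplicate removal. The following is a rough sketch rather than a definitive tool: `parse_flagstat` is a made-up helper, the regular expressions assume the line layout shown above, and the two reports are pasted in as strings instead of being read from samtools.

```python
import re

# flagstat output for the initial BAM file, pasted from above.
BEFORE = """\
177833072 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
172622167 + 0 mapped (97.07%:-nan%)
177833072 + 0 paired in sequencing
88916536 + 0 read1
88916536 + 0 read2
169943152 + 0 properly paired (95.56%:-nan%)
171556513 + 0 with itself and mate mapped
1065654 + 0 singletons (0.60%:-nan%)
1376256 + 0 with mate mapped to a different chr
1254219 + 0 with mate mapped to a different chr (mapQ>=5)
"""

# flagstat output after duplicate removal, pasted from above.
AFTER = """\
57792684 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
52581779 + 0 mapped (90.98%:-nan%)
57792684 + 0 paired in sequencing
28898595 + 0 read1
28894089 + 0 read2
50015042 + 0 properly paired (86.54%:-nan%)
51516125 + 0 with itself and mate mapped
1065654 + 0 singletons (1.84%:-nan%)
1376256 + 0 with mate mapped to a different chr
1254219 + 0 with mate mapped to a different chr (mapQ>=5)
"""


def parse_flagstat(text):
    """Map each flagstat category label to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not match:
            continue
        # Drop a trailing percentage such as "(97.07%:-nan%)" so that
        # labels are identical across reports.
        label = re.sub(r"\s*\([^()]*%[^()]*\)\s*$", "", match.group(3))
        counts[label] = int(match.group(1))
    return counts


before = parse_flagstat(BEFORE)
after = parse_flagstat(AFTER)

# Print before, after, and the number of reads removed per category.
for label, n_before in before.items():
    n_after = after[label]
    print(f"{label:55s} {n_before:>11d} {n_after:>11d} {n_before - n_after:>11d}")

# Categories whose counts did not change at all.
unchanged = [label for label in before if before[label] == after[label]]
print("unchanged:", unchanged)
```

On these numbers the singleton counts and both "mate mapped to a different chr" counts are identical before and after, and the number of unmapped reads (total minus mapped) is also unchanged, so every removed read comes from a pair with both mates mapped to the same chromosome.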