We recently got Illumina mate-pair sequencing data for a human sample. I am trying to do some quality checks to see how good the data is.
I used bwa for alignment and the output bam file gave these numbers:
324061746 + 0 in total (QC-passed reads + QC-failed reads)
265547763 + 0 mapped (81.94%:-nan%)
239584614 + 0 properly paired (73.93%:-nan%)
244063409 + 0 with itself and mate mapped
21484354 + 0 singletons (6.63%:-nan%)
993167 + 0 with mate mapped to a different chr
812922 + 0 with mate mapped to a different chr (mapQ>=5)
I knew (from FastQC) that the data had a high rate of duplication (>90%), so I marked and removed duplicates with Picard and got these numbers from the deduplicated bam file:
114026332 + 0 in total (QC-passed reads + QC-failed reads)
55512349 + 0 mapped (48.68%:-nan%)
114026332 + 0 paired in sequencing
57259162 + 0 read1
56767170 + 0 read2
49051344 + 0 properly paired (43.02%:-nan%)
50304349 + 0 with itself and mate mapped
5208000 + 0 singletons (4.57%:-nan%)
371423 + 0 with mate mapped to a different chr
246089 + 0 with mate mapped to a different chr (mapQ>=5)
As you can see, ~210 M reads (324,061,746 - 114,026,332) were marked as duplicates and consequently removed from the alignment file.
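To make the arithmetic behind that statement explicit, here is a minimal sketch that recomputes the duplicate counts from the two summaries above. It only uses the totals and mapped counts pasted in this post; the function name `dedup_summary` and the dictionary keys are made up for illustration and are not part of bwa, samtools, or Picard.

```python
# Counts copied from the two summaries in this post (before and after dedup).
before = {"total": 324_061_746, "mapped": 265_547_763}
after = {"total": 114_026_332, "mapped": 55_512_349}


def dedup_summary(before, after):
    """Compare read counts before and after duplicate removal (hypothetical helper)."""
    removed = before["total"] - after["total"]
    return {
        "removed": removed,
        # Unmapped reads cannot be position-based duplicates, so this count
        # should be unchanged if only mapped duplicates were dropped.
        "unmapped_before": before["total"] - before["mapped"],
        "unmapped_after": after["total"] - after["mapped"],
        "dup_frac_of_total": removed / before["total"],
        "dup_frac_of_mapped": removed / before["mapped"],
    }


summary = dedup_summary(before, after)
print(f"reads removed:          {summary['removed']:,}")
print(f"unmapped before/after:  {summary['unmapped_before']:,} / {summary['unmapped_after']:,}")
print(f"duplicates / all reads: {summary['dup_frac_of_total']:.1%}")
print(f"duplicates / mapped:    {summary['dup_frac_of_mapped']:.1%}")
```

The unmapped count is identical in both files, so all removed reads came from the mapped set. That is also why the mapped percentage drops from 81.94% to 48.68% after deduplication: the denominator shrank while the unmapped reads stayed.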
My questions are:
1) Is it reasonable for human mate-pair libraries (insert size ~5kb) to have such high rates of duplication?
2) Does this reflect an average/good/bad mate pair sequencing run?
3) Do you have any other suggestions for checking the quality of mate-pair sequencing data in general?
I have looked at other threads and this seemed like the only one somewhat relevant.
Thanks