View Single Post
Old 03-02-2014, 09:48 AM   #1
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default Cleaning MNase BAM file

I have this dataset of MNase.
the raw reads were aligned with bwa, the BAM file is about 60GB, and I wonder if I should filter it.

The header of the file looks like this:

@RG ID:T0-001 DT:2009-12-18 PG:BWA PI:150 PL:ILLUMINA
@RG ID:T0-002 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
@RG ID:T0-003 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
@RG ID:T0-004 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
@RG ID:T0-005 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
@RG ID:T0-006 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
@RG ID:T0-007 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
@RG ID:T0-008 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
@RG ID:T0-009 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE4
@RG ID:T0-010 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
@RG ID:T0-011 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrM LN:16571
@SQ SN:MMTV LN:10937

Indicating that it is sorted, and it is a pileup(?) of many sequencing done over the years, (done on the same cell line)
Last two groups are paired-end, titled @RG ID:T0-010 and @RG ID:T0-011, the remaining are single-end.

Running samtools flagstat gives this output:

974253427 + 425863565 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
974252673 + 233075580 mapped (100.00%:54.73%)
603612708 + 143779894 paired in sequencing
301806108 + 71890193 read1
301806600 + 71889701 read2
603406978 + 82441932 properly paired (99.97%:57.34%)
603612645 + 91537096 with itself and mate mapped
0 + 7424943 singletons (0.00%:5.16%)
61464 + 8185912 with mate mapped to a different chr
61464 + 2965666 with mate mapped to a different chr (mapQ>=5)

First thing pops clearly is that there are quite a lot of reads who did not passed QC. However, 54.73% of them are aligned.

I saw that duplicates were not flagged but instead are tagged.

Considering all that, I would like to share some questions:
1) Should I filter out QC failed reads ?
I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?

2) How should I relate to duplicates ?
I run a "samtools view ...etc | grep -c ...etc " command to count the reads holding tags which indicate that they are duplicates. The output was 50169529 reads that are tagged as "duplicate", which id 3.6% of all the reads. I tend to relate to these reads as valid reads, considering that the cells were treated with MNase.

3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?

I would appreciate some thoughts of this. Thanks
Ohad is offline   Reply With Quote