SEQanswers

Go Back   SEQanswers > General



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to deal with biological replicates of nucleosome MNase-seq data? wisense Epigenetics 6 02-11-2014 12:09 PM
Convert merged BAM back to per lane BAM or FASTQ file danielsbrewer Bioinformatics 6 10-03-2013 07:29 AM
MNase-seq data analysis lix Bioinformatics 2 12-04-2012 09:19 AM
BAM file cleaning brofallon Bioinformatics 0 03-20-2012 06:41 AM

Reply
 
Thread Tools
Old 03-02-2014, 09:48 AM   #1
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default Cleaning MNase BAM file

Hi,
I have this dataset of MNase.
the raw reads were aligned with bwa, the BAM file is about 60GB, and I wonder if I should filter it.

The header of the file looks like this:

@RG ID:T0-001 DT:2009-12-18 PG:BWA PI:150 PL:ILLUMINA
@RG ID:T0-002 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
@RG ID:T0-003 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
@RG ID:T0-004 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
@RG ID:T0-005 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
@RG ID:T0-006 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
@RG ID:T0-007 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
@RG ID:T0-008 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
@RG ID:T0-009 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE4
@RG ID:T0-010 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
@RG ID:T0-011 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrM LN:16571
@SQ SN:MMTV LN:10937

Indicating that it is sorted, and it is a pileup(?) of many sequencing done over the years, (done on the same cell line)
Last two groups are paired-end, titled @RG ID:T0-010 and @RG ID:T0-011, the remaining are single-end.

Running samtools flagstat gives this output:

974253427 + 425863565 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
974252673 + 233075580 mapped (100.00%:54.73%)
603612708 + 143779894 paired in sequencing
301806108 + 71890193 read1
301806600 + 71889701 read2
603406978 + 82441932 properly paired (99.97%:57.34%)
603612645 + 91537096 with itself and mate mapped
0 + 7424943 singletons (0.00%:5.16%)
61464 + 8185912 with mate mapped to a different chr
61464 + 2965666 with mate mapped to a different chr (mapQ>=5)


First thing pops clearly is that there are quite a lot of reads who did not passed QC. However, 54.73% of them are aligned.

I saw that duplicates were not flagged but instead are tagged.

Considering all that, I would like to share some questions:
1) Should I filter out QC failed reads ?
I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?

2) How should I relate to duplicates ?
I run a "samtools view ...etc | grep -c ...etc " command to count the reads holding tags which indicate that they are duplicates. The output was 50169529 reads that are tagged as "duplicate", which id 3.6% of all the reads. I tend to relate to these reads as valid reads, considering that the cells were treated with MNase.

3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?

I would appreciate some thoughts of this. Thanks
Ohad is offline   Reply With Quote
Old 03-02-2014, 12:36 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by Ohad View Post
Considering all that, I would like to share some questions:
1) Should I filter out QC failed reads ?
Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

Quote:
I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?
Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.

Quote:
3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?
I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.
Brian Bushnell is offline   Reply With Quote
Old 03-02-2014, 03:52 PM   #3
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

Quote:
Originally Posted by Brian Bushnell View Post
Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.
I just finished doing fastqc.
I see most reads are 50bp, but there are 40 and 36 as well.
Quality control is GREEN and the box-plot way of distribution shows a large amount of reads with the "special value" of 2.
I counted 58,180,456 of them using grep, not close enough to explain ~425M bad reads.

Looking for some cases I saw these three:


HWI-ST227_83:3:23:15749:55015 528 chr1 9995 25 50M * 0 165
TTACGATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
##################################################
RG:Z:T0-008

HWI-ST227:145:C06RAACXX:1:1104:11867:197280 629 chr1 9996 0 * = 9996 129
TCTGATCTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT
###########################################?2+A1:1
RG:Z:T0-010

HWI-ST227_83:4:65:8785:4808 512 chr1 10000 25 50M * 0 165
ATTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCGT
##################################################
RG:Z:T0-009

Rolling down some few more lines on my screen I see flags 512 / 528 are quite common.

Perhaps those "2" quality lines are fine or explainable.

But another thought in mind is that it seems, by the read lengths and sequencing date, that this is an old data and perhaps the technology was still new and quality calling was less advanced.

I also see RED cross on 73.3% Duplication Level which is to be bothered by it, and some Kmer popping on some unique positions in the read. Though for Kmer I got the ORANGE escalation mark.

Quote:
I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.
I don't have the original raw files, and it would be a problem for my small server to hold those files if I extract them back from BAM.

My current direction of thought is to break this BAM into Read Groups and start evaluating each group's sequence quality singly.

Last edited by Ohad; 03-02-2014 at 04:13 PM.
Ohad is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:05 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO