Seqanswers Leaderboard Ad

**Brian Bushnell** · 03-02-2014, 01:36 PM

Originally posted by Ohad View Post

Considering all that, I would like to share some questions:
1) Should I filter out QC failed reads ?

Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?

Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.

3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?

I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.

**Ohad** · 03-02-2014, 04:52 PM

Originally posted by Brian Bushnell View Post

Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.

I just finished doing fastqc.
I see most reads are 50bp, but there are 40 and 36 as well.
Quality control is GREEN and the box-plot way of distribution shows a large amount of reads with the "special value" of 2.
I counted 58,180,456 of them using grep, not close enough to explain ~425M bad reads.

Looking for some cases I saw these three:

HWI-ST227_83:3:23:15749:55015 528 chr1 9995 25 50M * 0 165
TTACGATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
##################################################
RG:Z:T0-008

HWI-ST227:145:C06RAACXX:1:1104:11867:197280 629 chr1 9996 0 * = 9996 129
TCTGATCTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT
###########################################?2+A1:1
RG:Z:T0-010

HWI-ST227_83:4:65:8785:4808 512 chr1 10000 25 50M * 0 165
ATTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCGT
##################################################
RG:Z:T0-009

Rolling down some few more lines on my screen I see flags 512 / 528 are quite common.

Perhaps those "2" quality lines are fine or explainable.

But another thought in mind is that it seems, by the read lengths and sequencing date, that this is an old data and perhaps the technology was still new and quality calling was less advanced.

I also see RED cross on 73.3% Duplication Level which is to be bothered by it, and some Kmer popping on some unique positions in the read. Though for Kmer I got the ORANGE escalation mark.

I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.

I don't have the original raw files, and it would be a problem for my small server to hold those files if I extract them back from BAM.

My current direction of thought is to break this BAM into Read Groups and start evaluating each group's sequence quality singly.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 51 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 56 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Cleaning MNase BAM file

Comment

Comment

Latest Articles

ad_right_rmr

News