View Single Post
Old 03-02-2014, 03:52 PM   #3
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

Quote:
Originally Posted by Brian Bushnell View Post
Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.
I just finished doing fastqc.
I see most reads are 50bp, but there are 40 and 36 as well.
Quality control is GREEN and the box-plot way of distribution shows a large amount of reads with the "special value" of 2.
I counted 58,180,456 of them using grep, not close enough to explain ~425M bad reads.

Looking for some cases I saw these three:


HWI-ST227_83:3:23:15749:55015 528 chr1 9995 25 50M * 0 165
TTACGATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
##################################################
RG:Z:T0-008

HWI-ST227:145:C06RAACXX:1:1104:11867:197280 629 chr1 9996 0 * = 9996 129
TCTGATCTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT
###########################################?2+A1:1
RG:Z:T0-010

HWI-ST227_83:4:65:8785:4808 512 chr1 10000 25 50M * 0 165
ATTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCGT
##################################################
RG:Z:T0-009

Rolling down some few more lines on my screen I see flags 512 / 528 are quite common.

Perhaps those "2" quality lines are fine or explainable.

But another thought in mind is that it seems, by the read lengths and sequencing date, that this is an old data and perhaps the technology was still new and quality calling was less advanced.

I also see RED cross on 73.3% Duplication Level which is to be bothered by it, and some Kmer popping on some unique positions in the read. Though for Kmer I got the ORANGE escalation mark.

Quote:
I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.
I don't have the original raw files, and it would be a problem for my small server to hold those files if I extract them back from BAM.

My current direction of thought is to break this BAM into Read Groups and start evaluating each group's sequence quality singly.

Last edited by Ohad; 03-02-2014 at 04:13 PM.
Ohad is offline   Reply With Quote