Hello all,
I have observed an anomaly in all Illumina 1.5 pipeline data. I use Biopieces (www.biopieces.org) for trimming my data - and trim_seq basically removes residues with quality scores below a given threshold from the ends. When I plot the length distribution after trimming, I observe peaks at every 5 residues.
Code:
read_fastq -n 1000000 -i in.fastq | trim_seq | plot_lendist -k SEQ_LEN -x
[ASCII plot "Length Distribution": x-axis sequence length 0-50, y-axis record count 0-12000]
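For anyone who wants to reproduce the length distribution outside Biopieces, here is a minimal Python sketch of end trimming followed by a length count. It is not trim_seq's actual algorithm: the Phred+64 offset, the threshold of 20, and the function names are all assumptions made for illustration.

```python
from collections import Counter

PHRED_OFFSET = 64  # assumption: Illumina 1.5 style Phred+64 encoding


def decode_scores(scores, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in scores]


def trim_ends(seq, scores, min_qual=20, offset=PHRED_OFFSET):
    """Strip residues with quality below min_qual from both ends.

    Simplified stand-in for trim_seq; the real tool may behave differently.
    """
    quals = decode_scores(scores, offset)
    start, end = 0, len(quals)
    # Walk inwards from the 5' end while the quality is too low.
    while start < end and quals[start] < min_qual:
        start += 1
    # Walk inwards from the 3' end while the quality is too low.
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end], scores[start:end]


def length_distribution(records, min_qual=20):
    """Count trimmed lengths for an iterable of (seq, scores) pairs."""
    dist = Counter()
    for seq, scores in records:
        trimmed_seq, _ = trim_ends(seq, scores, min_qual)
        dist[len(trimmed_seq)] += 1
    return dist
```

Feeding the (sequence, scores) pairs of a FASTQ file to length_distribution gives the counts that plot_lendist would draw, which makes it easy to check whether the peaks survive a different trimming rule.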
Forensics indicate that the presence of B in the quality scores is the problem. If I remove all records containing any B then the anomaly disappears:
Code:
read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES -i | trim_seq | plot_lendist -k SEQ_LEN -x
[ASCII plot "Length Distribution": x-axis sequence length 0-50, y-axis record count 0-4000]
And for all records with B's:
Code:
read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES | trim_seq | plot_lendist -k SEQ_LEN -x
[ASCII plot "Length Distribution": x-axis sequence length 0-50, y-axis record count 0-7000]
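The two grab commands split the records on whether the quality string contains a B. A rough Python equivalent, assuming records are held as (sequence, scores) tuples, could look like the sketch below; split_on_marker and fraction_with_marker are hypothetical helper names, not part of Biopieces.

```python
def split_on_marker(records, marker="B"):
    """Return (without_marker, with_marker) lists of (seq, scores) records."""
    without, with_marker = [], []
    for seq, scores in records:
        if marker in scores:
            with_marker.append((seq, scores))
        else:
            without.append((seq, scores))
    return without, with_marker


def fraction_with_marker(records, marker="B"):
    """Fraction of records whose quality string contains the marker."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for _, scores in records if marker in scores)
    return hits / len(records)
```

The first list corresponds to the grab call with -i and the second to the one without it, and fraction_with_marker gives the share of reads that would be lost by filtering on B.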
Now, according to Wikipedia:
and the docs I have been able to find (page 32):
B (Q2) is used as an indicator that the quality of a sequence residue is substandard, but it doesn't really represent a quality score. trim_seq will regard B as Q2 and discard the residue - and to the best of my understanding - that is OK.
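With a Phred+64 offset, B decodes to 66 - 64 = 2, which is why it is treated as Q2. One way to dig further is to tabulate how long the run of B's at the 3' end of each read is, since that run sets the length left after trimming. The sketch below is only a diagnostic: it assumes the B's sit as one uninterrupted run at the end of the quality string, and the function names are made up for illustration.

```python
from collections import Counter


def trailing_run(scores, marker="B"):
    """Length of the uninterrupted run of marker at the 3' end of scores."""
    return len(scores) - len(scores.rstrip(marker))


def length_after_marker_removal(scores, marker="B"):
    """Read length that remains once the trailing marker run is removed."""
    return len(scores) - trailing_run(scores, marker)


def run_length_histogram(score_strings, marker="B"):
    """Count how often each trailing-run length occurs."""
    return Counter(trailing_run(s, marker) for s in score_strings)
```

If the histogram of run lengths shows the same 5-residue periodicity as the trimmed length distribution, the pattern is already present in where the B's start rather than being introduced by the trimming step.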
But I don't understand the cyclic behaviour I observe. 10-20% of all records contain a B, so I will lose a lot of data by filtering those reads.
Anyone?
(and why does Illumina keep changing the FASTQ encoding?)
Cheers,
Martin