11-17-2016, 10:47 AM
Brian Bushnell
Unfortunately, Illumina has taken a turn for the worse again. I just analyzed some recent data from the NextSeq, HiSeq2500, and HiSeq 1T platforms, all from the same library. The NextSeq data is dramatically worse than the last time I looked at it: error rates are several times higher, there's a major A/T base frequency divergence in read 2, and the quality scores are inflated again, at ~6 points higher than the actual quality.

More disturbingly, the HiSeq quality scores are now completely inaccurate as well, though the actual measured quality is still very high: average Q33 for read 1 and Q29 for read 2 on the HiSeq2500, versus Q24 for read 1 and Q18 for read 2 on the NextSeq. Those numbers are measured by counting the match/mismatch rates from mapping, so essentially the NextSeq has roughly 10X the error rate of the HiSeq. But the discrepancy between claimed and measured quality scores is worse on BOTH the HiSeq2500 and the HiSeq 1T than on the NextSeq, despite the NextSeq having binned quality scores. As you can see, there are also large ranges of quality scores simply missing from the HiSeq2500 data, such as Q3-Q11, Q17-Q21, and Q29. There are clearly major problems with Illumina's current base-calling software, as quality score assignment has drastically regressed since the last time I measured it.
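
For anyone who wants to check the "roughly 10X" arithmetic, here is a minimal sketch of the standard Phred relation, Q = -10 * log10(error rate), applied to match/mismatch counts. The function names are made up for illustration and this is not the code used to produce the numbers above; it only shows how a measured Q score follows from mapping counts and how two Q scores translate into an error-rate ratio.

```python
import math


def phred_from_counts(matches, mismatches):
    """Measured Phred quality from mapped match/mismatch counts (hypothetical helper)."""
    total = matches + mismatches
    if total == 0:
        raise ValueError("no aligned bases to measure")
    if mismatches == 0:
        return float("inf")  # no observed errors, quality is unbounded
    return -10 * math.log10(mismatches / total)


def error_rate_from_phred(q):
    """Error probability implied by a Phred score."""
    return 10 ** (-q / 10)


def error_ratio(q_high, q_low):
    """How many times more errors the lower-quality data has."""
    return error_rate_from_phred(q_low) / error_rate_from_phred(q_high)


if __name__ == "__main__":
    # 1 mismatch in 1000 aligned bases corresponds to Q30.
    print(round(phred_from_counts(999, 1), 2))
    # Read 1: Q33 (HiSeq2500) vs Q24 (NextSeq), about 7.9x the errors.
    print(round(error_ratio(33, 24), 1))
    # Read 2: Q29 (HiSeq2500) vs Q18 (NextSeq), about 12.6x the errors.
    print(round(error_ratio(29, 18), 1))
```

The two ratios bracket 10X, which is where the rough figure comes from. The gap between claimed and measured quality is the same calculation done per claimed-quality bin, compared against the score the instrument assigned.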

You can see the graphs in this Excel sheet that I've linked. "Raw" is the raw data, and "Recal" is after recalibration (which changes the quality scores but nothing else). "NS" is NextSeq, "2500" is HiSeq2500, and "1T" is HiSeq 1T, which unfortunately was only run at 2x101bp instead of the 2x151bp used on the other two platforms.

