arolfe 07-29-2011 06:39 AM

base composition and base calling
I've read that the Illumina base-calling software has problems calibrating itself if the base composition in the first few bases of the reads isn't a roughly equal mix of nucleotides. We're thinking of sequencing constructs that begin with our own barcode and were wondering what the parameters are for correct base calling:

- how many positions are used to calibrate?
- what are the bounds on acceptable nucleotide mixture? e.g., how far off from 25% each can you be?
- I believe you can calibrate on something other than the first four bases. How far into the read can you wait to calibrate?
- can different lanes in a run be calibrated differently? e.g., if our sample is one lane of a run, does that make this easier or harder for the sequencing facility?
- does any of this vary between the GAII and HiSeq?
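
To make the "how far off from 25%" question concrete, here is a minimal Python sketch that tabulates the base composition of the first few cycles of a set of reads. The function names, the default of 4 cycles, and the example reads are all made up for illustration; this is not part of the Illumina software and says nothing about what deviation the pipeline actually tolerates.

```python
from collections import Counter

def per_cycle_composition(reads, n_cycles=4):
    """Return one dict per cycle mapping A/C/G/T to its fraction of reads.

    Reads shorter than a given cycle are skipped for that cycle. Any other
    symbol (e.g. N) counts toward the total but is not reported.
    """
    composition = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts.values())
        composition.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return composition

def max_deviation_from_uniform(fractions):
    """Largest absolute difference between any base fraction and 25%."""
    return max(abs(fraction - 0.25) for fraction in fractions.values())

# Hypothetical reads: three share the barcode ACGT, one starts with TGCA.
reads = ["ACGTTTGA", "ACGTCAGT", "ACGTGGCA", "TGCAACGT"]
for cycle, fractions in enumerate(per_cycle_composition(reads), start=1):
    print(cycle, fractions, max_deviation_from_uniform(fractions))
```

Running this on the start of your own reads (or on the barcode sequences weighted by expected abundance) shows per cycle how skewed the mixture would be.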


fkrueger 07-29-2011 07:49 AM

Hi Alex,

To my knowledge, the Illumina pipeline performs its crosstalk matrix and phasing/prephasing calibration during the first 4 cycles by default, and this can be altered with --matrix-cycles=n. As with using many cycles for cluster detection, this will probably mean that the workstation PC needs to store more images until the initial calibration calculations are done, at which point the real-time data analysis will start. Using lots of cycles will cause a backlog on the workstation, but this should be manageable for at least 10 or so cycles, I would think (at least on a GA; I'm not so sure about the HiSeq, as it generates so much more data).

You can avoid these problems by specifying a control lane with a relatively normal base composition (--control-lane=..), such as a lane of PhiX or whole-genome shotgun sequencing. Alternatively, it is also possible not to perform calibration on the sample and to use a pre-formatted calibration table instead (probably slightly different ones for GA and HiSeq).

Something else you should consider is that you might lose a certain amount of data because cluster detection does not work normally if you have low diversity at the start of the sequences, and this is completely independent of a skewed base composition. This depends mainly on the number of barcodes you have in your sample and on the cluster density. In summary, the fewer barcodes you have and the higher your cluster density, the more data you are likely to lose. Please refer to this post for more information, or send me an email if you have any further questions.
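
As a rough way to see how little diversity a small barcode set gives at each cycle, the following Python sketch counts how many distinct bases a barcode set presents per position. The function name and the example barcodes are hypothetical, and the count is only a crude proxy: it ignores the relative abundance of each barcode and the cluster density, both of which matter for how much data is actually lost.

```python
def distinct_bases_per_cycle(barcodes):
    """Count the distinct A/C/G/T bases seen at each barcode position.

    Only positions covered by every barcode are considered, so the result
    has the length of the shortest barcode.
    """
    if not barcodes:
        return []
    length = min(len(barcode) for barcode in barcodes)
    valid = set("ACGT")
    return [
        len({barcode[i] for barcode in barcodes} & valid)
        for i in range(length)
    ]

# Hypothetical two-barcode design: the first two cycles show a single base.
print(distinct_bases_per_cycle(["ACGT", "ACTA"]))

# Hypothetical four-barcode design with all four bases at every cycle.
print(distinct_bases_per_cycle(["ACGT", "CGTA", "GTAC", "TACG"]))
```

A cycle with a count of 1 means every cluster lights up in the same channel at that cycle, which is the low-diversity situation described above.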

arolfe 07-29-2011 08:50 AM

Thank you! This plus your paper is very helpful.
