SEQanswers

07-29-2011, 05:39 AM   #1
arolfe
Member
 
Location: 02119

Join Date: Jul 2011
Posts: 29
base composition and base calling

I've read that the Illumina base-calling software has problems calibrating itself if the base composition in the first few bases of the reads isn't a roughly equal mix of nucleotides. We're thinking of sequencing constructs that begin with our own barcode and were wondering what the parameters are for correct base calling:

- how many positions are used to calibrate?
- what are the bounds on an acceptable nucleotide mixture? e.g., how far off from 25% each can you be?
- I believe you can calibrate on something other than the first four bases. How far into the read can you wait to calibrate?
- can different lanes in a run be calibrated differently? e.g., if our sample is one lane of a run, does that make this easier or harder for the sequencing facility?
- does any of this vary between the GAII and HiSeq?

Thanks!
Alex
07-29-2011, 06:49 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625

Hi Alex,

To my knowledge, the Illumina pipeline performs its crosstalk matrix and phasing/prephasing calibration during the first 4 cycles by default, and this can be altered with --matrix-cycles=n. As with using many cycles for cluster detection, this will probably mean that the workstation PC needs to store more images until the initial calibration calculations are done, at which point the real-time data analysis will start. Using lots of cycles will cause a backlog on the workstation, but this should be manageable for at least 10 or so cycles, I would think (at least on a GA; I'm not so sure about the HiSeq, as it generates so much more data).

You can avoid these problems by specifying a control lane with a relatively normal base composition (--control-lane=..), such as a lane of PhiX or whole-genome shotgun sequencing. Alternatively, it is possible to skip calibration on the sample and use a pre-formatted calibration table instead (probably slightly different ones for GA and HiSeq).

Something else you should consider is that you might lose a certain amount of data because cluster detection does not work normally if you have low diversity at the start of the sequences, and this is completely independent of a skewed base composition. This depends mainly on the number of barcodes in your sample and the cluster density. In summary, the fewer barcodes you have and the higher your cluster density, the more data you are likely to lose. Please refer to this post for more information (http://seqanswers.com/forums/showthr...light=bareback), or send me an email if you have any further questions.
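
As a rough illustration of the base-composition point, here is a minimal Python sketch (not part of the Illumina pipeline) that tabulates the fraction of A, C, G and T at each cycle for a set of barcodes and reports how far each cycle is from an even 25% mix. The barcode sequences and the function names `per_cycle_composition` and `max_deviation_from_even` are made up for the example, and it assumes every barcode contributes an equal number of reads. It does not tell you what deviation the base caller tolerates; it only lets you compare candidate barcode sets.

```python
from collections import Counter


def per_cycle_composition(barcodes):
    """Return one dict per cycle mapping each base to its fraction.

    Assumes all barcodes have the same length and are equally
    represented in the library (an assumption of this sketch).
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    n = len(barcodes)
    composition = []
    # zip(*barcodes) yields the bases seen at cycle 1, cycle 2, ...
    for cycle_bases in zip(*barcodes):
        counts = Counter(base.upper() for base in cycle_bases)
        composition.append({base: counts.get(base, 0) / n for base in "ACGT"})
    return composition


def max_deviation_from_even(fractions):
    """Largest absolute distance of any base fraction from 0.25."""
    return max(abs(f - 0.25) for f in fractions.values())


# Hypothetical barcode set, chosen only for illustration.
barcodes = ["ACGT", "CGTA", "GTAC", "TACG"]
for cycle, fractions in enumerate(per_cycle_composition(barcodes), start=1):
    print(cycle, fractions, round(max_deviation_from_even(fractions), 2))
```

With the four rotated barcodes above, every cycle is an even mix, so the deviation is 0 at each position; a single barcode would give a deviation of 0.75 at every cycle, which is the kind of start-of-read composition discussed in this thread.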
07-29-2011, 07:50 AM   #3
arolfe
Member
 
Location: 02119

Join Date: Jul 2011
Posts: 29

Thank you! This plus your paper is very helpful.