SEQanswers

09-02-2011, 02:46 AM   #1
casbon
Junior Member
Location: Cambridge, UK
Join Date: Sep 2011
Posts: 7

Sequencing low complexity libraries: effects on data

I am planning some experiments that involve sequencing products that have a standard adaptor sequence at the start.

Now, I know that cluster identification uses bases 1-5, so I have thought about putting an NNNNN stretch after the sequencing primer. This should ensure that clusters are identified correctly.

However, for bases 6..15 all clusters have the same base. This will produce a single colour per flow, and there will potentially be optical effects due to saturation. Now, I don't really care about these bases, I am only interested in the genomic bases after the adaptor. So my question is: will the later bases be sequenced OK given that the early bases may have these problems?

Also, what will happen for the paired end read if that also has low complexity bases at the start? Since the cluster identification happens during the first read, the effect should be the same?
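
A rough way to see which cycles are affected is to tabulate per-cycle base composition on a handful of reads. The following is only a minimal sketch: the adaptor sequence, the toy reads and the 0.9 threshold are made up for illustration and do not come from any real run.

```python
from collections import Counter


def per_cycle_composition(reads):
    """Return one dict per cycle mapping base -> fraction of reads with that base."""
    if not reads:
        return []
    # Only look at cycles present in every read.
    n_cycles = min(len(r) for r in reads)
    composition = []
    for cycle in range(n_cycles):
        counts = Counter(r[cycle] for r in reads)
        total = sum(counts.values())
        composition.append({base: counts[base] / total for base in "ACGTN"})
    return composition


def low_diversity_cycles(reads, threshold=0.9):
    """1-based cycle numbers where a single base makes up >= threshold of reads."""
    return [
        i + 1
        for i, comp in enumerate(per_cycle_composition(reads))
        if max(comp.values()) >= threshold
    ]


# Toy data: 5 random bases, a hypothetical 10-base adaptor, then insert sequence.
ADAPTOR = "ACGTACGTAC"  # hypothetical, for illustration only
reads = [
    "ACGTA" + ADAPTOR + "GGATCC",
    "CGTAC" + ADAPTOR + "TTAGCA",
    "GTACG" + ADAPTOR + "CAGTTG",
    "TACGT" + ADAPTOR + "ATCGAT",
]

print(low_diversity_cycles(reads))
```

With these toy reads the flagged cycles are 6 to 15, i.e. exactly the adaptor stretch, while the random prefix and the insert are not flagged.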
09-02-2011, 02:49 AM   #2
casbon
Junior Member
Location: Cambridge, UK
Join Date: Sep 2011
Posts: 7

PS this thread was useful: http://seqanswers.com/forums/showthread.php?t=9150

but that deals with deferring cluster identification till after the low complexity bases. I want to know the effect of low complexity bases after a successful cluster identification.
09-02-2011, 04:07 AM   #3
fkrueger
Senior Member
Location: Cambridge, UK
Join Date: Sep 2009
Posts: 620

Hi casbon,

if all of your sequences have the same kind of adapter sequence at the start, can't you just avoid the whole low complexity issue by using a custom sequencing primer for that lane so that you start reading straight into the genomic sequence?

In our experience, low complexity after the initial bases is not that much of a problem, and it is certainly not nearly as bad as having it right at the start. If uniform base composition were a serious problem in general, the shuffling process would not work very well either. It does work quite well, even though the qualities generally do not quite reach the standards of a normal run (this is most likely due to phasing/prephasing, though).

And yes, paired ends would only suffer slightly from technical issues with basecalling, but not from any influence on cluster detection.
09-02-2011, 04:30 AM   #4
casbon
Junior Member
Location: Cambridge, UK
Join Date: Sep 2011
Posts: 7

Thanks, fkrueger.

There are slight complications with dealing with a custom sequencing primer that I didn't disclose.

In light of your comments, I think I might just try a lane and see how it turns out.
09-02-2011, 04:34 AM   #5
fkrueger
Senior Member
Location: Cambridge, UK
Join Date: Sep 2009
Posts: 620

In any case, if you could convince your sequencing provider to keep hold of the images of the run this might possibly help you if you want to reprocess the data, e.g. only including cycles 1-5 and 16-end for the basecalling procedure. Or bareback shuffling of the first 15 bp for that matter... Good luck!
09-02-2011, 07:29 AM   #6
NextGenSeq
Senior Member
Location: USA
Join Date: Apr 2009
Posts: 482

The HiSeq doesn't save any of the images, so the above suggestion would only work on the GAIIx.
09-05-2011, 06:38 AM   #7
huguesparri
Member
Location: Montpellier (France)
Join Date: May 2008
Posts: 93

You can also try the following:
- increase the amount of PhiX you're spiking into your library prior to hybridization on the flow cell. For some really low complexity libraries, you can go up to 50% PhiX. This should be really useful when sequencing libraries where all your fragments start with the same bases.
- try to dilute your libraries a bit more than usual before you hybridize them on the flow cell (4 pM as opposed to the usual 6 to 8 pM, for example). You will end up with fewer sequences but you should avoid some of the identification problems.
Both these methods were given to us by Illumina's tech support. We have tried the second one so far with some success and we are going to try the first one soon.
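
To get a feel for the trade-off with the PhiX spike-in, here is a back-of-the-envelope sketch. It assumes that PhiX contributes each base at roughly 25% per cycle and that the library shows one single base at the affected cycles; both are simplifying assumptions, and the function names are made up for this example.

```python
def dominant_base_fraction(phix_fraction, library_dominant=1.0, phix_base_fraction=0.25):
    """Fraction of all clusters showing the library's dominant base at one cycle.

    phix_fraction: share of clusters that are PhiX (0.0 to 1.0).
    library_dominant: share of library clusters with the dominant base (1.0 = all).
    phix_base_fraction: assumed share of PhiX clusters with that same base.
    """
    library_fraction = 1.0 - phix_fraction
    return library_fraction * library_dominant + phix_fraction * phix_base_fraction


def library_yield(total_clusters, phix_fraction):
    """Clusters left over for the library of interest after the spike-in."""
    return total_clusters * (1.0 - phix_fraction)


# Compare a few spike-in levels; 50% is the upper figure mentioned above.
for spike in (0.0, 0.1, 0.25, 0.5):
    print(spike, dominant_base_fraction(spike), library_yield(1.0, spike))
```

Under these assumptions a 50% spike brings the dominant base down from 100% to 62.5% of clusters at the biased cycles, at the cost of half the lane's reads.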
09-05-2011, 11:51 PM   #8
simonandrews
Simon Andrews
Location: Babraham Inst, Cambridge, UK
Join Date: May 2009
Posts: 871

There are basically two problems with biased libraries. Firstly, a lack of diversity in the first few bases means that overlapping clusters can't be separated, so the region of measurement identified can span two clusters, leading to mixed signals when the sequences later diverge. Secondly, the highly biased sequence composition messes up the signal intensity calibration, so the quality of called bases can suffer.

The solution to the first problem is to either dilute your library to the point where very few overlapping clusters are found, or to do the cluster calling from a later set of cycles, either by specifying the cycles to use when setting up the run (with a limited range of options), or by saving images and using something like bareback to shuffle the order in which they're presented to the cluster calling program.

The solution to the second problem is either to increase the diversity of your library through the introduction of more random sequences, or to use an external calibration, either a standard fixed one, or one derived from a different diverse lane on the same flowcell.

Adding PhiX attempts to solve both of these problems in one step - reducing the effective concentration of the biased library, and introducing some added diversity. Alternatively you could just dilute your library more and use a control lane elsewhere on the flowcell. Either of these approaches will yield substantially less data than a deferred cluster calling but they're much better than doing a standard analysis on a biased high density library which can, in extreme cases, return no data at all.

In your specific case, if you introduce random bases at the start so that the clusters are called correctly, you may still find that all of your sequences end up being rejected due to the compositional bias later in the read. Actually the calls for your later bases will probably be OK, but one of the Illumina filters looks for deteriorating quality and then flags all remaining bases with low quality scores, even if the quality later improves (the so-called 'killer Bs'). You can turn this off using the undocumented parameter NO-EAMSS when processing, which will preserve the original qualities. If you then trim your sequences to just your bases of interest, the qualities there should be OK.
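
As a rough illustration of checking for this and doing the trim, here is a minimal sketch. It is not the EAMSS filter itself. It assumes qualities are in the Phred+64 encoding where the masked bases show up as 'B' (Q2), and that the first 15 cycles (5 random bases plus a 10-base adaptor) are the part to discard; the function names and the example read are made up.

```python
def trailing_b_run(quality):
    """Length of the run of 'B' characters at the 3' end of a quality string."""
    return len(quality) - len(quality.rstrip("B"))


def phred64_scores(quality):
    """Decode a Phred+64 quality string into integer scores."""
    return [ord(ch) - 64 for ch in quality]


def trim_to_insert(sequence, quality, skip=15):
    """Drop the first `skip` cycles, keeping bases and qualities in step."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return sequence[skip:], quality[skip:]


# Made-up read: qualities collapse to 'B' partway through the adaptor stretch.
seq = "ACGTAACGTACGTACGGATCCTTAG"
qual = "hhhhhggfffBBBBBBBBBBBBBBB"

insert_seq, insert_qual = trim_to_insert(seq, qual, skip=15)
print(trailing_b_run(qual))         # how much of the read has been masked
print(insert_seq, phred64_scores(insert_qual))
```

If the trimmed insert comes back as all Q2 like this, the masking has hit the bases of interest and reprocessing with the filter turned off is worth a try.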
Tags
adaptor, illumina, low complexity

All times are GMT -8.

