SEQanswers

Old 01-29-2011, 01:29 PM   #1
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default Loss of data in low-diversity libraries can be recovered by deferred cluster calling

It has been reported several times that low diversity at the start of Illumina sequencing libraries can lead to a large-scale loss of data because the standard pipeline gets the initial cluster identification wrong. Researchers at our institute have generated such low-diversity libraries on numerous occasions, including libraries which were digested with restriction enzymes prior to sequencing, libraries with custom barcode tags at the start of all sequences, RRBS and so on. We have developed a simple method called barcode-back-processing (bareback-processing for short) which allows cluster identification to be deferred to later cycles, and are happy to announce that this study has just been published in PLoS One (bareback manuscript).

With this study we would like to raise awareness within the sequencing community that certain types of experiments can be associated with tremendous problems on the Illumina platform (up to complete failures of entire sequencing lanes), and report a potential fix for this problem. We are aware that constant new software releases and hardware requirements might mean that the solution we present is only temporary, but hopefully our findings will be motivational for Illumina to include a proper solution to low-diversity libraries in one of their future pipeline versions.

Our method does in theory something comparable to the unofficial and undocumented Illumina option --image-flags, which has been mentioned here on Seqanswers before (undocumented option --image_flags, changes in Illumina Pipeline SCS 2.8/RTA 1.8, multiplexing on HiSeq). We have done a couple of comparisons between different standard Illumina pipeline versions, bareback and --image-flags processing, using real-world datasets from ongoing research projects. In some extreme cases, bareback-processing was able to recover more than 33 million good quality sequencing reads from a lane which produced literally 0 sequences with the standard Illumina pipeline SCS v2.6/OLB v1.6 processing and around 30,000 sequences with OLB v1.8 processing (see Supplementary Figures 1 and 2 for comparisons between --image-flags and bareback-processing). Interestingly, --image-flags was also very good at recovering extra sequences; however, we found that something odd happens when this undocumented option is used, as a fairly large percentage of reads contain poor quality base calls and/or many more Ns in the sequences. In summary, it appears that bareback-processing often produces more high quality reads than the built-in but undocumented option --image-flags.

Bareback-processing works by moving the raw cluster image files containing the initially biased sequences to the end of the reads before invoking the Illumina pipeline. After the analysis has been completed, the cycles containing the low-diversity sequences are moved back to the start of the sequence reads. This of course implies that it can only be applied if the actual image files are being stored (so it will not work for HiSeq machines, even though they will still suffer from exactly the same problems!). For Illumina GAIIx machines one either needs to run SCS v2.6 (which allows storing image information) and then reprocess from the images, preferably with OLB 1.8 (although this option will soon be unavailable), or upgrade the instrument PC hardware to at least a T7500. It will be interesting to see what future versions of the Illumina pipeline are going to offer...
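The reordering described above can be illustrated with a small Python sketch. It is not the bareback script itself: the function names are hypothetical, and it works on cycle numbers and read strings only, since the naming of the image files is not given in this post. It shows the order in which cycles would be handed to the pipeline and how a finished read is put back into its original order.

```python
def deferred_cycle_order(n_cycles, n_low):
    """Return original cycle numbers (1-based) in the order given to the pipeline.

    The first n_low low-diversity cycles are moved to the end, so cluster
    identification happens on cycles n_low+1 onward.
    """
    if not 0 <= n_low < n_cycles:
        raise ValueError("n_low must be >= 0 and smaller than n_cycles")
    return list(range(n_low + 1, n_cycles + 1)) + list(range(1, n_low + 1))


def restore_read(seq, qual, n_low):
    """Move the last n_low bases (and their qualities) back to the front."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if n_low == 0:
        return seq, qual
    return seq[-n_low:] + seq[:-n_low], qual[-n_low:] + qual[:-n_low]


# Toy example: 8 cycles, the first 3 of which are a shared low-diversity tag.
order = deferred_cycle_order(n_cycles=8, n_low=3)
print(order)  # [4, 5, 6, 7, 8, 1, 2, 3]

original = "AAATCGGC"
shuffled = "".join(original[c - 1] for c in order)  # what the pipeline would call
restored, _ = restore_read(shuffled, "I" * len(shuffled), n_low=3)
print(shuffled, restored)  # TCGGCAAA AAATCGGC
```

The quality string is moved together with the bases so that each score stays attached to the base it was called for.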

The images of two lanes will soon be available for download from the SRA archive, one being a highly diverse control library (PhiX), the other being a library with very low initial diversity (all sequences are supposed to have the first 12 bp in common) (lanes 1 and 4 from Supplementary Figure 2).

If you have any questions or comments please get in touch!
Attached Files
File Type: pdf Supplementary Figure 1.pdf (2.15 MB, 149 views)
File Type: pdf Supplementary Figure 2.pdf (2.36 MB, 76 views)
Old 01-29-2011, 05:19 PM   #2
GERALD
Member
 
Location: San Francisco, CA

Join Date: Jun 2010
Posts: 20
Default bareback

Actually, I have tried this myself and found it to be true. I just made a Perl script to copy all the files and rename them (called it goatfooler). Then, I ran CASAVA and used another script to copy the tags and qscores to the front of the call. After recalling them, my base calling went from utter failure to complete success. I actually tried the undocumented --image-flags option and, just as you described, it didn't work very well. My Illumina rep was utterly baffled by my results. It would be really nice if Illumina provided more documentation of how they do their basecalling. I'm glad to hear that someone else obtained similar results from their analysis.
Old 04-05-2011, 04:53 AM   #3
C.R.
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 25
Default

I strongly agree. This is a big problem and Illumina does not pay attention to it. In general my libraries are OK, since they worked in one test run on a Genome Analyzer. Now I have had 5 RRBS samples sequenced on a HiScanSQ, but all reads are trash due to the problem that is nicely described in your paper. Illumina tech support has not helped so far. For more than a week now they have only kept telling us that there was no technical problem during sequencing. Well, this is true, because the control lane and 2 lanes of ChIP-Seq are OK. Unfortunately, it seems that no high-resolution images were recorded, so I cannot use your software. Thank you very much for your helpful comments so far, Felix!
Is there anybody else who can tell me what needs to be considered for a successful Illumina HiSeq / HiScanSQ sequencing of RRBS libraries?
Old 04-27-2011, 11:28 AM   #4
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482
Default

We just had this same issue with our HiSeq 2000. How can we reanalyze this without the image files? Can this be done using the CIF files?
Old 04-27-2011, 11:31 AM   #5
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

I am afraid this won't work if you don't have the saved images. Did you lose entire lanes or just a certain fraction of it?

Last edited by fkrueger; 04-27-2011 at 11:37 AM.
Old 04-27-2011, 11:56 AM   #6
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482
Default

A fraction, the data quality drops off quickly after the barcode.

It's infuriating that Illumina has done nothing about this when they've known about this for years.
Old 04-27-2011, 03:30 PM   #7
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 505
Default

I'll be the first to admit that Illumina has made some mistakes (for example, generating a file format that its aligner cannot read), and they could do a better job of advertising the issue, but the decision not to save the image files seems a reasonable trade-off (although it would be nice to have the option to save). Transferring the images to the server had become the bottleneck for sequencing runs, and the problem was exacerbated when they rolled out the HiSeq. There are a couple of straightforward non-computational solutions: use custom sequencing primers if there's no diversity, or design multiple balanced barcodes for each sample to introduce diversity.
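As a rough illustration of what "balanced" could mean for a barcode set, the Python sketch below tabulates the base composition at each barcode position and flags positions dominated by one base. The function names, the example barcodes and the 0.5 cut-off are all hypothetical choices made for illustration, not values from this thread or from Illumina.

```python
from collections import Counter


def position_composition(barcodes):
    """Return, for each position, the fraction of barcodes carrying each base."""
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same (non-zero) count and length")
    n = len(barcodes)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*barcodes)
    ]


def unbalanced_positions(barcodes, max_fraction=0.5):
    """1-based positions where a single base exceeds max_fraction (assumed cut-off)."""
    return [
        i
        for i, comp in enumerate(position_composition(barcodes), start=1)
        if max(comp.values()) > max_fraction
    ]


balanced = ["ACGT", "CGTA", "GTAC", "TACG"]  # every base once per position
skewed = ["ACGT", "ACTA", "AGAC", "ATCG"]    # all barcodes start with A

print(unbalanced_positions(balanced))  # []
print(unbalanced_positions(skewed))    # [1]
```

In practice the fractions would need to be weighted by how much of each sample is loaded, which this sketch ignores.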
Old 04-28-2011, 04:42 AM   #8
protist
Senior Member
 
Location: Ireland

Join Date: Jan 2009
Posts: 101
Default

Has anyone tried the "Configurable Template Generation Cycles" option in the new SCS2.9/RTA1.9 when running indexed samples on a GAIIx? It allows deferred cluster calling for low-complexity or in-adapter barcoded samples. We have got the script from our FAS but have not tried it yet. Is there anyone out there who has?

From SCS2.9/RTA1.9 Release notes:
Configurable Template Generation Cycles: The SCS CIF file generation feature cannot start until RTA has generated the tile templates. This takes 5 cycles after the declared template generation cycle.
Normally template generation begins on cycle 1 and ends on cycle 5. However template generation requires a diversity of bases in the clusters of the template generation cycles. Some users have custom sample preparation procedures that place arbitrary sequences on the clusters, adapters or indexing "spikes", etc. The required diversity of bases may not be present in this case, and it is possible to delay template generation until the actual sample is being sequenced.
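One way to think about where template generation could be started is to look for the first run of five consecutive cycles with a good mix of bases. The Python sketch below does this with per-cycle Shannon entropy on a handful of reads. It only illustrates the idea: the entropy measure, the 1.5-bit threshold and the function names are assumptions, and nothing here reflects how RTA itself evaluates diversity.

```python
import math
from collections import Counter


def cycle_entropy(reads, cycle):
    """Shannon entropy in bits of the base calls at a 1-based cycle."""
    counts = Counter(read[cycle - 1] for read in reads)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def first_diverse_cycle(reads, window=5, min_entropy=1.5):
    """First 1-based cycle that starts `window` consecutive diverse cycles.

    Returns None if no such window exists. The window of 5 mirrors the five
    template generation cycles in the release notes; min_entropy is an
    arbitrary illustrative threshold (2.0 bits means A, C, G, T are equal).
    """
    n_cycles = min(len(r) for r in reads)
    entropies = [cycle_entropy(reads, c) for c in range(1, n_cycles + 1)]
    for start in range(n_cycles - window + 1):
        if all(e >= min_entropy for e in entropies[start:start + window]):
            return start + 1
    return None


# Toy reads: a shared 3-base tag followed by diverse sequence.
reads = ["TCGACGTA", "TCGCGTAC", "TCGGTACG", "TCGTACGT"]
print(first_diverse_cycle(reads))  # 4
```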
Old 04-28-2011, 05:30 AM   #9
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Quote:
Originally Posted by protist View Post
Has anyone tried the "Configurable Template Generation Cycles option" in the new SCS2.9/RTA1.9 when running indexed samples on a GAIIx. It allows deferred cluster calling for low complexity or in adapter bar-coded samples. We have got the script from our FAS but have not tried it as yet....wondering if there is anyone out there who has?

I would also be interested if anyone had used this "new" option. After talking to our Illumina rep we don't have any reason to believe that the "Configurable Template Generation Cycles" option is any different from the previous unofficial option "--image-flags". Thus, I would imagine that the basecalls would still suffer from mysteriously bad qualities, see the Supplementary Figures linked in the first post of this thread.

I am not quite sure, but I also think that this option can only be applied to the entire flowcell and not on a per-lane basis, right?
Old 06-06-2011, 08:43 PM   #10
DNAANDDAN
Junior Member
 
Location: china

Join Date: Jan 2010
Posts: 2
Default how about PE data

Hi, I have the same issue with my data. However, my data are paired-end Solexa reads: cycles 1-81 are read 1 and cycles 82-162 are read 2, and cycles 1-7 and 82-88 are barcodes with low diversity.
Could bareback handle this kind of data?
Old 06-07-2011, 12:05 AM   #11
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Hi Lan,

Yes, in theory bareback-processing should be able to handle this kind of data. Cluster coordinates are determined for read 1 only, so it will be sufficient if you shuffle the first 7 bp of read 1 towards the back and leave read 2 untouched (the bareback-script will do just that).
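For the run layout described in the question (read 1 on cycles 1-81, read 2 on cycles 82-162, a 7 bp barcode at the start of each read), the resulting cycle order can be sketched in Python as follows. The function name is hypothetical and the sketch only lists cycle numbers; it does not touch any image files.

```python
def paired_end_cycle_order(read1_len, read2_len, n_low):
    """Cycle order with the first n_low cycles of read 1 moved to the end of read 1.

    Read 2 cycles keep their original positions, because cluster coordinates
    are determined from read 1 only.
    """
    if not 0 <= n_low < read1_len:
        raise ValueError("n_low must be >= 0 and smaller than read1_len")
    read1 = list(range(n_low + 1, read1_len + 1)) + list(range(1, n_low + 1))
    read2 = list(range(read1_len + 1, read1_len + read2_len + 1))
    return read1 + read2


order = paired_end_cycle_order(read1_len=81, read2_len=81, n_low=7)
print(order[:3], order[74:81], order[81:84])
# [8, 9, 10] [1, 2, 3, 4, 5, 6, 7] [82, 83, 84]
```

Cluster identification would then start on original cycle 8, while the read 2 barcode on cycles 82-88 stays where it is.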

Good luck!
Old 09-21-2011, 01:54 PM   #12
Horacio G
Junior Member
 
Location: Little Rock

Join Date: Nov 2010
Posts: 1
Smile First try on low-diversity libraries

Hi guys,

I'm trying to run my first flow cell on a GAIIx with low-diversity libraries. I'm still not sure whether to go ahead and save the images and do the post-analysis with bareback (my Illumina rep does not encourage that alternative) or to use the delayed template generation. However, with the latter I don't know if I'll get an early report about the quality of the run (e.g. focusing, intensities).
Any suggestions would be greatly appreciated.

Horacio
Old 09-21-2011, 03:35 PM   #13
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Hi Horacio,

Why am I not surprised that your rep does not recommend anything but using the standard pipeline... If you've got the option to save the images I would definitely vote for that. If you still have the images you can choose to use the standard pipeline, use --image-flags (which is the Illumina deferred cluster calling option) or even bareback processing. However, if you don't save images you will have to go with whatever the standard analysis pipeline gives you, and this can be shocking: 0 sequences in the worst-case scenario, which we experienced several times. This depends heavily on your experimental setup, the number of low-diversity sequences, the cluster density and so on.

If you have further questions don't hesitate to ask via email.

Best,
Felix
Old 09-22-2011, 06:10 AM   #14
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by HESmith View Post
[...] the decision not to save the image files seems a reasonable trade off (although it would be nice to have the option to save). Transferring the images to the server had become the bottleneck for sequencing runs, and the problem was exacerbated when they rolled out the HiSeq. [...]
There is an option to save the images. We tried it out on a recent run using the standard HiSeq run software and v3 chemistry: 6.24 TB of TIFFs for a 2x101+7 run (PE + index). That was only 1 surface of one flow cell, though, so it would be 2x or 4x more for a HiSeq 1000 or HiSeq 2000. Also, we save the runs to an off-site server during the run, not to the console machine itself.

What? You don't have 25 TBs handy to store image data?
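The arithmetic behind the 25 TB figure can be written out as a short Python sketch. It uses only the numbers quoted above and assumes that image volume scales linearly with the number of cycles and surfaces, reading the "2x or 4x" as applying to the HiSeq 1000 and HiSeq 2000 respectively. The function name is hypothetical.

```python
def image_storage_tb(measured_tb, measured_cycles, cycles, surfaces):
    """Scale a measured per-surface image volume linearly to another run."""
    return measured_tb / measured_cycles * cycles * surfaces


MEASURED_TB = 6.24             # one surface of one flow cell, from the post above
MEASURED_CYCLES = 2 * 101 + 7  # two 101-cycle reads plus a 7-cycle index read

per_cycle_gb = MEASURED_TB / MEASURED_CYCLES * 1000  # decimal GB
hiseq1000 = image_storage_tb(MEASURED_TB, MEASURED_CYCLES, MEASURED_CYCLES, surfaces=2)
hiseq2000 = image_storage_tb(MEASURED_TB, MEASURED_CYCLES, MEASURED_CYCLES, surfaces=4)

print(f"{MEASURED_CYCLES} cycles, about {per_cycle_gb:.1f} GB per cycle and surface")
print(f"2x: {hiseq1000:.2f} TB, 4x: {hiseq2000:.2f} TB")
# 209 cycles, about 29.9 GB per cycle and surface
# 2x: 12.48 TB, 4x: 24.96 TB
```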

What are you going to do with it? You can tell the instrument console (a Dell server running Windows Vista) to reprocess the image data. But that is going to be a slow process. You probably don't want to tie up your instrument that long reprocessing a run. Maybe clone the console server into a virtual machine and run it off-site?

--
Phillip
Old 09-22-2011, 06:21 AM   #15
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 622
Default

Thanks for this piece of information, Phillip. So far the general consensus seemed to be that it is absolutely impossible to store image data (apart from thumbnails) from the HiSeq (and probably the HiScanSQ too) at all. Storing this amount of data, let alone reprocessing a whole flowcell (which would likely take a couple of days), is a whole different matter, though...
Old 09-22-2011, 12:30 PM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

I should add that Illumina deserves a lot of credit for how they handle storing data off-site. We have a gigabit line to our storage server and just mapped that server as a samba share. But, in testing, cutting off the samba connection in the midst of the run did not even generate an error. The instrument appeared to seamlessly fail over to saving the data locally. Then, when the connection was restored the data was gradually transferred.

I do not know that the Dell console computer even has enough storage locally to save the raw .TIFFs. I think it has a RAID of four 2 TB drives. So, to a first approximation, that would mean that saving the image data for a run is simply impossible without an off-site storage solution.

--
Phillip
Old 11-01-2011, 06:17 AM   #17
rallapag
Junior Member
 
Location: UK

Join Date: Aug 2008
Posts: 2
Default

Manual configuration of the "Configurable Template Generation Cycles" option in SCS2.9/RTA1.9: is there any update on this? Has anyone tried it?

I do understand Felix Krueger's explanation on this (comparison to "bareback")
Quote:
Originally Posted by fkrueger View Post
After talking to our Illumina rep we don't have any reason to believe that the "Configurable Template Generation Cycles" option is any different from the previous unofficial option "--image-flags". Thus, I would imagine that the basecalls would still suffer from mysteriously bad qualities, see the Supplementary Figures linked in the first post of this thread.
Can the following post from Simon Andrews explain the predominance of bad qualities from the Illumina pipeline?
Quote:
Originally Posted by simonandrews View Post
...........one of the Illumina filters looks for deteriorating quality and then flags all remaining bases with low quality scores, even if the quality later improves (the so-called 'killer Bs'). You can turn this off using the undocumented parameter NO-EAMSS when processing, which will preserve the original qualities. If you then trim your sequences to just your bases of interest then the qualities there should be OK.
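To see how much of a read has been masked in this way, one can measure the trailing run of 'B' quality characters. The Python sketch below assumes the Phred+64 quality encoding of Illumina pipeline 1.5 and later, in which masked bases carry the character 'B'; the function names and the example read are hypothetical.

```python
def masked_tail_length(qual, marker="B"):
    """Number of trailing quality characters equal to `marker`."""
    return len(qual) - len(qual.rstrip(marker))


def trim_masked_tail(seq, qual, marker="B"):
    """Drop the trailing run of masked bases from a read and its qualities."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    keep = len(qual) - masked_tail_length(qual, marker)
    return seq[:keep], qual[:keep]


seq = "ACGTACGTAC"
qual = "hhhgfBBBBB"  # assumed Phred+64 encoding; 'B' marks the masked tail

print(masked_tail_length(qual))     # 5
print(trim_masked_tail(seq, qual))  # ('ACGTA', 'hhhgf')
```

A lane in which most reads are masked from shortly after the barcode onwards would be consistent with the filter described in the quote, although this check alone cannot prove it.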

Last edited by rallapag; 11-01-2011 at 07:06 AM.
Old 01-24-2012, 06:29 PM   #18
amitra
Junior Member
 
Location: galveston

Join Date: May 2009
Posts: 8
Default

Hi Guys!
I am very interested in this problem, since we have been unsuccessful in sequencing a sample with a custom barcode 'TCGAGGTAGTA' attached, probably due to lack of diversity.

I may get a one time opportunity to visit and potentially work with our sequencing facility for re-sequencing of our sample.
We would have an opportunity to change some default options on RTA software.

Can the experts please guide us as to the best way to go with a HiSeq 2000 sequencer with a single flow cell 50bp run?

I will carry five 3 TB hard drives with me to the facility and, if needed, configure them as RAID0 or RAID5 on the RTA machine to store the TIFF files.
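As a rough sanity check on the drive plan, the Python sketch below compares the usable capacity of five 3 TB drives with an image volume extrapolated from the 6.24 TB per surface for 209 cycles reported earlier in this thread. The extrapolation assumes linear scaling with cycles, two surfaces for a single flow cell, 50 cycles with no index read, and similar imaging settings. These are all assumptions, and the function name is hypothetical.

```python
def raid_usable_tb(n_drives, drive_tb, level):
    """Usable capacity for RAID 0 (striping) or RAID 5 (one drive's worth of parity)."""
    if level == 0:
        return n_drives * drive_tb
    if level == 5:
        if n_drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (n_drives - 1) * drive_tb
    raise ValueError("only RAID levels 0 and 5 are handled here")


# Extrapolated from 6.24 TB for 209 cycles on one surface (earlier post).
estimate_tb = 6.24 / 209 * 50 * 2  # 50 cycles, 2 surfaces: assumed run layout

raid0 = raid_usable_tb(5, 3, level=0)
raid5 = raid_usable_tb(5, 3, level=5)
print(f"estimated images: {estimate_tb:.2f} TB")
print(f"RAID0: {raid0} TB, RAID5: {raid5} TB")
# estimated images: 2.99 TB
# RAID0: 15 TB, RAID5: 12 TB
```

Under these assumptions either layout would hold the images; RAID5 gives up one drive of capacity but survives a single drive failure, which RAID0 does not.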

Tags
cluster identification, data loss, low complexity, low diversity, sequence bias
