
  • Loss of data in low-diversity libraries can be recovered by deferred cluster calling

    It has been reported several times that low diversity at the start of Illumina sequencing libraries can lead to large-scale loss of data because the standard pipeline gets the initial cluster identification wrong. Researchers at our institute have generated such low-diversity libraries on numerous occasions, including libraries digested with restriction enzymes prior to sequencing, libraries with custom barcode tags at the start of all sequences, RRBS and so on. We have developed a simple method called barcode-back-processing (bareback-processing for short), which defers cluster identification to later cycles, and are happy to announce that this study has just been published in PLoS One (bareback manuscript).

    With this study we would like to raise awareness within the sequencing community that certain types of experiments can be associated with tremendous problems on the Illumina platform (up to complete failures of entire sequencing lanes), and report a potential fix for this problem. We are aware that constant new software releases and hardware requirements might mean that the solution we present is only temporary, but hopefully our findings will be motivational for Illumina to include a proper solution to low-diversity libraries in one of their future pipeline versions.

    In theory, our method does something comparable to the unofficial and undocumented Illumina option --image-flags, which has been mentioned here on Seqanswers before (undocumented option --image_flags, changes in Illumina Pipeline SCS 2.8/RTA 1.8, multiplexing on HiSeq). We have done a couple of comparisons between different standard Illumina pipeline versions, bareback and --image-flags processing, using real-world datasets from ongoing research projects. In some extreme cases, bareback-processing was able to recover more than 33 million good-quality sequencing reads from a lane which produced literally 0 sequences with the standard Illumina pipeline SCS v2.6/OLB v1.6 processing and around 30,000 sequences with OLB v1.8 processing (see Supplementary Figures 1 and 2 for comparisons between --image-flags and bareback-processing). Interestingly, --image-flags was also very good at recovering extra sequences; however, we found that something odd is going on when this undocumented option is used, as a fairly large percentage of reads contain poor-quality base calls and/or many more Ns in the sequences. In summary, it appears that bareback-processing often produces more high-quality reads than the built-in but undocumented option --image-flags.

    Bareback-processing works by moving the raw cluster image files containing the initially biased sequences to the end of the reads before invoking the Illumina pipeline. After the analysis has been completed, the cycles containing the low-diversity sequences are moved back to the start of the sequence reads. This of course implies that it can only be applied if the actual image files are being stored (so it will not work for HiSeq machines, even though they will still suffer from exactly the same problems!). For Illumina GAIIx machines one either needs to run SCS v2.6 (which allows storing image information) and then reprocess from the images, preferably with OLB 1.8 (although this option will soon be unavailable), or upgrade the instrument PC hardware to at least a T7500. It will be interesting to see what future versions of the Illumina pipeline are going to offer...
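The cycle reshuffling described above can be illustrated with a minimal sketch. It only models the bookkeeping on cycle numbers and does not touch image files or call any Illumina software; the function names `bareback_order` and `restore_order` and the 40-cycle run length are assumptions made for this illustration, while the 12 low-diversity cycles match the test lane mentioned in this post.

```python
def bareback_order(n_cycles, n_low):
    """Return the original cycle numbers in the order the pipeline would see them.

    The first n_low (low-diversity) cycles are moved behind the diverse cycles,
    so cluster identification happens on diverse sequence.
    """
    if not 0 <= n_low < n_cycles:
        raise ValueError("n_low must satisfy 0 <= n_low < n_cycles")
    diverse = list(range(n_low + 1, n_cycles + 1))
    low_diversity = list(range(1, n_low + 1))
    return diverse + low_diversity


def restore_order(per_cycle_values, n_low):
    """Undo bareback_order: move the trailing n_low entries back to the front."""
    values = list(per_cycle_values)
    if n_low == 0:
        return values
    return values[-n_low:] + values[:-n_low]


# Hypothetical 40-cycle single read whose first 12 cycles are low-diversity.
order = bareback_order(40, 12)
print(order[:3], "...", order[-3:])
print(restore_order(order, 12) == list(range(1, 41)))
```

In this sketch the pipeline would see original cycle 13 as its first cycle, and `restore_order` puts everything back in sequencing order once base calling is finished.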

    The images of two lanes will soon be available for download from the SRA archive, one being a well diverse control library (PhiX), the other being a library with very low initial diversity (all sequences are supposed to have the first 12 bp in common) (lanes 1 and 4 from Supplementary Figure 2).

    If you have any questions or comments please get in touch!

  • #2
    bareback

    Actually, I have tried this myself and found it to be true. I just made a perl script to copy all the files and rename them (called it goatfooler). Then, I ran CASAVA and used another script to copy the tags and qscores to the front of the call. After recalling them, my base calling went from utter failure to complete success. I actually tried the undocumented --image-flags option and, just as you described, it didn't work very well. My Illumina rep was utterly baffled by my results. It would be really nice if Illumina provided more documentation of how they do their basecalling. I'm glad to hear that someone else obtained similar results from their analysis.
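The second step mentioned here, copying the tags and quality scores back to the front of each call, might look roughly like the sketch below. It is not the poster's script: `move_tag_to_front`, the example read and the quality string are all made up for illustration, and it assumes the tag ends up as the last `tag_len` positions of each base-called read.

```python
def move_tag_to_front(seq, qual, tag_len):
    """Move the last tag_len bases and quality characters back to the read start."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    if not 0 <= tag_len <= len(seq):
        raise ValueError("tag_len must be between 0 and the read length")
    if tag_len == 0:
        return seq, qual
    return seq[-tag_len:] + seq[:-tag_len], qual[-tag_len:] + qual[:-tag_len]


# Hypothetical read as called after reshuffling: insert first, 4 bp tag last.
called_seq = "GGGTTTTACGT"
called_qual = "IIIIHHH5555"
fixed_seq, fixed_qual = move_tag_to_front(called_seq, called_qual, 4)
print(fixed_seq)
print(fixed_qual)
```

Keeping bases and qualities in one function makes it harder for the two strings to drift out of register.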

    Comment


    • #3
      I strongly agree. This is a big problem and Illumina does not pay attention to it. In general my libraries are OK, since they worked in one test run on a Genome Analyzer. Now I have had 5 RRBS samples sequenced on a HiScanSQ, but all reads are trash due to the problem that is nicely described in your paper. Illumina tech support has not helped so far. For more than a week now they have only kept telling us that there was no technical problem during sequencing. Well, this is true, because the control lane and 2 lanes of ChIP-Seq are OK. Unfortunately, it seems that no high-resolution images were recorded, so I cannot use your software. Thank you very much for your helpful comments so far, Felix!
      Is there anybody else who can tell me what needs to be considered for a successful Illumina HiSeq / HiScanSQ sequencing of RRBS libraries?

      Comment


      • #4
        We just had this same issue with our HiSeq 2000. How can we reanalyze this without the image files? Can this be done using the CIF files?

        Comment


        • #5
          I am afraid this won't work if you don't have the saved images. Did you lose entire lanes or just a certain fraction of it?
          Last edited by fkrueger; 04-27-2011, 10:37 AM.

          Comment


          • #6
            A fraction, the data quality drops off quickly after the barcode.

            It's infuriating that Illumina has done nothing about this when they've known about this for years.

            Comment


            • #7
              I'll be the first to admit that Illumina has made some mistakes (for example, generating a file format that its aligner cannot read), and they could do a better job of advertising the issue, but the decision not to save the image files seems a reasonable trade off (although it would be nice to have the option to save). Transferring the images to the server had become the bottleneck for sequencing runs, and the problem was exacerbated when they rolled out the HiSeq. There are a couple of straightforward non-computational solutions: use custom sequencing primers if there's no diversity, or design multiple balanced barcodes for each sample to introduce diversity.
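One rough way to check whether a barcode set is "balanced" in the sense suggested above is to look at the base composition at each barcode position. This is a minimal sketch: the function names and the barcode sets are invented for illustration, and no threshold for "diverse enough" is implied, since the thread does not give one.

```python
from collections import Counter


def per_cycle_composition(barcodes):
    """Return, for each barcode position, the fraction of A, C, G and T."""
    if len({len(b) for b in barcodes}) != 1:
        raise ValueError("need a non-empty set of barcodes of equal length")
    n = len(barcodes)
    composition = []
    for column in zip(*barcodes):
        counts = Counter(column)
        composition.append({base: counts[base] / n for base in "ACGT"})
    return composition


def worst_cycle(barcodes):
    """Return (cycle, fraction) for the position most dominated by one base."""
    fractions = [max(c.values()) for c in per_cycle_composition(barcodes)]
    worst = max(range(len(fractions)), key=fractions.__getitem__)
    return worst + 1, fractions[worst]


# Hypothetical barcode sets: one balanced, one with no diversity at all.
balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
identical = ["ACGT"] * 4
print(worst_cycle(balanced))
print(worst_cycle(identical))
```

The check assumes the samples are pooled in equal amounts; with uneven pooling the fractions would need to be weighted by each barcode's share of the lane.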

              Comment


              • #8
                Has anyone tried the "Configurable Template Generation Cycles" option in the new SCS2.9/RTA1.9 when running indexed samples on a GAIIx? It allows deferred cluster calling for low-complexity or in-adapter barcoded samples. We have got the script from our FAS but have not tried it yet... wondering if there is anyone out there who has?

                From SCS2.9/RTA1.9 Release notes:
                Configurable Template Generation Cycles: The SCS CIF file generation feature cannot start until RTA has generated the tile templates. This takes 5 cycles after the declared template generation cycle.
                Normally template generation begins on cycle 1 and ends on cycle 5. However template generation requires a diversity of bases in the clusters of the template generation cycles. Some users have custom sample preparation procedures that place arbitrary sequences on the clusters, adapters or indexing "spikes", etc. The required diversity of bases may not be present in this case, and it is possible to delay template generation until the actual sample is being sequenced.

                Comment


                • #9
                  Originally posted by protist View Post
                  Has anyone tried the "Configurable Template Generation Cycles" option in the new SCS2.9/RTA1.9 when running indexed samples on a GAIIx? It allows deferred cluster calling for low-complexity or in-adapter barcoded samples. We have got the script from our FAS but have not tried it yet... wondering if there is anyone out there who has?

                  I would also be interested if anyone had used this "new" option. After talking to our Illumina rep we don't have any reason to believe that the "Configurable Template Generation Cycles" option is any different from the previous unofficial option "--image-flags". Thus, I would imagine that the basecalls would still suffer from mysteriously bad qualities, see the Supplementary Figures linked in the first post of this thread.

                  Not quite but I also think that this option can only be applied to the entire flowcell and not on a per-lane basis, right?

                  Comment


                  • #10
                    how about PE data

                    Hi, I have the same issue with my data. However, my data are paired-end Solexa data: cycles 1-81 are read 1 and cycles 82-162 are read 2, and cycles 1-7 and 82-88 are barcodes with low diversity.
                    Could bareback handle this kind of data?

                    Comment


                    • #11
                      Hi Lan,

                      Yes, in theory bareback-processing should be able to handle this kind of data. Cluster coordinates are determined for read 1 only, so it will be sufficient if you shuffle the first 7 bp of read 1 towards the back and leave read 2 untouched (the bareback-script will do just that).
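For the 2 x 81 layout in the question, the reshuffling of read 1 only can be written out as a small sketch. This models cycle numbers only and is not the bareback script itself; `pe_bareback_order` is a name made up for this illustration.

```python
def pe_bareback_order(read1_len, read2_len, n_low):
    """Cycle order for a paired-end run where only read 1 is reshuffled.

    The first n_low cycles of read 1 go to the end of read 1; read 2 keeps
    its original order because cluster coordinates come from read 1 only.
    """
    if not 0 <= n_low < read1_len:
        raise ValueError("n_low must satisfy 0 <= n_low < read1_len")
    read1 = list(range(n_low + 1, read1_len + 1)) + list(range(1, n_low + 1))
    read2 = list(range(read1_len + 1, read1_len + read2_len + 1))
    return read1 + read2


# Numbers from the question: cycles 1-81 are read 1, 82-162 are read 2,
# and the first 7 cycles of read 1 are the low-diversity barcode.
pe_order = pe_bareback_order(81, 81, 7)
print(pe_order[:3], pe_order[72:84])
```

Original cycles 8-81 come first, cycles 1-7 follow, and cycles 82-162 stay where they were, so the low-diversity start of read 2 is left alone.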

                      Good luck!

                      Comment


                      • #12
                        First try on low-diversity libraries

                        Hi guys,

                        I'm trying to run my first flow cell on a GAIIx with low-diversity libraries. I'm still not sure whether to go ahead and save the images and do the post-analysis with bareback (my Illumina rep does not encourage that alternative) or to use the delayed template generation. However, with the latter I don't know whether I'll get an early report on the quality of the run (i.e. focusing, intensities).
                        Any suggestions would be greatly appreciated.

                        Horacio

                        Comment


                        • #13
                          Hi Horacio,

                          Why am I not surprised that your rep does not recommend anything but using the standard pipeline... If you've got the option to save the images I would definitely vote for that. If you still have the images you can choose to use the standard pipeline, use --image-flags (which is the Illumina deferred cluster calling option) or even bareback processing. However, if you don't save images you will have to go with whatever the standard analysis pipeline gives you, and this can be shocking: 0 sequences in the worst-case scenario, which we experienced several times. This depends heavily on your experimental setup, the number of low-diversity sequences, the cluster density and so on.

                          If you have further questions don't hesitate to ask via email.

                          Best,
                          Felix

                          Comment


                          • #14
                            Originally posted by HESmith View Post
                            [...] the decision not to save the image files seems a reasonable trade off (although it would be nice to have the option to save). Transferring the images to the server had become the bottleneck for sequencing runs, and the problem was exacerbated when they rolled out the HiSeq. [...]
                            There is an option to save the images. We tried it out on a recent run using the standard HiSeq run software and v3 chemistry: 6.24 TB of TIFFs for a 2x101+7 run (PE + index). That was only one surface of one flow cell, though, so it would be 2x or 4x more for a HiSeq 1000 or HiSeq 2000. Also, we save the runs to an offsite server during the run, not to the console machine itself.
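The storage estimate can be worked through with a few lines of arithmetic. This sketch only uses the figures quoted in this post (6.24 TB for one surface, a 2x101+7 run, and the 2x or 4x multipliers); the per-cycle figure assumes the data volume is spread evenly over cycles and uses decimal units (1 TB = 1000 GB).

```python
TB_PER_SURFACE = 6.24            # reported for one surface of one flow cell
TOTAL_CYCLES = 2 * 101 + 7       # paired-end reads plus index read


def image_storage_tb(multiplier):
    """Scale the single-surface figure by the 2x or 4x factor from the post."""
    return TB_PER_SURFACE * multiplier


def gb_per_cycle():
    """Average image volume per cycle for one surface (decimal GB)."""
    return TB_PER_SURFACE * 1000 / TOTAL_CYCLES


for multiplier in (1, 2, 4):
    print(f"{multiplier}x: {image_storage_tb(multiplier):.2f} TB")
print(f"about {gb_per_cycle():.1f} GB per cycle for one surface")
```

The 4x case comes to just under 25 TB.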

                            What? You don't have 25 TBs handy to store image data?

                            What are you going to do with it? You can tell the instrument console (a Dell server running Windows Vista) to reprocess the image data. But that is going to be a slow process. You probably don't want to tie up your instrument that long reprocessing a run. Maybe clone the console server into a virtual machine and run it off-site?

                            --
                            Phillip

                            Comment


                            • #15
                              Thanks for this piece of information, Phillip. So far the general consensus seemed to be that it is absolutely impossible to store image data (apart from thumbnails) from the HiSeq (and probably also the HiScanSQ) at all. Storing this amount of data, let alone reprocessing a whole flowcell (which would likely take a couple of days), is a whole different matter, though...

                              Comment
