  • OLB - basecalling on HiSeq data extremely slow

    Hi,

    I am not really happy with the current external basecalling pipeline.

    Hardware: 48 cores (AMD Opteron Processor 6176 SE), 256GB RAM
    System: Linux, 64bit, make 3.82, OLB 1.9.3

    No other load on the machine, data is stored locally (xfs).

    Apart from the fact that 'make' is probably the wrong tool for this task,
    what are your experiences with external basecalling using OLB?

    For a current HiSeq run (2x101bp+index, roughly 160K CIF files) 'make'
    needs approx. 4 days to read the makefile and its includes (preparing the
    execution of the actual basecalling step) using one CPU. Once it has
    finished, the actual basecalling starts multithreaded and usually completes
    overnight (a few hours).
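
    As a rough sanity check on the "roughly 160K CIF files" figure, the count can be estimated, assuming one CIF file per tile per cycle. The index read length (7 cycles), the 8 lanes, and the 96 tiles per lane are assumptions for illustration; only the 2x101 read length comes from the post.

```python
# Read length is from the post; everything marked "assumed" is illustrative.
read_cycles = 101 + 101   # two reads of 101 cycles each (from the post)
index_cycles = 7          # assumed index read length
lanes = 8                 # assumed: all lanes of the flow cell are processed
tiles_per_lane = 96       # assumed tile layout


def expected_cif_files(cycles, lanes, tiles_per_lane):
    """Estimate the CIF count, assuming one file per tile per cycle."""
    return cycles * lanes * tiles_per_lane


total = expected_cif_files(read_cycles + index_cycles, lanes, tiles_per_lane)
print(total)  # 160512
```

    Under these assumptions the estimate lands close to the reported number, so 'make' would have to track on the order of 160K input files.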

    Any comments on that?

    thanks,
    Sven
    Last edited by sklages; 03-05-2013, 07:16 AM. Reason: typo

  • #2
    I am mildly curious as to why you are still using OLB off-line?

    Re-read your message again. Just the process of reading the makefiles is taking 4 days? How big are these files?
    Last edited by GenoMax; 03-05-2013, 09:09 AM.



    • #3
      Originally posted by GenoMax:
      I am mildly curious as to why you are still using OLB off-line?

      Re-read your message again. Just the process of reading the makefiles is taking 4 days? How big are these files?
      We sometimes have libraries with low complexity (e.g. with barcodes at the beginning of read 1, or some ChIP-seq experiments);
      using OLB with a "sufficiently complex" reference lane from the same run usually results in an increase in yield and quality.

      Yes, 'make' takes a long time resolving the dependencies in the makefile (and its includes). This seems to be a very CPU-bound process;
      it does not produce much I/O. But as this is still the preparation for the actual parallelized basecalling step, it runs on one CPU only.

      Have a look at the makefile(s); not "debuggable" :-)

      Reading (stat()) all the CIFs (160K files) and building the data structure for the jobs to be executed gets incredibly slow the more files are read.

      This might be a problem of 'make' (though it is not the right tool IMHO) or maybe a bug in the makefile and its dependencies/includes. Hard to tell :-)
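
      One way to narrow this down is to time plain stat() calls over the same files, outside of 'make'. The following is a minimal sketch, not part of OLB; the function name `time_stat_calls`, the `.cif` suffix filter, and the batch size are assumptions for illustration. If the per-batch times stay flat here, the slowdown is more likely inside 'make' or the makefile than in the file system.

```python
import os
import time


def time_stat_calls(root, suffix=".cif", batch=10000):
    """stat() every file under root whose name ends with suffix.

    Returns (count, timings), where timings holds the seconds spent on
    each completed batch of `batch` files.
    """
    timings = []
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            os.stat(os.path.join(dirpath, name))
            count += 1
            if count % batch == 0:
                # Record the time for this batch and restart the clock.
                now = time.perf_counter()
                timings.append(now - start)
                start = now
    return count, timings
```

      Calling `time_stat_calls` on the intensities directory of a run and comparing the first and last batch times would show whether stat() itself degrades as more files are read.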

      I was just wondering how others are experiencing OLB in terms of speed; especially the make part.

      Of course, any other ideas how to deal with "low complexity" lanes are welcome.
      I was usually going with Bustard/OLB with good results in these cases.

      I am looking forward to RTA being usable externally; hopefully Illumina does not employ 'make' for this task as well ...

      ... analyzing the makefiles, the structure, the order, the executables, and then writing an independent wrapper for external basecalling .. is then probably the best way to go



      • #4
        Originally posted by sklages:
        We sometimes have libraries with low complexity (e.g. with barcodes at the beginning of read 1, or some ChIP-seq experiments);
        using OLB with a "sufficiently complex" reference lane from the same run usually results in an increase in yield and quality.

        One can designate a lane on the flowcell to be the "control" while setting up a run on the HiSeq so RTA can handle strange samples on the machine itself. We do this for HiSeq runs that have amplicons on the flowcells. If this is the only reason you are running OLB off-line then you should not need to.



        • #5
          Originally posted by GenoMax:
          One can designate a lane on the flowcell to be the "control" while setting up a run on the HiSeq so RTA can handle strange samples on the machine itself. We do this for HiSeq runs that have amplicons on the flowcells. If this is the only reason you are running OLB off-line then you should not need to.
          You mean "Control Lane" in the "Advanced" window of the flowcell setup?
          What are you loading in that lane? An arbitrary non-amplicon library or PhiX?
          So "control lane" is equivalent to "reference lane" in external basecalling in terms of actions to be taken?

          In case of an arbitrary non-amplicon library:

          How are "strange samples" distinguished from "good samples"? E.g., sometimes I get a fair amount of data (of fair quality) from non-complex samples with RTA;
          in this case I would decide to run OLB, and I often get a far better yield (and quality). RTA may decide that these samples "behave normally".


          If I choose a control lane for a flowcell, is it used for all other lanes regardless of quality or just for those lanes where the sample seems to behave strange?
          The "User Guide" is not very helpful on that point.

          Thanks for your comments.



          • #6
            Originally posted by sklages:
            You mean "Control Lane" in the "Advanced" window of the flowcell setup?
            I am not on the experimental side of things but I am certain that is correct. I can ask if you need confirmation.

            Originally posted by sklages:
            What are you loading in that lane? An arbitrary non-amplicon library or PhiX?
            So "control lane" is equivalent to "reference lane" in external basecalling in terms of actions to be taken?
            Most lab directors are loath to waste a whole lane on PhiX, so generally a sample lane is selected that is known to contain "normal" genomic DNA libraries.

            Originally posted by sklages:
            If I choose a control lane for a flowcell, is it used for all other lanes regardless of quality or just for those lanes where the sample seems to behave strange?
            The "User Guide" is not very helpful on that point.

            Thanks for your comments.
            AFAIK, RTA uses that "control" lane for the entire flowcell, equivalent to designating a "--control-lane".

            Here is the Illumina recommendation from this page:

            Can I designate a control lane using HiSeq Control Software (HCS)?

            HCS allows you to designate a control lane during the run setup steps. Generally, you do not need to designate a control lane if the sequence you are analyzing has a balanced genome. In the case of an unbalanced or skewed base composition (e.g., bisulfite-treated samples) a control lane is recommended. This is not equivalent to a PhiX spike-in.



            • #7
              Originally posted by GenoMax:
              AFAIK, RTA uses that "control" lane for the entire flowcell, equivalent to designating a "--control-lane".

              Here is the Illumina recommendation from this page:
              Thanks, this is something I was looking for (in the user manual).

              So this is probably the way to go for most scenarios with low-complexity samples.

              Nevertheless I'd like to speed up the 'make' process for offline basecalling ;-)

              Thanks for your valuable hints.
