  • sklages
    Senior Member
    • May 2008
    • 628

    OLB - basecalling on HiSeq data extremely slow

    Hi,

    I am not really happy with the current external basecalling pipeline.

    Hardware: 48 cores (AMD Opteron Processor 6176 SE), 256GB RAM
    System: Linux, 64bit, make 3.82, OLB 1.9.3

    No other load on the machine, data is stored locally (xfs).

    Apart from the fact that 'make' is probably the wrong tool for this task,
    what are your experiences with external basecalling using OLB?

    For a current HiSeq run (2x101bp+index, roughly 160K CIF files), 'make'
    needs approx. 4 days to read the makefile and its includes (preparing the
    execution of the actual basecalling step) using one CPU. Once that has
    finished, the actual basecalling starts multithreaded and usually finishes
    overnight (a few hours).

    Any comments on that?

    thanks,
    Sven
    Last edited by sklages; 03-05-2013, 07:16 AM. Reason: typo
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    I am mildly curious as to why you are still using OLB off-line?

    Re-read your message again. Just the process of reading the makefiles is taking 4 days? How big are these files?
    Last edited by GenoMax; 03-05-2013, 09:09 AM.

    • sklages
      Senior Member
      • May 2008
      • 628

      #3
      Originally posted by GenoMax View Post
      I am mildly curious as to why you are still using OLB off-line?

      Re-read your message again. Just the process of reading the makefiles is taking 4 days? How big are these files?
      We sometimes have libraries with low complexity (e.g. with barcodes at the beginning of read 1, or some ChIP-seq experiments);
      using OLB with a "sufficiently complex" reference lane from the same run usually results in an increase in data and quality.

      Yes, 'make' takes a long time resolving the dependencies in the makefile (and its includes). This seems to be a very CPU-bound process;
      it does not produce much I/O. But as this is still the preparation of the actual parallelized basecalling step, it runs on one CPU only.

      Have a look at the makefile(s); not "debuggable" :-)

      Reading (stat()) all the CIFs (160K files) and building the data structure for the jobs to be executed gets progressively slower as more files are read.
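
      One way to check whether the stat() calls themselves slow down as more files are read, rather than make's own bookkeeping, is to time them in batches outside of make. The following is only a minimal sketch: the function name `time_stat_calls`, the `.cif` suffix filter, and the batch size are assumptions for illustration and not part of OLB.

```python
import os
import time


def time_stat_calls(root, suffix=".cif", batch=10000):
    """Stat every file under `root` whose name ends with `suffix`.

    Returns (number of files seen, list of seconds spent per full batch).
    If the per-batch times stay flat, the filesystem is unlikely to be
    the reason the preparation step keeps getting slower.
    """
    timings = []
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            os.stat(os.path.join(dirpath, name))
            count += 1
            if count % batch == 0:
                now = time.perf_counter()
                timings.append(now - start)
                start = now
    return count, timings
```

      If the batch times are roughly constant, that would fit the observation that the step is CPU-bound and produces little I/O, and would point at the dependency handling instead.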

      This might be a problem of 'make' (though it is not the right tool IMHO) or maybe a bug in the makefile and its dependencies/includes. Hard to tell :-)

      I was just wondering how others are experiencing OLB in terms of speed; especially the make part.

      Of course, any other ideas how to deal with "low complexity" lanes are welcome.
      I was usually going with Bustard/OLB with good results in these cases.

      I am looking forward to RTA being usable externally; hopefully Illumina does not employ 'make' for this task as well ...

      ... analyzing the makefiles, their structure, the order of the steps, and the executables, and then writing an independent wrapper for external basecalling, is probably the best way to go.
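
      As a rough illustration of what such a wrapper could look like once the per-job command lines are known, the sketch below runs a prepared list of commands in parallel and collects their exit codes. It is an assumption-laden outline only: `run_all` is a hypothetical name, and the actual OLB executables and their arguments would first have to be extracted from the makefiles.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, workers=4):
    """Run each command (a list of argv strings) as a subprocess.

    Returns the exit codes in the same order as `commands`.
    Threads are sufficient here because each worker only waits
    on an external process.
    """
    def run_one(argv):
        return subprocess.run(argv, check=False).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

      The job list is built once up front, so the dependency-resolution phase of 'make' is skipped entirely; the trade-off is that ordering between dependent steps has to be handled by the caller.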

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Originally posted by sklages View Post
        We sometimes have libraries with low complexity (e.g. with barcodes at the beginning of read 1, or some ChIP-seq experiments);
        using OLB with a "sufficiently complex" reference lane from the same run usually results in an increase in data and quality.

        One can designate a lane on the flowcell to be the "control" while setting up a run on the HiSeq so RTA can handle strange samples on the machine itself. We do this for HiSeq runs that have amplicons on the flowcells. If this is the only reason you are running OLB off-line then you should not need to.

        • sklages
          Senior Member
          • May 2008
          • 628

          #5
          Originally posted by GenoMax View Post
          One can designate a lane on the flowcell to be the "control" while setting up a run on the HiSeq so RTA can handle strange samples on the machine itself. We do this for HiSeq runs that have amplicons on the flowcells. If this is the only reason you are running OLB off-line then you should not need to.
          You mean "Control Lane" in the "Advanced" window of the flowcell setup?
          What are you loading in that lane? An arbitrary non-amplicon library or PhiX?
          So "control lane" is equivalent to "reference lane" in external basecalling in terms of actions to be taken?

          In case of an arbitrary non-amplicon library:

          How are "strange samples" distinguished from "good samples"? E.g., sometimes I get a fair amount of data (of fair quality) from non-complex samples with RTA;
          in this case I would decide to run OLB, and often I get a far better yield (and quality). RTA may decide that these samples "behave normally" ..


          If I choose a control lane for a flowcell, is it used for all other lanes regardless of quality, or just for those lanes where the sample seems to behave strangely?
          The "User Guide" is not very helpful on that point ..

          Thanks for your comments.

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Originally posted by sklages View Post
            You mean "Control Lane" in the "Advanced" window of the flowcell setup?
            I am not on the experimental side of things but I am certain that is correct. I can ask if you need confirmation.

            Originally posted by sklages View Post
            What are you loading in that lane? An arbitrary non-amplicon library or PhiX?
            So "control lane" is equivalent to "reference lane" in external basecalling in terms of actions to be taken?
            Most lab directors are loath to waste a whole lane on PhiX, so generally a sample lane is selected that is known to contain "normal" genomic DNA libraries.

            Originally posted by sklages View Post
            If I choose a control lane for a flowcell, is it used for all other lanes regardless of quality, or just for those lanes where the sample seems to behave strangely?
            The "User Guide" is not very helpful on that point ..

            Thanks for your comments.
            AFAIK, RTA uses that "control" lane for the entire flowcell, equivalent to designating a "--control-lane".

            Here is the Illumina recommendation from this page:

            Can I designate a control lane using HiSeq Control Software (HCS)?

            HCS allows you to designate a control lane during the run setup steps. Generally, you do not need to designate a control lane if the sequence you are analyzing has a balanced genome. In the case of an unbalanced or skewed base composition (e.g., bisulfite-treated samples) a control lane is recommended. This is not equivalent to a PhiX spike-in.

            • sklages
              Senior Member
              • May 2008
              • 628

              #7
              Originally posted by GenoMax View Post
              AFAIK, RTA uses that "control" lane for the entire flowcell, equivalent to designating a "--control-lane".

              Here is the Illumina recommendation from this page:
              Thanks, this is something I was looking for (in the user manual).

              So this would be the way for probably most scenarios with low complexity samples.

              Nevertheless I'd like to speed up the 'make' process for offline basecalling ;-)

              Thanks for your valuable hints.
