
  • Corona Lite Running Times?

    The manual suggests one day given enough CPUs. Anyone with experience here? What is the speed of your CPUs? How many do you use? How long does it take in your hands?

  • #2
    Unless I have missed something in the new release, each of the corona lite programs runs on a single processor. However, like many bioinformatics programs, the corona lite programs can be run in 'embarrassingly parallel' mode, i.e., break down your reference sequence by chromosome or other convenient segment and/or split the SOLiD file into enough parts to use up your processors.

    The matching part of the corona lite pipeline has 6 parts, of which the 1st and 3rd can be split up. The other 4 parts are strictly single-processor, but they are really just file copies and thus can be fairly fast.

    As for overall time it depends, obviously, on the size of your SOLiD data set -- those 14-20 GB files take a while to toss around -- and your reference sequence. The time also goes up in a non-linear fashion depending on how many mismatches you wish to take into consideration.

    A big consideration is having enough disk space, both temporary and permanent, to handle the files.

    Since I usually work with partially assembled genomes (i.e., lots of contigs) or CDS or EST projects, it is quite often the case that I split up the reference into 64 parts and use all 64 CPUs that I have at my disposal. The ultimate speed of the CPUs really doesn't matter that much. Obviously the faster the better. But I would concentrate more on disk speed, physical memory, and exactly how many mismatches you want. 1 mismatch is trivial. 3 (the recommended setting) less so. 6 or more is almost impossible on any sizable dataset.
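As a rough illustration of the reference-splitting idea above, here is a minimal Python sketch that distributes the records of a multi-FASTA reference across a fixed number of parts so that each part holds a similar number of bases. It is not part of corona lite; the function names (`read_fasta`, `split_balanced`), the toy contigs, and the greedy longest-first balancing strategy are all assumptions made for the example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_balanced(records: Iterable[Record], n_parts: int) -> List[List[Record]]:
    """Assign records, longest first, to whichever part is currently smallest."""
    bins: List[List[Record]] = [[] for _ in range(n_parts)]
    heap = [(0, i) for i in range(n_parts)]  # (bases assigned so far, part index)
    heapq.heapify(heap)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        total, idx = heapq.heappop(heap)
        bins[idx].append((header, seq))
        heapq.heappush(heap, (total + len(seq), idx))
    return bins


# Hypothetical toy reference with four contigs of 10, 6, 5 and 3 bases.
DEMO_FASTA = """>contig1
ACGTACGTAC
>contig2
ACGTAC
>contig3
ACGTA
>contig4
ACG
"""

parts = split_balanced(read_fasta(DEMO_FASTA.splitlines()), n_parts=2)
part_sizes = [sum(len(seq) for _, seq in part) for part in parts]
```

Each part could then be written to its own FASTA file and matched as a separate single-processor job. With `n_parts=64` this would mirror the 64-way split described in the post, assuming the reference has at least that many contigs.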

    And, yes, I would say 1-2 days of processing given enough CPUs. My recent work on the bee assembly 4 took about 36 hours to go through the matching steps. But I didn't break down the chromosomes or the SOLiD data set and so only used about 1/4 of my CPUs. There are other people on the machine and despite my hoggish nature I did want to play nicely (for once!). SNP calling added time to that process.
    Last edited by westerman; 01-12-2009, 01:48 PM.



    • #3
      thanks for the extensive reply.

      with your experience, how long does a standard run alignment take?
      I'm just estimating order of magnitude... hours? days? weeks?

      Is it feasible to match on 1 CPU for a whole run? do I need to have access to servers/clusters with more CPUs?



      • #4
        It should be possible to match using 1 CPU given enough memory (4 GB). Given my experience I would expect running times of about 3 weeks for a non-paired mapping of a SOLiD data set to the reference bee genome. SNP calling would probably take an extra week. But I may be pessimistic.

        In any case it will take time, and you had better hope that your computer stays up and running during the process. Last week I had two instances of the computer or file server crashing on me. They were rare instances that should not occur, but irritating nevertheless.
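The single-CPU estimate in this post can be cross-checked against the 36-hour figure from #2 with some back-of-the-envelope arithmetic. The sketch below assumes that "about 1/4 of my CPUs" means 16 of the 64 CPUs, that all 16 were busy for the full 36 hours, and that the work scales perfectly down to one CPU. None of that is stated in the thread, so treat the result as an order-of-magnitude check only.

```python
def cpu_hours(wall_hours: float, cpus: int) -> float:
    """Total CPU time, assuming every CPU is busy for the whole wall-clock time."""
    return wall_hours * cpus


TOTAL_CPUS = 64                  # CPUs available, from post #2
cpus_used = TOTAL_CPUS // 4      # assumption: "about 1/4" means exactly 16
wall_hours = 36                  # matching time reported in post #2

total_cpu_hours = cpu_hours(wall_hours, cpus_used)  # 36 * 16 = 576
single_cpu_days = total_cpu_hours / 24              # 24 days
single_cpu_weeks = single_cpu_days / 7              # a bit under 3.5 weeks
```

Under those assumptions the matching alone comes to roughly three and a half weeks on one CPU, which is the same order of magnitude as the three weeks quoted above.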



        • #5
          wow! three weeks! thanks for letting me know... we really need Bowtie to do colorspace soon... hopefully cutting it down to hours.

          have you used MAQ for colorspace? Any ideas how long this takes on 1 CPU? Is it also weeks?



          • #6
            You could try ZOOM!, it does CS alignments and is pretty fast. Have not compared it to Corona but at least it requires much less disk space.



            • #7
              thanks. any ideas how long it takes (order of magnitude) for one SOLiD run, 15-20G, to the human genome (on 1 CPU)?



              • #8
                No, it depends on how much memory you have, how long the sequences are, how many mismatches you allow, etc. I am planning to do some tests soon, so let me know what you think a good benchmark would be.



                • #9
                  Great. I look forward to the results.



                  • #10
                    For the human genome and 50G of reads, it took about 36 hours for mapping (1 CPU, 6G memory).



                    • #11
                      Thanks for the response.

                      is this ZOOM or Corona?



                      • #12
                        corona-lite



                        • #13
                          for future reference,
                          the size of both your dataset and your reference is a HUGE factor here: anywhere from hours to weeks...



                          • #14
                            You could also try a program I have authored called BFAST: the Blat-like Fast Accurate Search Tool. You can find download instructions at:



                            Nils Homer



                            • #15
                              If your time is precious, try ISAS. It is native colorspace and, as far as we know, it's the fastest. If I'm wrong and there is a faster solution, please enlighten me!


                              100 million 25mers on one computer in 30 minutes.
                              3G human reference, 2 mismatches.
                              Results identical to corona (just 100 times faster) and same format

                              See the ISAS thread for more info.
