  • Hardware for de novo assembly of 1 Gb genomes

    I recently started work in a lab where I have a budget of approximately $20,000 to spend on a workstation. This will be used for analysis of RNA-seq data (using Trinity and, later, the Tuxedo suite), as well as for de novo genome assembly. Our model organism has a genome size of approximately 1 Gb.

    I am new to genome assembly, but have read through the manuals of ALLPATHS-LG, SOAPdenovo, and Velvet and gleaned what I could regarding RAM use. Considering the size of our genome, I drafted a hardware configuration, with the largest consideration being the amount of RAM.

    The configuration I've settled upon is as follows:
    • 4x Intel Xeon E5-4620 2.20 GHz 8-core CPUs
    • 512 GB memory (32 x 16 GB 1333 MHz RDIMMs)
    • 4x 1.2 TB 10K SAS drives in a RAID 5 array


    Is this sufficient? Or is additional hardware recommended / needed for de novo assembly of a genome of this scale?

    I am open to any suggestions or criticisms regarding this configuration. The budget can potentially be expanded, if justifiable for the aims mentioned above.

    Thanks!

  • #2
    Minia supposedly uses only 5.7 GB of RAM to assemble a human genome. That means you could do it on a low-cost 16 GB box.



    • #3
      Having said that, I would still suggest at least 64 GB of RAM, since 32 GB of it can go toward super-fast RNA mapping with RNA-STAR.



      • #4
        Apparently you have a lot of money to burn, so 512 GB makes sense. Remember to use about 400 GB of it as a ramdisk; that will speed up your applications further. Good luck!
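
        A minimal sketch of that staging step, assuming a Linux box where /dev/shm is a tmpfs mount (the default on most distributions, sized to roughly half of RAM unless the admin mounts a larger one); the function name and file paths are illustrative, not from any particular assembler:

```python
import shutil
from pathlib import Path

def stage_on_ramdisk(read_files, ramdisk="/dev/shm", headroom=0.9):
    """Copy input files onto a tmpfs ramdisk if they fit.

    Returns the staged paths, or None if the files would not leave
    enough free space (headroom is kept for intermediate files).
    """
    ramdisk = Path(ramdisk)
    needed = sum(Path(f).stat().st_size for f in read_files)
    free = shutil.disk_usage(ramdisk).free
    if needed > free * headroom:
        return None
    staged = []
    for f in read_files:
        dst = ramdisk / Path(f).name
        shutil.copy(f, dst)  # stage a copy; the original stays on disk
        staged.append(dst)
    return staged
```

        Point the assembler's working directory at the staged paths; anything on the ramdisk vanishes on reboot, so copy final outputs back to real disk.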



        • #5
          @ymc: 'anth' is talking about RNA-seq assembly (well, he is if he brings in Trinity), which is a whole different game than genome assembly. I wouldn't run Trinity with less than 128 GB. The guideline for Trinity is "1 GB per 1 million pairs", which is a rough guide (and the requirement can be reduced via digital normalization), but assuming that 'anth' is doing a large, complex project with multiple samples, he will need lots of memory.
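
          As a back-of-envelope check, the "1 GB per 1 million pairs" rule above can be turned into a quick estimate. The 4x reduction assumed for digital normalization below is an illustrative guess, not a benchmark; actual savings vary by dataset:

```python
def trinity_ram_estimate_gb(read_pairs, normalized=False):
    """RAM estimate from the rough '1 GB per 1 million pairs' guide.

    The 4x reduction applied when normalized=True is an assumed,
    illustrative factor for digital normalization.
    """
    effective_pairs = read_pairs * (0.25 if normalized else 1.0)
    return effective_pairs / 1_000_000

# e.g. a 200-million-pair multi-sample project:
print(trinity_ram_estimate_gb(200_000_000))        # → 200.0
print(trinity_ram_estimate_gb(200_000_000, True))  # → 50.0
```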

          A report from TACC, where they had 1 TB of memory to work with, indicates that Trinity speeds up by about 25% when using a ramdisk. Not much of an improvement, in my opinion.

          512 GB may be overkill but 16 GB (or even 64 GB) will be too small.

          My critique is that you need more disk space. My recommendation is at least 2 TB per Illumina HiSeq plate (8 lanes, i.e., 300-400 Gbases) that you will have sequenced. You could have both a fast working space (your four 10K drives) and a slower storage space.
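
          Scaled linearly per lane, the "2 TB per 8-lane plate" guide above works out to a quick estimator (the helper name is made up for illustration):

```python
def hiseq_disk_estimate_tb(lanes, tb_per_plate=2.0, lanes_per_plate=8):
    """Working-disk estimate from the rough '2 TB per 8-lane HiSeq
    plate' guide, scaled linearly per lane."""
    return lanes * tb_per_plate / lanes_per_plate

print(hiseq_disk_estimate_tb(8))   # one full plate → 2.0
print(hiseq_disk_estimate_tb(24))  # three plates   → 6.0
```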

          Once again I am assuming a large project, multiple samples, multiple Illumina (or whatever sequencer) runs, etc.



          • #6
            Another point about specifying minimal hardware is that you limit yourself to a small set of programs. E.g., buy a 16 GB machine and then you can only use Minia, which *supposedly* uses only 5.7 GB. Other assemblers are out of the question.

            I haven't used Minia (but should try it out), and looking at the web site I suspect the assembly it generates is not very good (N50: 1,156 bases; longest contig: 18.6 kb). Perhaps a different assembler would work better? Say MaSuRCA (which I am just reading about). However, MaSuRCA requires:

            Hardware requirements. The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are supported. The general guidelines for hardware configuration are as follows:
            * Bacteria (up to 10Mb): 16Gb RAM, 8+ cores, 10Gb disk space
            * Insect (up to 500Mb): 128Gb RAM, 16+ cores, 1Tb disk space
            * Avian/small plant genomes (up to 1Gb): 256Gb RAM, 32+ cores, 1Tb disk space
            * Mammalian genomes (up to 3Gb): 512Gb RAM, 32+ cores, 3Tb disk space
            * Plant genomes (up to 30Gb): 1Tb RAM, 64+cores, 10Tb disk space
            If I had a 64 GB system, I would be out of luck even trying out MaSuRCA.

            So, 'anth', stick with your 512 GB system but do get more disk space.



            • #7
              IMO, 512 GB of RAM may be too much. Most assemblers I've used stay under 512 GB of RAM on a ~2 Gbp genome, but not all. So you could buy the same machine but equip it with only 256 GB of RAM and expand if you need to.

              Also, I question the need for the E5-4620. Those are very expensive compared to the E5-2600 series. So if you only need 256 GB of RAM, you could drop to a much less expensive machine with, for example, the E5-2660 v2.

              This also depends on your needs outside of assembly, as some of the tasks you'll be doing involve different core-count vs. core-speed trade-offs. In general, the E5-4620s will be slower per core than the E5-2600s.

              For example, I've noticed that SOAPdenovo doesn't use more than about 20 cores all that well. So you'd rather have 16 fast cores than 32 slower cores that don't get utilized well for it. Other programs get stuck in single-threaded parts of the assembly, or in other downstream analysis, for large chunks of time.

              If you can get your hands on some data from another 1 Gbp genome that was sequenced in a similar way to what you're planning, and do a couple of tests with the different programs on a cluster, you might get a better sense of what will work for you. Because right now you're buying a $20K machine, and you might really be fine with an $8K machine.

              Also, I'd avoid RAID5. Some programs create tons and tons of small intermediate files, and all the parity calculations that requires in RAID5/6 will greatly slow your machine down regardless of the CPUs you put in it. I'd say go RAID10, or have two RAIDs: one RAID0 for scratch and one RAID1 for archive (i.e., for raw FASTQs, critical intermediate files that are computationally expensive to recreate, and final assemblies). With this scheme, I'd avoid 10K SAS drives and just go for 7,200 RPM 3 or 4 TB SATAs. You could buy enterprise drives, but for most people the failure-rate differences aren't worth the extra cost, since the important stuff is going to be in RAID1 anyway.

              For example, I think you'd be pretty happy with a 4x3 TB RAID10 for ~6 TB of archive space and a 3x3 TB RAID0 for ~9 TB of scratch. Combined you'd have 15 TB of space, which should be plenty.
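
              One way to sanity-check the capacity arithmetic for layouts like these is a small helper (raw capacities only; filesystem formatting trims a few percent more):

```python
def usable_tb(n_drives, drive_tb, level):
    """Raw usable capacity for common RAID levels."""
    if level == "raid0":                  # pure striping, no redundancy
        return n_drives * drive_tb
    if level == "raid1":                  # all drives mirror one copy
        return drive_tb
    if level == "raid5":                  # one drive's worth of parity
        return (n_drives - 1) * drive_tb
    if level == "raid10":                 # striped mirror pairs
        return (n_drives // 2) * drive_tb
    raise ValueError(f"unknown RAID level: {level}")

# e.g. four mirrored 3 TB drives plus a three-drive RAID0 scratch:
print(usable_tb(4, 3, "raid10") + usable_tb(3, 3, "raid0"))  # → 15
```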



              • #8
                Originally posted by Wallysb01:
                IMO, 512 GB of RAM may be too much. Most assemblers I've used are under 512GB of RAM in a 2Gbp genome, but not all. So, you could buy the same machine but only equip it with 256GB of RAM and expand if you need it.
                Not to detract from your excellent comments, but I should point out that the money may not be there for expansion. It depends on what type of funding environment 'anth' is in. E.g., if it is "here is $20,000 and that is it forever and, by the way, please spend the entire $20K", then 'anth' will want to go for a high-end (and, to us, overly provisioned) system that will handle all needs -- not just 'most' -- for many years.

                As for disk space, we really need to know the size of the projects 'anth' is going to be involved in. Personally, being at a sequencing facility, I can burn through 15 TB within a couple of months -- but that would be about 8 HiSeq runs' worth of data. I don't know what 'anth' is going to look at.



                • #9
                  Originally posted by westerman:
                  Not to detract from your excellent comments, I should point out that the money may not be there for expansion. It depends on what type of funding environment 'anth' is in. E.g., if it is "here is $20,000 and that is it forever and, by the way, please spend the entire $20K" then 'anth' will want to go for a high-end (and to us overly provisioned) system that will handle all needs -- not just 'most' -- for many years.
                  Very good point. If you "have to" spend it all in one shot, by all means get that 4 socket system with 512GB of RAM.

                  As for disk space, we really need to know the size of the project that 'anth' is going to be involved in. Personally, being at a sequencing facility, I can burn through 15 TB within a couple of months. But that would be about 8 hiSeq runs' worth of data. Don't know what 'anth' is going to look at.
                  I was assuming something in the realm of one HiSeq flow cell for the current project, but also with room to add another couple of flow cells' worth over time (thinking the machine would be used for at least 3 years), and with the need for lots of working space.

                  I'd also like to point out that if you're going over ~15 TB of needed space, you'll just need a dedicated storage solution. But given that anth was originally talking about 5 TB of space, and assuming he's not far off on that, I doubt he really needs one.

                  Also, I completely missed that this was transcriptome assembly in addition to the genome. If that's the case, I even more strongly discourage RAID5. For a big Trinity run without digital normalization, the Chrysalis step will take forever, as it creates a number of files that grows with the total read count in the assembly. From personal experience on my own RAID5, this will bring your computer to its knees. Even clusters, which often have nontrivial latency to their storage systems running who knows what kind of file system, will get bogged down by this. It's just the kind of I/O nightmare that is perfect for a local RAID0. If you're doing one big assembly and that's it, I suppose you can wait it out, but if you're tinkering with various assembly strategies on a large dataset, or many different datasets, run times can expand quickly.



                  • #10
                    "~15TB of needed space, you'll just need a dedicated storage solution."

                    Not necessarily. 4 TB drives are readily available and cheap (less than $200 at Newegg: http://www.newegg.com/Product/Produc...CE&PageSize=20 ).

                    15 TB / 4 TB = 3.75, so you'll need only 4 drives.

                    4 * $200 = $800.

                    Cases with 4 bays are readily available: http://www.newegg.com/Product/Produc...der=BESTMATCH# -- see the internal 3.5" drive bay option.

                    You'll of course want to back this stuff up, so count on about 8 similarly priced external drives.



                    • #11
                      Originally posted by Richard Finney:
                      "~15TB of needed space, you'll just need a dedicated storage solution."

                      Not necessarily. 4TB drives are readily available and cheap ( less than $200 at newegg : http://www.newegg.com/Product/Produc...CE&PageSize=20 )

                      15TB/4 = 3.7, so you'll only need 4 drives.

                      4*$200 =$800.
                      Sure, you can get to 15 TB in RAID0 with a reasonable number of drives, but across your whole data solution you'll need some redundancy. Most workstations/single-blade servers have 6-8 bays. One of those is usually your boot drive, and once you account for redundancy, you really only have 3-5 disks of usable space for data. The formatted space on a 4 TB disk is going to be about 3.7 TB, so you aren't effectively getting above ~19 TB of workable space in a single workstation.

                      Now, I guess some might want the whole thing in RAID0, get up over 20 TB, and just back up a lot. But then you really need an ironclad backup, which costs more (in both $$ and effort) than running some redundancy on your own system.



                      • #12
                        Minia is not the only low-memory genome assembler. See this comparison:
                        A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.



                        • #13
                          Well, it is still possible to do transcriptome assembly with Trinity on 64 GB of RAM; you just have to use several tricks, like increasing the minimum k-mer count for Inchworm.

                          I also agree that Minia is the best possible choice. We also tried SGA, but it was really slow in our hands.

                          I still don't understand what the topic author is planning to do: de novo transcriptome, genome, or just novel transcripts & splicing (transcriptome + known genome)?

                          P.S. We tried to order similar hardware from Dell, and it was really expensive, around $30K. Perhaps you should consider 1) lowering your RAM, or 2) choosing another chipset, e.g., 2x Intel Xeon E5-26xx.
                          Last edited by mikesh; 10-19-2013, 02:53 AM.



                          • #14
                            Thank you for all of the responses. I truly appreciate it.

                            One of the other options we had considered was going with a pair of Intel Xeon E5-2650 v2 2.6 GHz CPUs. However, in the end, the price ended up being not too far off from a configuration with four E5-4620 2.20 GHz CPUs. This is where it gets a bit muddy -- our IT department has a strong preference for a Dell machine, and an R720 (which supports 2x 26xx-series Xeons) needs 32 GB LRDIMMs to fit 512 GB of memory, while an R820 (supporting the 46xx series of CPUs) can go to 768 GB with 16 GB RDIMMs.

                            In the end, we trade a bit of clock speed for twice as many cores, plus the headroom to potentially go to 768 GB of memory.

                            It seems like a reasonable compromise, given the discussion here, and I realize that parts of a pipeline that drop to a single thread will be somewhat slower on the 4-CPU configuration I mentioned than on the dual 26xx. However, tasks that can take advantage of all the cores should be faster.

                            There has been some question as to my applications. Thus far, I have been using Trinity on a colleague's machine. With the datasets I've been looking at so far (Illumina RNA-seq data from 2-3 lanes), it's truly incredible how much memory it can consume, as I'm sure you're more aware than I am!

                            The original plan had been to use a draft genome that someone else had assembled, but it has become clear that this genome is not sufficiently well assembled to be of use for RNA-seq analysis. As such, following the sequencing of a few more genomic libraries, this machine will be used to assemble a few 1 Gb genomes.

                            I have taken the storage concerns to heart, and will certainly implement those in the final configuration of the machine.

                            Thanks again!

                            anth



                            • #15
                              Originally posted by anth:
                              Thank you for all of the responses. I truly appreciate it.

                              One of the other options we had considered was going with a pair of Intel Xeon E5-2650v2 2.6GHz CPUs. However, in the end, the price ended up being not too far off from a configuration with four E5-4620 2.20GHz CPUs. This is where it gets a bit muddy - our IT department has a strong preference for a Dell machine, and an R720 (which supports 2x 26xx series Xeons) needs 32 GB LRDIMMs to be able to fit 512 GB of memory, while an R820 (supporting the 46xx series of CPUs) could go to 768 GB with 16 GB RDIMMs.
                              And the RAM is the reason the cost is about the same. Those 32 GB modules are much more expensive per GB than the 16 GB modules. So that's really the cost of going from 256 GB of RAM to 512 GB or above, as IMO the 46xx series really doesn't offer much of an advantage over the 26xx v2.

                              It seems like a reasonable compromise, given the discussion here, and I realize that parts of a pipeline that drop to a single thread will be slightly slower on the original 4 CPU configuration I mentioned than on the dual 26xx. However, tasks that can take advantage of all cores should be faster.
                              There is a little more at work here than the straight clock-speed difference. The 2600 v2s (Ivy Bridge) will be faster than the 4600s (Sandy Bridge) clock for clock. Also, a 4-socket system has some additional latency for communication among all the processors and RAM compared with a 2-socket system. Plus, there is turbo boosting in lightly threaded workflows: the 2650 v2 tops out at 3.6 GHz while the 4620 is at 2.6. Put it all together and single-threaded performance might be 50% faster on the 2650.
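
                              The "~50% faster" figure can be reproduced as back-of-envelope arithmetic. The turbo clocks are the ones quoted above; the ~10% clock-for-clock (IPC) gain for Ivy Bridge over Sandy Bridge is an assumed round number, not a measurement:

```python
clock_ratio = 3.6 / 2.6  # turbo-limited single-thread clocks (2650 v2 vs 4620)
ipc_ratio = 1.10         # assumed Ivy Bridge vs Sandy Bridge per-clock gain
speedup = clock_ratio * ipc_ratio
print(f"~{(speedup - 1) * 100:.0f}% faster single-threaded")
```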


                              Altogether though, your choice is reasonable, just presenting the alternatives.
                              Last edited by Wallysb01; 10-19-2013, 02:15 PM.
