Old 10-17-2013, 05:31 PM   #1
anth
Member
 
Location: USA

Join Date: Jul 2011
Posts: 18
Question Hardware for de novo assembly of 1 Gb genomes

I recently started work in a lab where I have a budget of approximately $20,000 to spend on a workstation. This will be used for analysis of RNA-seq data (using Trinity and, later, the Tuxedo suite), as well as for de novo genome assembly. Our model organism has a genome size of approximately 1 Gb.

I am new to genome assembly, but I have read through the manuals for ALLPATHS-LG, SOAPdenovo, and Velvet and gleaned what I could about their RAM use. With the size of our genome in mind, I drafted a hardware configuration in which the largest consideration was the amount of RAM.

The configuration I've settled upon is as follows:
  • 4 × Intel Xeon E5-4620 2.20 GHz 8-core CPUs
  • 512 GB memory (32 × 16 GB 1333 MHz RDIMMs)
  • 4 × 1.2 TB 10K SAS drives in a RAID 5 array

Is this sufficient? Or is additional hardware recommended / needed for de novo assembly of a genome of this scale?

I am open to any suggestions or criticisms regarding this configuration. The budget can potentially be expanded if that is justifiable for the aims mentioned above.

Thanks!
Old 10-17-2013, 07:21 PM   #2
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Minia supposedly uses only 5.7 GB of RAM to assemble a human genome. That means you could do it on a low-cost 16 GB box.
Old 10-17-2013, 07:23 PM   #3
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Having said that, I would still suggest at least 64 GB of RAM, because 32 GB can then be used for super-fast RNA-seq mapping with RNA-STAR.
Old 10-17-2013, 07:32 PM   #4
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Apparently you have a lot of money to burn, so 512 GB makes sense. Remember to use about 400 GB of it as a ramdisk; it will speed up your applications by another fold. Good luck!
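
If it helps, pointing a tool's scratch space at the ramdisk is trivial. A minimal Python sketch, assuming you have already mounted a tmpfs at /mnt/ramdisk (the mount point, size, and file names here are just examples):

Code:
import shutil
import tempfile

# Assumed ramdisk mount point, e.g. created beforehand with something like:
#   mount -t tmpfs -o size=400G tmpfs /mnt/ramdisk
RAMDISK = "/mnt/ramdisk"

# Stage the input reads on the ramdisk and keep temporary files there,
# so the I/O-heavy intermediate steps never touch the spinning disks.
with tempfile.TemporaryDirectory(dir=RAMDISK) as scratch:
    shutil.copy("reads_1.fastq", scratch)
    shutil.copy("reads_2.fastq", scratch)
    # ... run the assembler/mapper with its working directory set to scratch ...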
Old 10-18-2013, 08:08 AM   #5
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

@ymc: 'anth' is talking about RNA-seq assembly (well, he is if he brings in Trinity), which is a whole different game from genome assembly. I wouldn't run Trinity with less than 128 GB. The guideline for Trinity is "1 GB per 1 million read pairs", which is only a rough guide (and the requirement can be reduced via digital normalization), but assuming that 'anth' is doing a large, complex project with multiple samples, he will need lots of memory.

A report from TACC, where they had 1 TB of memory to work with, indicates that Trinity is sped up by about 25% when using a ramdisk. Not much of an improvement, in my opinion.

512 GB may be overkill but 16 GB (or even 64 GB) will be too small.

My critique is that you need more disk space. My recommendation is at least 2 TB per Illumina HiSeq flow cell (8 lanes) that you will have sequenced (i.e., roughly 300-400 Gbases). You could have both a fast working space (your four 10K drives) and a slower storage space.
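
To put rough numbers on both of those rules of thumb (just a sketch; the per-pair and per-flow-cell figures are the rough guides mentioned above, and the example project size is made up):

Code:
def trinity_ram_gb(read_pairs_millions):
    """Rough Trinity RAM estimate: ~1 GB per 1 million read pairs,
    before any reduction from digital normalization."""
    return read_pairs_millions * 1.0

def working_disk_tb(n_flowcells, tb_per_flowcell=2.0):
    """Rough working-disk estimate: ~2 TB per HiSeq flow cell
    (8 lanes, roughly 300-400 Gbases)."""
    return n_flowcells * tb_per_flowcell

# Example: a multi-sample project of ~200 million read pairs over 2 flow cells.
print(trinity_ram_gb(200))    # ~200 GB of RAM for Trinity
print(working_disk_tb(2))     # ~4 TB of fast working space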

Once again I am assuming a large project, multiple samples, multiple Illumina (or whatever sequencer) runs, etc.
Old 10-18-2013, 09:08 AM   #6
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Another point about specifying minimal hardware is that you limit yourself to a small set of programs. E.g., buy a 16 GB machine and you can only use Minia, which *supposedly* uses only 5.7 GB. Other assemblers are out of the question.

I haven't used Minia (but should try it out), and looking at the web site I get the impression that the assembly it generates is not very good (N50: 1,156 bases; longest contig: 18.6 kb). Perhaps a different assembler would work better? Say, MaSuRCA (which I am just reading about). However, MaSuRCA requires:

Quote:
Hardware requirements. The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are supported. The general guidelines for hardware configuration are as follows:
* Bacteria (up to 10Mb): 16Gb RAM, 8+ cores, 10Gb disk space
* Insect (up to 500Mb): 128Gb RAM, 16+ cores, 1Tb disk space
* Avian/small plant genomes (up to 1Gb): 256Gb RAM, 32+ cores, 1Tb disk space
* Mammalian genomes (up to 3Gb): 512Gb RAM, 32+ cores, 3Tb disk space
* Plant genomes (up to 30Gb): 1Tb RAM, 64+cores, 10Tb disk space
If I had a 64 GB system I would be out of luck even trying out MaSuRCA.

So, 'anth', stick with your 512 GB system but do get more disk space.
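
If you want to sanity-check a configuration against a genome size, those guidelines are easy to turn into a quick lookup. A sketch that just encodes the table quoted above:

Code:
# MaSuRCA hardware guidelines from the quote above, keyed by the upper
# bound on genome size in Mb: (RAM in GB, cores, disk in TB).
MASURCA_TIERS = [
    (10,      16,  8,  0.01),   # bacteria (10 GB disk)
    (500,    128, 16,  1.0),    # insect
    (1000,   256, 32,  1.0),    # avian / small plant
    (3000,   512, 32,  3.0),    # mammalian
    (30000, 1024, 64, 10.0),    # large plant
]

def masurca_requirements(genome_mb):
    """Return (ram_gb, cores, disk_tb) for the smallest tier that fits."""
    for max_mb, ram_gb, cores, disk_tb in MASURCA_TIERS:
        if genome_mb <= max_mb:
            return ram_gb, cores, disk_tb
    raise ValueError("genome larger than the published guidelines")

# A 1 Gb genome lands in the 256 GB / 32+ core / 1 TB tier.
print(masurca_requirements(1000))   # (256, 32, 1.0)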
Old 10-18-2013, 10:18 AM   #7
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

IMO, 512 GB of RAM may be too much. Most assemblers I've used come in under 512 GB of RAM on a ~2 Gbp genome, but not all of them do. So you could buy the same machine but equip it with only 256 GB of RAM and expand it later if you need to.

Also, I question the need for the E5-4620. Those are super expensive compared to the E5-2600 series. So, if you only need 256 GB of RAM, you could drop to a much less expensive machine with, for example, E5-2660 v2s.

This also depends on your needs outside of assembly, since some of the things you'll be doing involve different core-count vs. core-speed trade-offs. With the E5-4620s you'll be slower per core than with the E5-2600s, in general.

For example, I've noticed that SOAPdenovo doesn't use more than about 20 cores all that well. So you'd rather have 16 fast cores than 32 slower cores that don't get utilized very well by it. Other programs get stuck in single-threaded parts of the assembly or other downstream analyses for large chunks of time.

If you can get your hands on some data from another 1 Gbp genome that was sequenced in a way similar to what you're planning, and do a couple of tests with the different programs on a cluster, you might get a better sense of what will work for you. Right now you're buying a $20K machine when you might really be fine with an $8K machine.

Also, I'd avoid RAID5. Some programs create tons and tons of small intermediate files, and all the parity calculations that requires in RAID5/6 will greatly slow your machine down regardless of the CPUs you put in it. I'd say go RAID10, or have two arrays: one RAID0 for scratch and one RAID1 for archive (i.e., for raw FASTQs, critical intermediate files that are computationally expensive to recreate, and final assemblies). With this scheme, I'd avoid 10K SAS drives and just go for 7200 RPM 3 or 4 TB SATAs. You could buy enterprise drives, but for most people the difference in failure rates isn't really worth the extra cost, since the important stuff is going to be in RAID1 anyway.

For example, I think you'd be pretty happy with a 4 × 3 TB RAID10 for ~6 TB of usable space and a 3 × 3 TB RAID0 for ~9 TB of scratch space. Combined you'd have ~15 TB of space, which should be plenty.
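
If it helps to see the capacity trade-offs in one place, here's a rough sketch (the ~7% formatting overhead and the drive counts are just the example numbers from above):

Code:
def usable_tb(n_drives, drive_tb, level):
    """Approximate usable capacity for common RAID levels,
    assuming ~7% is lost to formatting/filesystem overhead."""
    formatted = n_drives * drive_tb * 0.93
    if level == "raid0":
        return formatted                      # striping, no redundancy
    if level in ("raid1", "raid10"):
        return formatted / 2                  # mirrored (pairs)
    if level == "raid5":
        return formatted * (n_drives - 1) / n_drives   # one drive of parity
    raise ValueError(level)

# The split suggested above:
print(usable_tb(4, 3, "raid10"))   # ~5.6 TB of redundant archive space
print(usable_tb(3, 3, "raid0"))    # ~8.4 TB of fast scratch space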
Old 10-18-2013, 12:58 PM   #8
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by Wallysb01 View Post
IMO, 512 GB of RAM may be too much. Most assemblers I've used are under 512GB of RAM in a 2Gbp genome, but not all. So, you could buy the same machine but only equip it with 256GB of RAM and expand if you need it.
Not to detract from your excellent comments, but I should point out that the money may not be there for expansion. It depends on what type of funding environment 'anth' is in. E.g., if it is "here is $20,000, that is it forever, and, by the way, please spend the entire $20K", then 'anth' will want to go for a high-end (and, to us, overly provisioned) system that will handle all needs -- not just 'most' -- for many years.

As for disk space, we really need to know the size of the project that 'anth' is going to be involved in. Personally, being at a sequencing facility, I can burn through 15 TB within a couple of months, but that would be about 8 HiSeq runs' worth of data. I don't know what 'anth' is going to be looking at.
Old 10-18-2013, 01:49 PM   #9
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by westerman View Post
Not to detract from your excellent comments, I should point out that the money may not be there for expansion. It depends on what type of funding environment 'anth' is in. E.g., if it is "here is $20,000 and that is it forever and, by the way, please spend the entire $20K" then 'anth' will want to go for a high-end (and to us overly provisioned) system that will handle all needs -- not just 'most' -- for many years.
Very good point. If you "have to" spend it all in one shot, by all means get that 4 socket system with 512GB of RAM.

Quote:
As for disk space, we really need to know the size of the project that 'anth' is going to be involved in. Personally, being at a sequencing facility, I can burn through 15 TB within a couple of months. But that would be about 8 hiSeq runs' worth of data. Don't know what 'anth' is going to look at.
I was assuming something in the realm of one HiSeq flow cell for the current project, but also with room to add another couple of flow cells' worth over time (thinking the machine would be used for at least 3 years), and with the need for lots of working space.

I'd also like to just point out that if you're going over ~15TB of needed space, you'll just need a dedicated storage solution. But given that anth was originally talking about 5TB of space, and assuming he's not so far off on that, I doubt he really needs a dedicated storage system.

Also, I completely missed that this was transcriptome assembly in addition to the genome. If that's the case, I discourage RAID5 even more strongly. For a big Trinity run without digital normalization, the Chrysalis step will take forever, as it creates a number of files that grows with the total read count in the assembly. From personal experience on my own RAID5, this will bring your computer to its knees. Even clusters, which often have nontrivial latency to storage systems running who knows what kind of file system, get bogged down by this. It's just the kind of I/O nightmare that is perfect for a local RAID0. If you're doing one big assembly and that's it, I suppose you can wait it out, but if you're tinkering with various assembly strategies on a large dataset, or with many different datasets, run times can expand quickly.
Old 10-18-2013, 03:14 PM   #10
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

"~15TB of needed space, you'll just need a dedicated storage solution."

Not necessarily. 4TB drives are readily available and cheap ( less than $200 at newegg : http://www.newegg.com/Product/Produc...CE&PageSize=20 )

15 TB / 4 TB = 3.75, so you'll only need 4 drives.

4*$200 =$800.

Cases with 4 bays are readily available : http://www.newegg.com/Product/Produc...der=BESTMATCH# - see the "internal 3.5" drive bay option".

You'll of course want to back this stuff up, so count on about 8 similarly priced external drives.
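
For the record, the arithmetic (prices are the ballpark figures above):

Code:
import math

target_tb  = 15
drive_tb   = 4
price_each = 200   # ballpark street price per 4 TB drive

data_drives   = math.ceil(target_tb / drive_tb)   # 4 drives
backup_drives = 2 * data_drives                   # crude 2x allowance for external backups
print(data_drives,   data_drives * price_each)    # 4 800   -> ~$800
print(backup_drives, backup_drives * price_each)  # 8 1600  -> ~$1,600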
Old 10-18-2013, 04:06 PM   #11
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by Richard Finney View Post
"~15TB of needed space, you'll just need a dedicated storage solution."

Not necessarily. 4TB drives are readily available and cheap ( less than $200 at newegg : http://www.newegg.com/Product/Produc...CE&PageSize=20 )

15TB/4 = 3.7, so you'll only need 4 drives.

4*$200 =$800.
Sure, you can get to 15 TB in RAID0 with a reasonable number of drives, but across your whole data solution you'll need some redundancy. Most workstations/single-blade servers have 6-8 bays. One of those is usually your boot drive, and once you account for redundancy, you really only have 3-5 disks of usable space for data. The formatted space on a 4 TB disk is going to be about 3.7 TB, so you aren't effectively getting above ~19 TB of workable space in a single workstation.

Now, I guess some might want to run the whole thing RAID0 and get over 20 TB, then just back up a lot. But then you really need an ironclad backup, which costs more (in both $$ and effort) than running some redundancy on your own system.
Old 10-18-2013, 05:48 PM   #12
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Minia is not the only low-memory genome assembler. See this comparison
http://www.plosone.org/article/info%...l.pone.0075505
Old 10-19-2013, 02:43 AM   #13
mikesh
Member
 
Location: California

Join Date: Jul 2012
Posts: 29
Default

Well, it is still possible to do transcriptome assembly with Trinity on 64 GB of RAM; you just have to use several tricks, like increasing the minimum k-mer count for Inchworm.

I also agree that Minia is the best possible choice. We also tried SGA, but it was really slow in our hands.

I still don't understand what the thread author is planning to do: de novo transcriptome assembly, genome assembly, or just novel transcripts and splicing (transcriptome + known genome)?

P.S. We tried to order similar hardware from Dell, and it was really expensive, around $30k. Perhaps you should consider 1) lowering your RAM, or 2) choosing a different platform, e.g. 2 × Intel Xeon E5-26xx.

Last edited by mikesh; 10-19-2013 at 02:53 AM.
Old 10-19-2013, 10:37 AM   #14
anth
Member
 
Location: USA

Join Date: Jul 2011
Posts: 18
Default

Thank you for all of the responses. I truly appreciate it.

One of the other options we had considered was a pair of Intel Xeon E5-2650 v2 2.6 GHz CPUs. However, in the end, the price ended up being not too far off from a configuration with four E5-4620 2.20 GHz CPUs. This is where it gets a bit muddy - our IT department has a strong preference for a Dell machine, and an R720 (which supports 2 × 26xx-series Xeons) needs 32 GB LRDIMMs to fit 512 GB of memory, while an R820 (which supports the 46xx series) can go to 768 GB with 16 GB RDIMMs.

In the end, we trade a bit of clock speed for twice as many cores, and the space to potentially go to 768 GB of memory.

It seems like a reasonable compromise, given the discussion here. I realize that parts of a pipeline that drop to a single thread will be somewhat slower on the 4-CPU configuration I mentioned than on the dual 26xx, but tasks that can take advantage of all the cores should be faster.

There has been some question as to my applications. Thus far, I have been using Trinity on a colleague's machine. With the datasets I've been looking at so far (Illumina RNA-seq data from 2-3 lanes), it's truly incredible how much memory it can consume, as I'm sure you're even more aware than I am!

The original plan had been to use a draft genome that someone else had assembled, but it has become clear that this genome is not sufficiently well assembled to be useful for RNA-seq analysis. As such, after a few more genomic libraries are sequenced, this machine will be used for the assembly of a few ~1 Gb genomes.

I have taken the storage concerns to heart, and will certainly implement those in the final configuration of the machine.

Thanks again!

anth
Old 10-19-2013, 11:49 AM   #15
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Quote:
Originally Posted by anth View Post
Thank you for all of the responses. I truly appreciate it.

One of the other options we had considered was going with a pair of Intel Xeon E5-2650v2 2.6GHz CPUs. However, in the end, the price ended up being not too far off from a configuration with four E5-4620 2.20GHz CPUs. This is where it gets a bid muddy - our IT department has a strong preference for a Dell machine, and an R720 (which supports 2x 26xx series Xeons) needs 32 GB LRDIMMs to be able to fit 512 GB of memory, while an R820 (supporting the 46xx series of CPUs) could go to 768 GB with 16 GB RDIMMs.
And the RAM is the reason why the cost is about the same. Those 32GB modules are much more expensive per GB than the 16GB modules. So, that's really the cost of going from 256GB of RAM to 512 or above, as IMO the 46xx series really doesn't offer much of an advantage over the 26xx v2.

Quote:
It seems like a reasonable compromise, given the discussion here, and I realize that parts of a pipeline that drop to a single thread will be slightly slower on the original 4 CPU configuration I mentioned than on the dual 26xx. However, tasks that can take advantage of all cores should be faster.
There is a little more than the straight clock-speed difference at work here. The 2600 v2s (Ivy Bridge) will be faster than the 4600s (Sandy Bridge) clock for clock. Also, 4-socket systems have some additional latency from all the processors talking to each other and to RAM compared with a 2-socket system. Plus, there is turbo boost in lightly threaded workloads: the 2650 v2 tops out at 3.6 GHz while the 4620 tops out at 2.6 GHz. Put it all together and single-threaded performance might be 50% faster on the 2650.
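
For what it's worth, the back-of-the-envelope version of that guess (the clock-for-clock uplift is an assumption, not a benchmark):

Code:
# Single-threaded comparison using the turbo clocks quoted above.
e5_2650v2_turbo = 3.6    # GHz
e5_4620_turbo   = 2.6    # GHz
ipc_uplift      = 1.05   # assumed ~5% Ivy Bridge vs. Sandy Bridge gain per clock

speedup = (e5_2650v2_turbo / e5_4620_turbo) * ipc_uplift
print(f"~{(speedup - 1) * 100:.0f}% faster single-threaded")   # ~45%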


Altogether, though, your choice is reasonable; I'm just presenting the alternatives.

Last edited by Wallysb01; 10-19-2013 at 02:15 PM.
Old 10-19-2013, 02:20 PM   #16
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

Something to consider, too, is that the 4600 v2 is due out in Q1 2014. So, depending on when you have to spend your money, or when you need to get the work started, it may be worth waiting a few months if you're set on the 4600 series. At the same price levels you'll probably see at least 2 additional cores, another 0.1-0.2 GHz, and maybe a little more range in the turbo boost going from the 4600 to the 4600 v2. So the 4620 equivalent might be more like a 10-core 2.3 GHz part with turbo to 2.8 GHz.
Old 10-19-2013, 07:59 PM   #17
mike.t
Member
 
Location: Spain

Join Date: Mar 2010
Posts: 36
Default

I'll just add my $0.02 worth -

Consider increasing your storage by several fold. That machine may last you several years, but your data storage needs will grow over time. You also need to consider data backups and the hardware you may need for them. If you have access to a cluster or some other shared big-memory machine, then you can always do your biggest jobs there. It's convenient to be able to do all of your work on your own machine, but it's really inconvenient to have to continually shuffle files around because your main file system is too small. It's also really inconvenient to lose data because of a hardware failure. Make sure you have a reliable off-site backup.
Old 10-20-2013, 08:22 AM   #18
anth
Member
 
Location: USA

Join Date: Jul 2011
Posts: 18
Default

Thanks for your input, Wallysb01. I will certainly not go with RAID5. As it happens, the colleague's machine I have been using has both a RAID1 and a RAID0 array as you describe, and that has worked nicely for Trinity. I will certainly up the storage, likely with a few 4 TB 3.5" SATA drives.

You provide a valid reason to consider 2 × 2600 v2 instead of 4 × 4600, especially since the higher-end 2600 v2s (e.g. the E5-2697 v2) are available with up to 12 cores. It then becomes a 24-faster-cores vs. 32-slower-cores situation...

Once again, thank you everyone.
Old 10-22-2013, 06:26 PM   #19
anth
Member
 
Location: USA

Join Date: Jul 2011
Posts: 18
Default

Hello again,

I have yet another consideration. Looking at various configurations, the lower price of the AMD Opteron CPUs quickly becomes apparent, in comparison to Xeon 2600v2 and 4600 CPUs.

Is there any reason to shy away from an Opteron 6300 series CPU for the aforementioned applications?

I could go to a 4-CPU system, get 512 GB of RAM at a more reasonable price (as I'd be able to use 32 × 16 GB DIMMs instead of the pricier 16 × 32 GB), and still come in with a far less expensive machine...

And have no fear, I've heeded the advice and I'm going with several 4 TB drives. One question there - is RAID cache in the form of an LSI Nytro MegaRAID card particularly useful for assembly and mapping?

Thanks again.
Old 10-22-2013, 10:45 PM   #20
Wallysb01
Senior Member
 
Location: San Francisco, CA

Join Date: Feb 2011
Posts: 286
Default

The Opterons are substantially slower than the Intel equivalents. Also, don't be fooled by the core counts the Opterons boast: for floating-point calculations (which is much of bioinformatics), those cores are effectively cut in half, because the thing to count is the floating-point module, and each module is shared by two cores. It's a bit like Intel Hyper-Threading.
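
To put numbers on that counting argument (the specific parts are just illustrative):

Code:
def effective_fp_cores(advertised_cores, shared_fp_modules=True):
    """Opteron 6300 'cores' come in two-core modules that share one
    floating-point unit, so for FP-heavy work count modules, not cores."""
    return advertised_cores // 2 if shared_fp_modules else advertised_cores

print(effective_fp_cores(16))                          # a '16-core' Opteron -> 8 FP units
print(effective_fp_cores(8, shared_fp_modules=False))  # an 8-core Xeon      -> 8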

That said, it is a cost-effective way to get a high-memory machine. So if you really want 512 GB of RAM for <$10K (or thereabouts), AMD is generally what you'll be looking at. But be prepared for up to 50% slower performance (single-threaded and lightly threaded work will especially suffer). Those 6300s are also just plain old: AMD didn't really move forward much from the 6200 to the 6300, so it's the same basic technology from about two years ago. Supposedly the Warsaw series of workstation CPUs from AMD will be coming out in early 2014, but who really knows if that's true, and even if they show up, who knows how competitive they will be with the E5-2600 v2s.

As for the other question, I don't have any experience with that kind of set up, so hopefully someone else can jump in.