Hello everyone.
We've just released Seal (http://biodoop-seal.sourceforge.net/), a
Hadoop-based distributed short read alignment and analysis toolkit. Seal
currently includes tools for read alignment (based on BWA), duplicate read
removal, and sorting read mappings. Seal scales well and easily handles
terabytes of data. If you’re aligning read data sets larger than a few hundred
MB and you have a cluster of computers (anything from a small one of 4 or 5
nodes up to hundreds of nodes), then Seal might be for you.
On a 16-node Hadoop cluster, with 8 cores and 16 GB of RAM per node, we have
measured a throughput of 13 Gbp / hour in map+rmdup mode and 19 Gbp / hour in
map-only mode. Scalability tests show that the throughput per node is
maintained as the number of nodes increases up to 128.
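To make those figures concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Seal: the function names are made up for illustration, and it assumes the per-node throughput measured on 16 nodes holds unchanged for any cluster size up to the 128 nodes we tested. Real run times depend on your hardware, reads, and reference.

```python
# Back-of-the-envelope estimate based on the figures quoted above.
# Assumption: per-node throughput stays constant up to 128 nodes.

MEASURED_NODES = 16
MAX_TESTED_NODES = 128

# Cluster-wide throughput measured on the 16-node cluster, in Gbp per hour.
THROUGHPUT_GBP_PER_HOUR = {
    "map+rmdup": 13.0,
    "map-only": 19.0,
}


def per_node_throughput(mode):
    """Return the measured throughput per node, in Gbp per hour."""
    return THROUGHPUT_GBP_PER_HOUR[mode] / MEASURED_NODES


def estimated_hours(dataset_gbp, nodes, mode="map+rmdup"):
    """Estimate the run time in hours for a data set of dataset_gbp Gbp."""
    if not 1 <= nodes <= MAX_TESTED_NODES:
        raise ValueError("node count is outside the tested range (1-128)")
    return dataset_gbp / (per_node_throughput(mode) * nodes)


if __name__ == "__main__":
    for mode in THROUGHPUT_GBP_PER_HOUR:
        print(mode, per_node_throughput(mode), "Gbp / hour / node")
    # Hypothetical 130 Gbp data set on the 16-node cluster.
    print(estimated_hours(130, 16), "hours")
```

For example, under these assumptions a hypothetical 130 Gbp data set would take about 10 hours in map+rmdup mode on the 16-node cluster, since 130 / 13 = 10.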
We have been working on Seal to support the needs of the CRS4 Sequencing
laboratory, which operates 6 Illumina sequencing machines and thus generates
lots of data to process. Our regular workflow was being overwhelmed despite
the additional computers made available to it, and it was regularly
overloading our Lustre shared storage volume. Now all data processing at the
lab starts with Seal, with very positive results in terms of speed and
maintenance effort.
In case you were wondering, Hadoop (http://hadoop.apache.org/) is an open
source, distributed, and robust MapReduce framework for data-intensive
processing, providing a distributed computing system and a distributed file
system.
We're eager to get people to try our new tool. Please visit the Seal web site
(http://biodoop-seal.sourceforge.net/) and feel free to contact me or the
other Seal authors if you have any questions or problems.
--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel: +39 0709250452