  • Announcing Seal 0.1.0: BWA alignment on Hadoop

    Hello everyone.

    We've just released Seal (http://biodoop-seal.sourceforge.net/), a Hadoop-
    based distributed short read alignment and analysis toolkit. Currently Seal
    includes tools for read alignment (based on BWA), duplicate read removal,
    and sorting read mappings. Seal scales, easily handling terabytes of data.
    If you're aligning read data sets of more than a couple of hundred MB, and
    you have a cluster of computers (anywhere from a small one of 4 or 5 nodes
    up to hundreds of nodes), then Seal might be for you.

    On a 16-node Hadoop cluster, with 8 cores and 16 GB of RAM per node, we have
    measured map+rmdup throughputs of 13 Gbp / hour, and 19 Gbp / hour in map-only
    mode. Scalability tests show that the throughput per node is maintained as
    the number of nodes increases through to 128.
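    As a rough worked example of the figures above: dividing the measured
    cluster throughput by the 16 nodes gives the per-node rate, and multiplying
    that rate by a larger node count gives a projection. The projection assumes
    perfectly linear scaling, which is an illustrative simplification of the
    scalability result rather than a measured number; the function names are
    hypothetical.

```python
def per_node_throughput(cluster_gbp_per_hour, nodes):
    """Average throughput contributed by one node, in Gbp/hour."""
    return cluster_gbp_per_hour / nodes


def projected_throughput(cluster_gbp_per_hour, measured_nodes, target_nodes):
    """Project cluster throughput to target_nodes.

    ASSUMPTION: per-node throughput stays exactly constant (ideal linear
    scaling). Real clusters will deviate from this.
    """
    return per_node_throughput(cluster_gbp_per_hour, measured_nodes) * target_nodes


if __name__ == "__main__":
    # Figures quoted in the post: 16 nodes, 13 Gbp/h (map+rmdup), 19 Gbp/h (map-only).
    for label, gbp_per_hour in [("map+rmdup", 13.0), ("map-only", 19.0)]:
        per_node = per_node_throughput(gbp_per_hour, 16)
        at_128 = projected_throughput(gbp_per_hour, 16, 128)
        print(f"{label}: {per_node:.4f} Gbp/h per node, "
              f"about {at_128:.0f} Gbp/h projected at 128 nodes")
```

    Under that assumption, the per-node rates work out to roughly 0.81 and
    1.19 Gbp / hour for the two modes.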

    We have been working on Seal to support the needs of the CRS4 Sequencing
    laboratory, which operates 6 Illumina sequencing machines and thus generates
    lots of data to process. The regular workflow was being overwhelmed despite
    the increased number of computers made available, and it was regularly
    overloading our Lustre shared storage volume. Now all data processing at
    the lab starts with Seal, with very positive results with respect to speed
    and maintenance effort.

    In case you were wondering, Hadoop (http://hadoop.apache.org/) is an open
    source, distributed, and robust MapReduce framework for data-intensive
    processing, providing a distributed computing system and a distributed file
    system.
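    For readers new to the MapReduce model, here is a minimal single-process
    sketch in plain Python of the map, shuffle, and reduce steps, using a word
    count. It does not use the Hadoop or Pydoop APIs and is not how Seal is
    implemented; all function names are hypothetical and only illustrate the
    programming model that Hadoop distributes across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit one (word, 1) pair per word in an input record."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["read alignment", "read sorting"]))
```

    In a real Hadoop job the map and reduce functions run on different nodes,
    and the framework handles the grouping step and data movement.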

    We're eager to get people to try our new tool. Please visit the Seal web site
    (http://biodoop-seal.sourceforge.net/) and feel free to contact me or the
    other Seal authors if you have any questions or problems.

    --
    Luca Pireddu
    CRS4 - Distributed Computing Group
    Loc. Pixina Manna Edificio 1
    Pula 09010 (CA), Italy
    Tel: +39 0709250452

  • #2
    Workflow with Oozie

    Hi Luca,
    Thank you for sharing. Since this runs on a Hadoop cluster, can it be put into an Oozie workflow?
    An Tat

    • #3
      Hi An Tat,

      Although we've never tried it, I don't see why it wouldn't work. Actually, if you do try to use the Seal tools with Oozie, I'd be quite interested in hearing about your experience. At the very least it could be something we add to the documentation.

      Are you already using Oozie?

      Luca

      PS: Although we haven't announced them here on SEQanswers, we've had several releases of Seal since 0.1.0 and have added several tools to the suite. See http://biodoop-seal.sourceforge.net/ for the details.

      • #4
        Hi Luca

        Great tool. Any plans to support Novoalign in Seal? We think it would be a great addition to your toolset, and we can make a good case for adding it as an alternative to BWA, especially for Illumina, Ion Torrent, and SOLiD reads.

        Send me a private message for more details on how we can get you full access to the aligner, which is available at www.novocraft.com.

        • #5
          Hi Luca,

          I'm trying to set up the latest Seal version, 0.3.0. Which versions of Python, Boost, Pydoop, Protobuf, and the Java JDK did you use for this release?

          • #6
            java.io.IOException: pipe child exception

            Hi Luca,

            I encountered the error shown in the text attachment. I'm using:
            1. hadoop-0.20.2
            2. pydoop-0.4.0_rc2
            3. Python2.6
            4. Protobuf-2.4.1
            5. seal-0.1.0
            6. boost_1_48_0
            7. biopython-1.59

            The wordcount example ran without error, but run_seqal.sh fails with this error.

            Jack