
  • Bowtie and Clustering question.

    Hi Group,
    I am a relative newbie trying to come up to speed. I managed to assemble a 20-core cluster and am just beginning to figure out how to work the bioinformatics assembly algorithms. My scenario is this:

    1) I currently have a WES raw data file measuring 5 GB. I have a quality score file which is approximately 12 GB.

    2) I have a four node AMD cluster with 32 GB RAM. I installed and configured Rocks software on the same.

    3) I have been looking into Bowtie to do the analysis on this cluster.

    Some questions which come to my mind are as follows

    1) How and where do I start?

    2) Is it possible to install bowtie on the ROCKS cluster such that I can use the 4 nodes to run the analysis in parallel?

    3) For this single massive file of 5 GB raw reads, how do I go about doing the assembly?

    4) With bowtie, am I restricted to using only ONE node to run the analysis on?

    5) OR, can I split my raw reads file into four parts, farm one part out to each of the nodes, do the assembly there, and then do a final assembly of the four assembled files?

    6) Has anyone installed Galaxy tools on a ROCKS cluster? Could you share your experiences of the same?

    I realize these are very basic and fundamental questions. But I would highly appreciate an answer. Hopefully I will be able to answer these questions on the forum in the near future.
    Regards
    Quantrix

  • #2
    Howdy, I'm new here, but I do parallel for a living. (hpc type)

    I can't speak to some of the things you've asked, but I have installed pMap and bowtie for customers for things such as this. I'd recommend pMap for simplicity. Either way you should be able to get every core on every node working with bowtie in parallel. IO will more than likely be your limiting factor then.

    Here is some information from The Ohio State University College of Medicine I wanted to share with you.




    pMap is MPI based, so if you have an interconnect (eth, ib, quadrics, myri, etc.) and some type of MPI installed you should be good. pMap supports BWA, SOAP, Bowtie, GSNAP, MAQ and RMAP.

    crossbow is Hadoop based. I can't say I've seen hadoop on rocks (not a fan of rocks myself, but it is an excellent way to start with clusters) but it is possible. I'd be REALLY surprised if no one has ever done it as there are some rather decent sized clusters out there (TACC, PNNL) using rocks. I'd search for a hadoop roll. I'd be willing to bet it's out there.

    hpc



    • #3
      There is some discussion of running Galaxy on ROCKS in this Galaxy-dev thread from this January.



      • #4
        Hi hpcguy and Tnab,
        Thanks for the replies. I shall look into pMap right away. It sounds like one possible solution for me to start exploring.

        @hpcguy,
        You say you are not a fan of Rocks. I have had to wrestle with quite a few issues in getting it up to speed due to a combination of factors. However, it is running smoothly now. I was wondering whether I should go ahead and use something like plain CentOS and install other stuff separately. What is your take on this? Do you have a favorite, and why? I was also looking into Ubuntu with Kerrighed as one option. (Ubuntu enterprise maybe?)
        The problem is that there is not very much out there, if anything at all, in terms of leads on how to go about clustering.



        • #5
          The following is an example of how to run bowtie on multiple nodes. It requires splitting the reads in the .fastq file across jobs, then reassembling the .sam output at the end.
          First see how many reads you have.

          "cat yourfile.fastq | echo $((`wc -l`/4))"

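A slightly more defensive version of the read count above, as a minimal sketch. It assumes an uncompressed FASTQ file with exactly four lines per read (no wrapped sequence lines); `count_reads` is a hypothetical helper name, not part of bowtie.

```shell
#!/bin/bash
# Count reads in an uncompressed FASTQ file.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
count_reads() {
    local fastq="$1"
    local lines
    # Redirect instead of 'cat |' so wc prints only the number, without a file name.
    lines=$(wc -l < "$fastq")
    echo $(( lines / 4 ))
}
```

Usage would be `count_reads yourfile.fastq`, which prints the number of reads on one line.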
          The result was 14901431, so in this case create two jobs to run on two different nodes of the Rocks cluster. I created a few .sh scripts and just keep editing them for each different job. Run "nano bowtie_script_1.sh", then edit as follows:

          #!/bin/bash
          #
          #$ -S /bin/bash
          bowtie -m 1 -S -p 4 -s 0 --qupto 7450715 /share/apps/bowtie-1.0.0/indexes/hg19 yourfile.fastq

          The second job will have a different start and finish. Split as many times as the number of nodes you want to run on; this example uses 2 nodes.
          Second script: "nano bowtie_script_2.sh", then edit as follows:
          #!/bin/bash
          #
          #$ -S /bin/bash
          bowtie -m 1 -S -p 4 -s 7450715 --qupto 14901431 /share/apps/bowtie-1.0.0/indexes/hg19 yourfile.fastq
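For more than two nodes, the start and length of each chunk can be computed rather than worked out by hand. The sketch below only does the arithmetic; `chunk_bounds` is a hypothetical helper. One caveat, as I read the Bowtie manual: `--qupto` is the number of reads to align after the `-s` reads have been skipped, not an end position, so the second value printed here is a count. It is worth confirming with `bowtie --help` on your installed version.

```shell
#!/bin/bash
# Print one "skip count" pair per chunk, covering `total` reads in `n` chunks.
# The first (total % n) chunks get one extra read so nothing is dropped.
chunk_bounds() {
    local total="$1" n="$2"
    local base=$(( total / n ))
    local rem=$(( total % n ))
    local skip=0 count i
    for (( i = 0; i < n; i++ )); do
        count=$base
        if (( i < rem )); then
            count=$(( base + 1 ))
        fi
        echo "$skip $count"
        skip=$(( skip + count ))
    done
}

# Example with the read count from this post, split across 2 nodes:
chunk_bounds 14901431 2
# prints:
# 0 7450716
# 7450716 7450715
```

Each output line gives the values for `-s` and `--qupto` of one job.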

          If you have bowtie installed correctly, you can then run the following:

          qsub bowtie_script_1.sh
          qsub bowtie_script_2.sh
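Instead of editing each script in nano, the job scripts could also be generated. This is a minimal sketch that reuses the bowtie options from the scripts above; `write_job_script` and its argument order are hypothetical, and the skip and count values are whatever you computed for each job.

```shell
#!/bin/bash
# write_job_script FILE SKIP COUNT INDEX FASTQ
# Writes an SGE job script that runs bowtie on one slice of the reads.
# Options (-m 1 -S -p 4) are copied from the hand-written scripts in this post.
write_job_script() {
    local file="$1" skip="$2" count="$3" index="$4" fastq="$5"
    cat > "$file" <<EOF
#!/bin/bash
#
#\$ -S /bin/bash
bowtie -m 1 -S -p 4 -s $skip --qupto $count $index $fastq
EOF
}
```

For example, `write_job_script bowtie_script_1.sh 0 7450715 /share/apps/bowtie-1.0.0/indexes/hg19 yourfile.fastq` writes a script equivalent to the first one above, which can then be submitted with qsub as usual.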

          this will result in two files in .SAM format

          bowtie_script_1.sh.o##
          bowtie_script_2.sh.o##

          you would then need to join the two outputs into one .SAM file.

          "cat bowtie_script_1.sh.o## <(grep -v '^@' bowtie_script_2.sh.o##) > merged_sam.sam"

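The same merge generalizes to any number of job outputs: keep the first file whole (header included) and append only the alignment lines from the rest. A minimal sketch, assuming every job used the same index so the headers are identical; `merge_sam` is a hypothetical helper name.

```shell
#!/bin/bash
# merge_sam OUT IN1 [IN2 ...]
# Copies IN1 as-is, then appends the non-header lines of the remaining inputs.
# SAM header lines start with '@'; alignment lines do not.
merge_sam() {
    local out="$1"; shift
    local first="$1"; shift
    cat "$first" > "$out"
    local f
    for f in "$@"; do
        grep -v '^@' "$f" >> "$out"
    done
}
```

With the two outputs above this would be called as `merge_sam merged_sam.sam bowtie_script_1.sh.o## bowtie_script_2.sh.o##`, with the job numbers filled in.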
          Installing bowtie:

          To make it available to all of your compute nodes, install it into the /export/apps/ folder on the frontend (on a default Rocks setup this should show up on the nodes as /share/apps).

          Then edit the PATH in "/etc/skel/.bash_profile" to include ":/share/apps/bowtie-1.0.0"

          If a job you run using qsub errors out, it will create an error file in your home directory, which will point you in the right direction.

          good luck.



          • #6
            I suppose pMap will work flawlessly on a Rocks cluster based on SGE, right?
            It supports bowtie; does it also support bowtie2?

            Thanks.



            • #7
              Howdy. To all the folks that have sent me Private Messages about this: please set up your mailbox such that I can reply. I cannot answer your questions without a way to reach you. thanks.

              H



              • #8
                Rocks is fantastic when a group/person/dept is starting out. No bones about it. Fantastic. Roll it out on a single rack in 10 min if you just give it a go. Be up and running apps in 15 min (with data being available). Not much beats this. Even AWS takes more work to configure. I've personally installed it and had a 2 rack cluster up and running from turn on in under 30 minutes and was running batch jobs. But the cluster was NEVER supposed to run another application ever again.

                The problem comes as soon as there is a move into a more intermediate need/area. Rocks is not as flexible as it needs to be to keep advanced work simple. Moving to stock CentOS, Scientific Linux, RHEL, Ubuntu LTS, etc. becomes a large step that can be intimidating, but long term most folks that I've spoken or worked with look back and say they were glad they made the move.

                I would recommend making the change to something else when you feel Rocks just is too restrictive or you need more than you can find in the normal Rolls, etc.

