  • Parallel Processing for Sequence Analysis

    Hello,

    I'm fairly new here and have been trying to get our systems configured properly for NGS analysis. I'm primarily concerned with ABi CS data, but will also be involved quite heavily with Solexa. Corona has its own built-in tools for configuring their applications to run on top of Torque PBS for processing on a cluster, which seems to work quite well. I've been searching for other options and am not finding very much. Solexa's GAPipeline appears to have some basic tools for parallelization, but we're not big fans of ELAND and would prefer to use MAQ or Bowtie for alignments. Neither of these two tools seems to come with much information on methods for batch job submission.

    I'm hoping to get some feedback from anyone with more experience on ways either to parallelize MAQ, Bowtie, etc., or at least to break up the jobs so that they can be submitted in a naively parallel fashion. Thanks in advance!

  • #2
    I'm probably the wrong person to attempt to answer your question, but as far as I know, we just run each lane through maq one at a time, then use mapmerge to assemble libraries back together. Thus, we often have eight maq jobs running at a time on the cluster for each machine in operation. Again, I'm not the person who submits the jobs, so other people can probably provide more information than I can.

    Sequence alignment theoretically belongs to the class of algorithms known as embarrassingly parallel: each sequence could theoretically be aligned by a separate computer, and the results then recombined. The question should just be what the optimal number of reads to align per instance is... and that I don't know. (-:
    The more you know, the more you know you don't know. —Aristotle
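To make the "split the reads, align each piece separately, recombine" idea concrete, here is a minimal Python sketch of the splitting step. It assumes the common four-lines-per-record FASTQ layout; the function names and the output file naming are made up for illustration and are not part of MAQ or Bowtie.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most reads_per_chunk reads.

    Assumption: every record is exactly 4 lines (no wrapped sequences).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("truncated FASTQ record")
        yield chunk


def split_fastq(path: str, prefix: str, reads_per_chunk: int) -> List[str]:
    """Write chunks to prefix.0001.fastq, prefix.0002.fastq, ... (hypothetical naming)."""
    names = []
    with open(path) as handle:
        for i, chunk in enumerate(chunk_fastq(handle, reads_per_chunk), start=1):
            name = f"{prefix}.{i:04d}.fastq"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

Each resulting file can then be aligned as an independent job. The chunk size is the open question raised above, so it is left as a parameter.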



    • #3
      Hm. The idea of separating lanes is good. I am familiar with most embarrassingly parallel methods for sequence analysis, but was hoping there might be some established methods specifically for NGS that have been developed. I am particularly interested in setting up a few processing pipelines that can be triggered (relatively automatically) and then run across our cluster system, then packaged up for post processing and results delivery.

      Tools like the corona pipeline are ideal because they are pre-configured to do so off the bat. MAQ would require some initial configuration and some scripts here and there to accomplish this. I guess a generic tool for parallelizing things may be too much to ask for, but aside from splitting up lanes, or splitting up each individual alignment task, I'm wondering what else might be able to work? Bowtie has methods for splitting up across multiple cores, using the '-p' option, and I would hope that this can somehow be leveraged to cross multiple systems as well. But that's where I start to get lost, and find myself trying to figure out the code at a much lower level, which is going to take me a very long time to solve...



      • #4
        Hi jperin,

        With respect to Bowtie, the -p option allows you to parallelize Bowtie in the sense of using multiple threads (which are hopefully mapped to multiple processor cores) on a single machine. For parallelizing across machines, I do not really have a pre-fab set of scripts for that. As an aside, I'm currently doing some work on getting Bowtie to work in a Cloud Computing framework, specifically using Hadoop. This would allow Bowtie to be parallelized across any cluster that has Hadoop installed, including Amazon's EC2 service. That's not ready for prime time yet, though.

        Thanks,
        Ben
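One way to combine the two levels described in this post (threads within a machine via -p, separate jobs across machines) is to generate one Torque/PBS job script per chunk of reads. The Python sketch below only builds the script text. The helper name, job name and file names are hypothetical, and the positional argument order (index, reads, output) is an assumption that should be checked against the Bowtie manual for your version.

```python
import shlex


def bowtie_pbs_script(index: str, reads: str, out: str, threads: int = 4) -> str:
    """Return the text of a PBS job script that runs one Bowtie alignment.

    Requests `threads` cores on a single node and passes the same number
    to Bowtie's -p option, so the threads stay on one machine.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    command = " ".join(
        ["bowtie", "-p", str(threads)] + [shlex.quote(a) for a in (index, reads, out)]
    )
    lines = [
        "#!/bin/bash",
        "#PBS -N bowtie_chunk",                # hypothetical job name
        f"#PBS -l nodes=1:ppn={threads}",      # one node, `threads` cores
        "cd $PBS_O_WORKDIR",                   # run from the submission directory
        command,
    ]
    return "\n".join(lines) + "\n"
```

Writing one such script per chunk and submitting each with qsub spreads the chunks across nodes, while -p keeps each job's threads within a single node.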



        • #5
          MAQ on cluster

          A few comments here.
          Here is a nice trick posted by Quang.


          Hi Victor,
          We use "maq fastq2bfq -n 1000000 ..." to split the reads.
          ....

          Q

          More here.
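Once the reads have been split with "maq fastq2bfq -n", the remaining steps are one alignment per chunk and a final merge. The Python sketch below only builds the command lines. The chunk file names are hypothetical, and the argument order used for "maq map" and "maq mapmerge" (output first, then inputs) is my recollection of the MAQ usage text, so please verify it before relying on it.

```python
from typing import List


def maq_chunk_commands(ref_bfa: str, chunk_bfqs: List[str], merged_map: str) -> List[List[str]]:
    """Return one 'maq map' command per chunk, followed by one 'maq mapmerge'.

    The map commands are independent of each other and can be submitted as
    separate cluster jobs; the mapmerge command must wait for all of them.
    """
    if not chunk_bfqs:
        raise ValueError("need at least one chunk")
    commands = []
    chunk_maps = []
    for bfq in chunk_bfqs:
        # Derive the per-chunk output name by swapping the .bfq suffix for .map
        out_map = bfq[:-4] + ".map" if bfq.endswith(".bfq") else bfq + ".map"
        commands.append(["maq", "map", out_map, ref_bfa, bfq])
        chunk_maps.append(out_map)
    commands.append(["maq", "mapmerge", merged_map] + chunk_maps)
    return commands
```

Each inner list can be passed to subprocess.run or written into a job script. Keeping the merge as the last entry makes the dependency on the map jobs explicit.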



          • #6
            Originally posted by jperin
            Tools like the corona pipeline are ideal because they are pre-configured to do so off the bat. MAQ would require some initial configuration and some scripts here and there to accomplish this. I guess a generic tool for parallelizing things may be too much to ask for, but aside from splitting up lanes, or splitting up each individual alignment task, I'm wondering what else might be able to work?
            As far as I know, the Corona pipeline does not do anything fancy. All it does is split up the alignment task by chromosome, with one CPU per 'chromosome' (note that a 'chromosome' could be a single contig/BAC/etc., depending on your organism). If you have a single chromosome, then Corona will only use one CPU.

            I could be running Corona lite improperly in which case let me know! But my experience is that Corona does not employ anything more than the same-old-same-old embarrassingly parallel methods.
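The per-'chromosome' splitting described here can be illustrated with a small Python sketch. This is not Corona's code, only a model of the behaviour as I understand it: one task per reference sequence, so the number of busy CPUs can never exceed the number of sequences in the reference FASTA. The function names are made up for the example.

```python
from typing import Iterable, List


def reference_names(fasta_lines: Iterable[str]) -> List[str]:
    """Return the name (first word of each '>' header) of every sequence."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip headers with no name
                names.append(parts[0])
    return names


def busy_cpus(fasta_lines: Iterable[str], available_cpus: int) -> int:
    """CPUs kept busy when each reference sequence gets exactly one task."""
    return min(len(reference_names(fasta_lines)), available_cpus)
```

With a single-sequence reference, busy_cpus returns 1 no matter how many CPUs are available, which matches the limitation noted above. Splitting the reads instead of the reference avoids that ceiling.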
