  • CloudBurst VS Bowtie

    Hi all,
    I am very interested in Hadoop applications in NGS. The best-known project in the Hadoop community is CloudBurst. However, I saw an evaluation of it in the paper "Searching for SNPs with cloud computing", which said: "CloudBurst is capable of reporting all alignments for millions of human short reads in minutes, but does not scale well to human resequencing applications involving billions of reads. Whereas CloudBurst aligns about 1 million short reads per minute on a 24-core cluster, a typical human resequencing project generates billions of reads, requiring more than 100 days of cluster time or a much larger cluster."
    Why does the paper claim that CloudBurst is not scalable? Crossbow, described in this paper, adopted Bowtie instead of CloudBurst for mapping short reads to the reference genome, and I would like to know the reasons. As I understand it, CloudBurst is natively built on MapReduce while Bowtie is not, so why did the authors reach this conclusion? Is there any solid comparison of these two short-read mapping tools? And if I just want to map short reads to the reference genome, which should I use: CloudBurst, or Crossbow without SOAPsnp (only the map step using Bowtie)? Thanks in advance.
    Last edited by xinwu; 08-08-2010, 05:46 PM.

  • #2
    Hi Xinwu,

    To be clear, both techniques are "scalable" in the sense that they both make good use of additional CPUs when they are added. (Granted: the authors only show experiments using a few dozen up to a few hundred CPU cores.) The problem with CloudBurst is that it's slower than Bowtie on a comparable number of cores. So the authors (I'm one of them, as is Mike Schatz, the author of CloudBurst) are saying that when CloudBurst is scaled to a *dataset* the size of a human resequencing dataset, it takes longer than researchers are willing to wait. I hope that's clearer. Frankly, we should probably have said "but takes a very long time to finish for" instead of "but does not scale well to".

    Ben



    • #3
      Hi Ben,

      Thanks for the clarification. CloudBurst combines Hadoop and RMAP, so I guess RMAP may be the speed bottleneck. Is it possible to replace RMAP with Bowtie? I mean a Hadoop version of Bowtie for large-scale short-read mapping.



      • #4
        In general, it is possible to swap different algorithms into cloud pipelines. In practice this takes some effort since programs' input and output formats might need to be changed, and you must consider whether the tool's memory footprint fits on a particular EC2 instance type, etc.

        Thanks,
        Ben
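
The two practical concerns mentioned in this reply, format conversion and memory footprint, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the tab-separated record layout (name, sequence, qualities), the function names `tab_to_fastq` and `fits_in_memory`, and the 25% memory headroom. It is not code from Crossbow or CloudBurst.

```python
from typing import Iterable, Iterator


def tab_to_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Convert hypothetical 'name<TAB>sequence<TAB>qualities' records to FASTQ.

    One-record-per-line text suits line-oriented pipeline stages, while many
    aligners expect FASTQ, so a conversion step like this is typical glue.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, seq, qual = line.split("\t")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch for read {name}")
        yield f"@{name}\n{seq}\n+\n{qual}\n"


def fits_in_memory(index_bytes: int, instance_ram_bytes: int,
                   headroom: float = 0.25) -> bool:
    """Return True if an index fits in RAM while leaving `headroom` free.

    The 25% default is an arbitrary illustrative margin for the operating
    system and the other processes running on the node.
    """
    return index_bytes <= instance_ram_bytes * (1.0 - headroom)


if __name__ == "__main__":
    for record in tab_to_fastq(["read1\tACGT\tIIII\n"]):
        print(record, end="")
    # A 3 GiB index on a node with 8 GiB of RAM, both made-up sizes.
    print(fits_in_memory(3 * 2**30, 8 * 2**30))
```

A real integration would also have to convert the aligner's output back into whatever the next pipeline stage consumes.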



        • #5
          Hi Ben,
          One more question: Bowtie is based on the BWT, and CloudBurst is based on seed-and-extend, like RMAP. Is it true that seed-and-extend is more sensitive and has fewer limitations (say, allowing gaps and indels) than other algorithms? If the only drawback of the seed-and-extend method is that it is time-consuming, that would be relatively easy to overcome in order to get more "accurate" or "flexible" results.
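
For readers unfamiliar with the term, the seed-and-extend idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not CloudBurst's or RMAP's actual code: it handles mismatches only (no gaps or indels), keeps the whole reference in an in-memory dictionary, and the names `build_seed_index` and `seed_and_extend` are made up here. It relies on the pigeonhole argument: if a read is cut into k + 1 non-overlapping seeds, any alignment with at most k mismatches must contain at least one seed that matches exactly.

```python
from collections import defaultdict


def build_seed_index(reference: str, seed_len: int) -> dict:
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def seed_and_extend(read: str, reference: str, max_mismatches: int) -> list:
    """Return sorted (start, mismatches) pairs for ungapped alignments.

    The read is cut into max_mismatches + 1 non-overlapping seeds. Each
    exact seed hit proposes a candidate start, which is then "extended"
    by counting mismatches across the full read length.
    """
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read is too short for this many mismatches")
    index = build_seed_index(reference, seed_len)
    hits = set()
    for i in range(max_mismatches + 1):
        offset = i * seed_len
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, ()):
            start = seed_pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue  # read would hang off the end of the reference
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    print(seed_and_extend("GATTTCA", "GATTACAGATTTCA", 1))
```

Supporting gaps and indels would mean replacing the mismatch count in the extension step with a dynamic-programming alignment, which is where much of the extra run time of such methods tends to go.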



          • #6
            How to read the CloudBurst source code

            Hi Mike Schatz,
            Recently I have been reading the CloudBurst source code, but it is hard to follow because it has very few comments. I would like to understand the details of the implementation. Could you please help me? Thanks!
