  • CloudBurst VS Bowtie

    Hi all,
    I am very interested in Hadoop applications in NGS. The best-known such project in the Hadoop community is CloudBurst, but I came across an evaluation of it in the paper "Searching for SNPs with cloud computing". It says: "CloudBurst is capable of reporting all alignments for millions of human short reads in minutes, but does not scale well to human resequencing applications involving billions of reads. Whereas CloudBurst aligns about 1 million short reads per minute on a 24-core cluster, a typical human resequencing project generates billions of reads, requiring more than 100 days of cluster time or a much larger cluster."
    Why does the paper claim that CloudBurst is not scalable? Crossbow, the tool described in this paper, uses Bowtie instead of CloudBurst to map short reads to the reference genome, and I would like to know why. As I understand it, CloudBurst is natively MapReduce-based while Bowtie is not, so why did the authors reach this conclusion? Is there any solid comparison of these two short-read mapping tools? And if I just want to map short reads to the reference genome, which should I use: CloudBurst, or Crossbow without SOAPsnp (only the mapping step, using Bowtie)? Thanks in advance.
    Last edited by xinwu; 08-08-2010, 05:46 PM.

  • #2
    Hi Xinwu,

    To be clear, both techniques are "scalable", in the sense that they both make good use of additional CPUs when they are added. (Granted: the authors only show experiments using a few dozen up to a few hundred CPU cores.) The problem with CloudBurst is that it's slower than Bowtie on a comparable number of cores. So the authors (I'm one of them, as is Mike Schatz, the author of CloudBurst) are saying that when CloudBurst is scaled to a *dataset* the size of a human resequencing dataset, it takes longer than researchers are willing to wait. I hope that's more clear. Frankly, we should probably have said "but takes a very long time to finish for" instead of "but does not scale well to".

    Ben
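The distinction drawn in this reply, between scaling with added cores and absolute speed, can be illustrated with a little arithmetic. The sketch below assumes perfectly linear scaling, and the two per-core throughputs are made-up illustrative values, not measurements of CloudBurst, Bowtie, or anything reported in the paper.

```python
def wall_clock_days(total_reads, reads_per_core_minute, cores):
    """Wall-clock days to align total_reads, assuming perfectly linear scaling."""
    reads_per_minute = reads_per_core_minute * cores
    return total_reads / reads_per_minute / (60 * 24)


# Hypothetical per-core throughputs (illustrative only, NOT from the paper).
SLOW_ALIGNER = 1_000         # reads per core per minute
FAST_ALIGNER = 50_000        # reads per core per minute
TOTAL_READS = 3_000_000_000  # a dataset with "billions of reads"

for cores in (24, 48, 96):
    slow = wall_clock_days(TOTAL_READS, SLOW_ALIGNER, cores)
    fast = wall_clock_days(TOTAL_READS, FAST_ALIGNER, cores)
    print(f"{cores:3d} cores: slow {slow:8.2f} days, fast {fast:6.2f} days")
```

Both hypothetical aligners halve their running time each time the core count doubles, so both "scale"; the slower one simply starts from a running time that may be too long to be practical on the same hardware.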



    • #3
      Hi Ben,

      Thanks for the clarification. CloudBurst combines Hadoop and RMAP, so I guess RMAP may be the speed bottleneck. Is it possible to replace RMAP with Bowtie? I mean a Hadoop version of Bowtie for large-scale short-read mapping.



      • #4
        In general, it is possible to swap different algorithms into cloud pipelines. In practice this takes some effort since programs' input and output formats might need to be changed, and you must consider whether the tool's memory footprint fits on a particular EC2 instance type, etc.

        Thanks,
        Ben
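As one small illustration of the format work mentioned above: a streaming-style mapper typically receives records as lines of text, while an aligner usually expects FASTQ. The sketch below assumes a hypothetical tab-separated record layout (name, sequence, qualities); the layout and the function names are illustrative assumptions, not the format used by Crossbow or CloudBurst.

```python
import io


def record_to_fastq(line):
    """Convert one 'name<TAB>sequence<TAB>qualities' line to a FASTQ record."""
    name, seq, qual = line.rstrip("\n").split("\t")
    if len(seq) != len(qual):
        raise ValueError(f"sequence/quality length mismatch for read {name}")
    return f"@{name}\n{seq}\n+\n{qual}\n"


def convert(stream_in, stream_out):
    """Copy every non-empty record from stream_in to stream_out as FASTQ."""
    count = 0
    for line in stream_in:
        if not line.strip():
            continue  # skip blank lines
        stream_out.write(record_to_fastq(line))
        count += 1
    return count


# Usage with in-memory streams; a real mapper would use sys.stdin and a
# pipe or temporary file feeding the aligner.
demo_in = io.StringIO("read1\tACGT\tIIII\nread2\tTTGA\tIIII\n")
demo_out = io.StringIO()
n_converted = convert(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

The aligner's output would need a matching adapter in the other direction, and memory footprint still has to be checked separately for each instance type.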



        • #5
          Hi Ben,
          One more question: Bowtie is based on the BWT, while CloudBurst uses a seed-and-extend approach like RMAP. Is it true that seed-and-extend is more sensitive and has fewer limitations (for example, it allows gaps and indels) than other algorithms? If the only drawback of the seed-and-extend method is that it is time-consuming, that should be relatively easy to overcome in order to get more "accurate" or "flexible" results.
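For readers unfamiliar with the term, here is a minimal sketch of the seed-and-extend idea for mismatch-only alignment. It relies on the pigeonhole principle: if a read aligns with at most k mismatches and is cut into k+1 non-overlapping seeds, at least one seed must match exactly. This is a toy illustration with hypothetical function names, not the CloudBurst or RMAP implementation, and it does not handle gaps or indels.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        index[reference[i:i + seed_len]].append(i)
    return index


def seed_and_extend(read, reference, max_mismatches):
    """Return sorted (start, mismatches) pairs with mismatches <= max_mismatches."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read too short for this many mismatches")
    index = build_index(reference, seed_len)
    hits = {}
    for s in range(max_mismatches + 1):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            if start in hits:
                continue  # already verified via another seed
            window = reference[start:start + len(read)]
            # "Extend": verify the full read by counting mismatches.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


reference = "ACGTACGTTTGACCAGTACG"
print(seed_and_extend("TTGACGAG", reference, 1))
```

The extension step is where the extra cost comes from: every exact seed hit triggers a full comparison of the read against the reference window.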



          • #6
            How to read the CloudBurst source code

            Hi Mike Schatz,
            I have recently been reading the CloudBurst source code, but it is hard to follow because it has very few comments. I would like to understand the implementation details. Could you please help me? Thanks!
