  • Hadoop for human genome data

    Hello Everyone,

    How do we store human genome data in Hadoop at the chromosome level so that we can run processing (bio-algorithm computing) on it using Hadoop clusters?

  • #2
    How best to store the data depends entirely on how the cluster is constructed and on the nature of the algorithm. If the cluster is essentially a cloud with slow IO, then you'll approach this differently than with an HPC cluster with a faster local storage array. Also, if you just need to load the genome into memory for long computations, then it doesn't really matter how you store it; that's not going to be the bottleneck.


    • #3
      Hi Ryan,
      Thanks for your reply. We have a cluster of 30 machines running Hadoop, and we are planning to process the human genome project data with it. The data is in the form of BAM files. I know that if I load the data into HDFS, it will automatically be split into blocks and stored on the data nodes. That is the problem here: I can't have the data split like that. It needs to be split chromosome-wise so that we can perform bio algorithm computing on it.

      Can someone please give some insights on this?


      • #4
        Without knowing more detail it's impossible to give any guidance. Hadoop is a general tool to facilitate processing. How you should split things depends entirely on what you want to do with the results (and "bio algorithm computing" has absolutely no meaning).


        • #5
          Bio algorithm computing: for instance, bisulfite methylation extraction.


          • #6
            Yes, that's one of many possible but completely unrelated tasks. I've already responded to this on one of your Biostars threads. Please don't cross-post.


            • #7
              Currently we use BSMap (a Python tool). Is there a way to store the data chromosome-wise on Hadoop and run the BSMap command as MapReduce jobs?


              • #8
                How this would be done depends entirely on the cluster, but there's generally no single command (or simple series thereof) that would allow that. The traditional way to do this would be to tell BSMap's methylation extractor to process just a single chromosome, and then run it simultaneously for different chromosomes on different cores. You could do that in a fraction of the time it'd take to implement a full Hadoop-based solution.
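
A minimal sketch of the "one chromosome per core" approach, assuming a single multi-core machine and an extractor that can be restricted to one chromosome from the command line. The names `run_per_chromosome` and `demo_command` are made up for illustration, and `demo_command` is only a placeholder that prints a message; swap in the actual extractor invocation and its real options after checking the tool's own documentation.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_per_chromosome(chromosomes, build_command, max_workers=4):
    """Run one external command per chromosome, several at a time.

    build_command(chrom) must return the argument list for that chromosome.
    Returns a dict mapping chromosome name to the process exit code.
    """
    def run_one(chrom):
        # Each thread only waits on its own subprocess, so threads are
        # enough here; the real work happens in the child processes.
        result = subprocess.run(build_command(chrom))
        return chrom, result.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, chromosomes))


def demo_command(chrom):
    # Placeholder command (an assumption, not a real extractor call):
    # replace with the actual per-chromosome invocation of your tool.
    return [sys.executable, "-c", f"print('processing {chrom}')"]


if __name__ == "__main__":
    chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    codes = run_per_chromosome(chroms, demo_command, max_workers=4)
    failed = [chrom for chrom, code in codes.items() if code != 0]
    print("failed:", failed)
```

Set `max_workers` to roughly the number of cores you can spare. On a multi-node cluster the same idea is usually expressed as one scheduler job per chromosome rather than one thread per chromosome.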
