  • Pre-assembly for short-reads to minimize RAM usage

    Hello everybody!

    I'm planning a de novo assembly of ~1-5 Gb of short reads from a next-generation sequencer. The data are metagenomic, covering hundreds of species. Assembly programs (Velvet, SOAPdenovo, etc.) need a few hundred GB of RAM for such an analysis. Is there a known way to cluster the initial reads into related groups, so that the assembly can be performed group by group and peak RAM usage is reduced?

    Thanks ahead,
    Alex

  • #2
    Yes, this is exactly the question I have in mind.

    I have around 400 million 36 bp paired-end reads. I am trying to assemble them with Velvet, but I was wondering whether the input is too large and a pre-clustering step is needed.

    If so, what type of clustering approach?

    Thanks



    • #3
      Hi.
      I think the clustering has to be done before sequencing (for example, by selecting specific regions of the genome with enzymes), with each small data set then assembled independently.

      The only way to reduce the amount of memory needed is to perform an error correction step. The problem is that the error correction step may require more RAM than the de novo assembly itself.
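
To illustrate why cleaning up sequencing errors can shrink the assembly graph, here is a minimal Python sketch. It does not correct reads; it swaps in a simpler technique, dropping any read that contains a rare k-mer, since rare k-mers are often caused by sequencing errors. The function names `kmers` and `filter_reads` and the parameters `k` and `min_count` are hypothetical, and reverse complements are ignored. Note that the k-mer table itself lives in memory, which is the RAM cost mentioned above.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (reverse complements are not considered)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_reads(reads, k=5, min_count=2):
    """Keep only reads whose k-mers all occur at least min_count times.

    Reads shorter than k contain no k-mers and are kept unchanged.
    """
    # Count every k-mer across the whole data set (this table costs RAM).
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [
        read for read in reads
        if all(counts[km] >= min_count for km in kmers(read, k))
    ]
```

For a low-coverage metagenome this filter would probably discard genuine reads from rare species, so the threshold is something to tune rather than a recommendation.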

      Francesco



      • #4
        I came upon the following discussion:

        The idea is to pre-cluster k-mers into non-overlapping de Bruijn subgraphs, assemble them separately (with lower memory requirements), and then combine the results.
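
A rough Python sketch of the pre-clustering idea, under simplifying assumptions: reads that share at least one exact k-mer are placed in the same cluster using a union-find structure, so each cluster corresponds to one connected component of the k-mer graph. The function name `cluster_reads` and the parameter `k` are hypothetical, reverse complements are ignored, and this is not taken from the linked discussion or from any particular tool.

```python
from collections import defaultdict


def cluster_reads(reads, k=4):
    """Group reads into clusters connected by shared exact k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the cluster representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in first_seen:
                root_a, root_b = find(idx), find(first_seen[km])
                if root_a != root_b:
                    parent[root_a] = root_b  # merge the two clusters
            else:
                first_seen[km] = idx

    clusters = defaultdict(list)
    for idx, read in enumerate(reads):
        clusters[find(idx)].append(read)
    return list(clusters.values())
```

Each returned cluster could then be handed to an assembler separately. On real data, k-mers shared between species (repeats, conserved genes) may merge many reads into one large cluster, and the `first_seen` table itself needs memory proportional to the number of distinct k-mers.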



        • #5
          Thanks Alex8, that is quite an interesting discussion; it seems Curtain is worth looking at.

          Dear Francesco, I came across your post in this discussion thread.

          "...The trick usually is to work with a subset of 10% of the reads. Make multiple assemblies of several random subsets and then merge together the results."

          Can you please explain more about "random subsets"? Say we assemble 10% of our reads at a time: am I correct that we will end up with 10 separate sub-assembly results for assembly/scaffolding? Or are the subsets supposed to be random, so that the same read can exist in more than one subset?

          Thanks!



          • #6
            The post is quite old, and this approach was useful due to the lack of software able to assemble more than one lane.

            The idea was to PARTITION (here is your point) the data into 10 or fewer independent subsets and assemble each subset independently. This was, and still is, meaningful when the coverage is very high. If a microbe is sequenced at an expected coverage of 800X, then this approach is useful.
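
The partitioning described here can be sketched in Python as follows. This is a minimal illustration, not a tested pipeline: `partition_reads`, `n_subsets`, and `seed` are hypothetical names, and for paired-end data each list item is assumed to represent a whole read pair so that mates stay in the same subset.

```python
import random


def partition_reads(reads, n_subsets=10, seed=0):
    """Split reads into n_subsets disjoint, randomly assigned subsets.

    Every read lands in exactly one subset, so no read is repeated.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    order = list(range(len(reads)))
    rng.shuffle(order)

    subsets = [[] for _ in range(n_subsets)]
    for position, idx in enumerate(order):
        # Deal shuffled reads out round-robin so subset sizes differ by at most 1.
        subsets[position % n_subsets].append(reads[idx])
    return subsets
```

With 800X coverage and 10 subsets, each subset would still carry roughly 80X, which is why the approach suits very deep data sets rather than low-coverage metagenomes.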

            Francesco



            • #7
              I think that if yours is a metagenomic sample, your RAM requirement is likely to be large, and I am guessing there will be low coverage per species/contig.

              If you can already cluster the reads by k-mers, then you can do mini-assemblies using any program.

              Have a look at SoftGenetics' NextGENe to do the clustering. It looks useful, but I can't comment much as I have limited experience with it.
              http://kevin-gattaca.blogspot.com/
