  • jnfass
    Member
    • Aug 2008
    • 88

    gsAssembler / newbler hangs during (large?) assembly

    I'm wondering if anyone else has seen this behavior:

    I've started an assembly run several times, and each time it gets to about this point:
    Assembly computation starting at: Fri Dec 19 17:09:50 2008 (v1.1.03.24)
    Indexing reads...
    -> 1713539 reads, 408657446 bases.
    Setting up overlap detection...
    -> 1713305 of 1713305
    Building a tree for 31819079 seeds...
    Computing alignments...
    -> 1693905 of 1693905
    Detangling alignments...
    -> Level 3, Phase 8, Round 1...
    and then ... stalls. I think the "Phase" and "Round" and even "Level" values have been different each time, which makes me think that maybe it's still working on the data, but it's taking a lot longer than I expected ...
    I've got ~1.7M reads, ~250bp N50 ... and an assembly of ~1/10 of this data finished in maybe 15 minutes. But it's going on 65 hours now with the full data set ... unfortunately I don't know how recently the Level/Phase/Round have changed since newbler refreshes the same line in its output.

    Does this ring a bell to anyone? Should I just wait longer?

    Thanks,
    ~Joe
  • andpet
    Member
    • Jul 2008
    • 27

    #2
    newbler hangs

    Hi,

    I think you should wait (maybe for a week or so).

    First:
    1.7M reads really is a lot of data, so the de novo assembly can take quite some time. For some assemblies I have waited at least a week!

    Maybe you can use a faster computer?


    Second:
    Is the genome you sequenced highly repetitive? If so, it will take even longer. In your log you can see that newbler starts by looking for pairwise read overlaps. Next it builds contigs from these overlaps. This is the "detangling" phase, in which newbler tries to resolve repeats (because of repeats, a read can overlap other reads in many ways, but only one placement is correct), and it is really time consuming. Another problem is that newbler needs a lot of RAM for this step. If you don't have enough, the operating system falls back on virtual memory (swap space on the hard disk), which is much slower than RAM and would slow your run down further.

    The more RAM the better ... :-)
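
    One way to see whether a run has spilled into swap is sketched below. It assumes a Linux machine that exposes /proc/meminfo; the function names are made up for this illustration and are not part of newbler.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <integer> [kB]" lines; skip anything malformed.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


# Assumption: Linux only. On other systems the file does not exist
# and nothing is printed.
if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print(f"swap in use: {swap_used_kb(info) / 1024:.0f} MB")
```

    If the swap figure keeps growing while the assembler is in the detangling phase, the run is probably memory-limited rather than just slow.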


    You could also use another assembler, for example euler, to get some larger contigs and then assemble those with newbler. Or mira ...

    By the way: in your newbler assembly directory there is a file, 454NewblerProgress.txt, where newbler reports every step (unfortunately without any run times) ...
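
    Since the progress file carries no run times, a small poller can add them from outside. The following is a minimal sketch, not a newbler feature: it assumes the file is plain text whose last line changes as the run advances, and that status updates may be separated by carriage returns. `last_nonempty_line` and `watch` are hypothetical helper names.

```python
import sys
import time
from datetime import datetime
from pathlib import Path


def last_nonempty_line(text):
    """Return the last non-empty line of text, or '' if there is none."""
    # Treat carriage returns as line breaks in case a status line
    # was rewritten in place (an assumption about the file format).
    lines = [ln.strip() for ln in text.replace("\r", "\n").split("\n")]
    lines = [ln for ln in lines if ln]
    return lines[-1] if lines else ""


def watch(path, interval=60.0, max_polls=None,
          clock=datetime.now, sleep=time.sleep):
    """Poll path; yield (timestamp, line) whenever the last line changes."""
    previous = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            text = Path(path).read_text(errors="replace")
        except FileNotFoundError:
            text = ""  # file not created yet; keep waiting
        current = last_nonempty_line(text)
        if current and current != previous:
            previous = current
            yield clock(), current
        if max_polls is None or polls < max_polls:
            sleep(interval)


# Usage: python watch_progress.py path/to/454NewblerProgress.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    for stamp, line in watch(sys.argv[1]):
        print(f"{stamp:%Y-%m-%d %H:%M:%S}  {line}", flush=True)
```

    Left running next to the assembly, this records when each Level/Phase/Round line first appeared, which shows whether the run is still advancing.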

    Cheers,

    Andreas


    • jnfass
      Member
      • Aug 2008
      • 88

      #3
      Thanks Andreas! ... my run is finally in the "Building contigs/scaffolds" stage, so I guess I sounded the alarm too soon. The run's not RAM-limited, and it's running on a 2.8GHz processor, but I haven't looked very much at repeat content ... thanks for the suggestion. Does anyone know if newbler's going to become multi-threaded any time soon?


      • hlu
        Member
        • Jan 2009
        • 32

        #4
        Hi Joe,

        I saw it mentioned in another forum that the sample is from a plant.

        Difficulty assembling plant 454 data is to be expected. Plant sequences are highly repetitive, and gsAssembler's running time is proportional to the degree of repetition in the data set. Typically, bacterial data of your size takes only a couple of hours to finish. For plants it can run for several days, or never finish at all and crash out of memory.


        • westerman
          Rick Westerman
          • Jun 2008
          • 1104

          #5
          Another problem, although it is probably not the root cause, is mixing Titanium and non-Titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

          The repetitive nature of plants is most likely your root cause.


          • jnfass
            Member
            • Aug 2008
            • 88

            #6
            @westerman -
            thanks for the tip ... may well be a future concern, but not with this data set. I'm working on setting aside the reads with repeat content (or masking) and will try to post back here to confirm or challenge the repeat cause.

            But I have another concern about newbler that I'll post in the "de novo discovery" forum ... it has to do with newbler apparently padding and offsetting (instead of aligning) SNPs ...


            • hlu
              Member
              • Jan 2009
              • 32

              #7
              Originally posted by westerman:
              "Another problem, although it probably is not the root cause, is mixing Titanium and non-Titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

              The repetitive nature of plants is most likely your root cause."

              -v is the vector trimming option in gsAssembler (or gsMapper).

              Titanium produces very long reads, some of which may contain adaptor sequence at the tail end. -v trims that off during assembly or mapping.

              Usually this does not cause a slowdown. But in samples where custom primers are dominant, the primer sequences can slow assembly down dramatically. The -v option solves this by trimming the primers off during assembly.
