  • Godevil
    Member
    • Feb 2011
    • 22

    The best genome de novo assembly software using hybrid data (Illumina, 454 & Sanger)?

    Hello everyone,

    I'd like to start a discussion about the best software for de novo assembly using hybrid sequencing data (Sanger, Illumina, 454, PacBio, etc.).

    It is well known that a mix of insert lengths and read lengths helps assembly, and different sequencing platforms give us different read lengths. However, few programs support assembly of hybrid sequencing data.

    I'm de novo assembling a planarian genome. The genome is big (~1.9 Gb) and highly repetitive (~66% repeats), so it is one of the most difficult genomes to assemble de novo.

    I already have Illumina, 454 and Sanger data, and I'm trying to use all of them in the de novo assembly. So far I have tried Velvet, SOAPdenovo, ABySS and ALLPATHS-LG, and I will try Celera. However, only ALLPATHS-LG and Celera seem OK for hybrid data, and even they are not so good.

    Is anyone doing similar work who also wants to use hybrid data for assembly? I look forward to discussing it with you!
  • severin
    Genome Informatics Facility
    • Sep 2009
    • 105

    #2
    Other Assembly programs

    Ray can also handle the assembly of multiple formats.


    • Ole
      Member
      • Oct 2011
      • 17

      #3
      You could try MSR-CA (http://www.genome.umd.edu/SR_CA_MANUAL.htm, the source code is here: ftp://ftp.genome.umd.edu/pub/MSR-CA/) too, if you can get it up and running properly. I haven't managed to get it to run properly on my complete dataset yet; it seems to have a couple of bottlenecks or weirdly designed code. I ran into memory problems with 1.3.3, and 1.4b has some Perl scripts that are really slow (reduce_sr.pl has been running for 3-4 days now).

      The premise of MSR-CA is really interesting though: assemble Illumina reads into highly confident unitigs/contigs with a de Bruijn graph, which are then combined with the other data (454, Sanger) in CA afterwards.
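
To make the first half of that idea concrete, here is a minimal Python sketch of building a de Bruijn graph from reads and reading off unitigs (maximal non-branching paths). This is only an illustration, not MSR-CA's code: the function names are made up, and it ignores reverse complements, sequencing errors and coverage, which any real assembler has to handle.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def unitigs(reads, k):
    """Return maximal non-branching paths, spelled out as sequences.

    Simplification: isolated cycles with no branching node are skipped.
    """
    succ, pred = build_graph(reads, k)
    nodes = set(succ) | set(pred)

    def is_simple(node):
        # Exactly one way in and one way out: the path can pass through.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    result = []
    for start in sorted(nodes):
        if is_simple(start):
            continue  # unitigs start only at branching or terminal nodes
        for nxt in sorted(succ.get(start, ())):
            path = start
            cur = nxt
            while True:
                path += cur[-1]  # each step adds one base
                if not is_simple(cur):
                    break
                cur = next(iter(succ[cur]))
            result.append(path)
    return result


# Two overlapping toy reads collapse into a single unitig.
print(unitigs(["ACGTTG", "GTTGCA"], 4))  # ['ACGTTGCA']
```

In a hybrid pipeline of the kind described above, sequences like these would then be handed to an overlap-based assembler together with the longer reads.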


      • flxlex
        Moderator
        • Nov 2008
        • 412

        #4
        The big question is whether there will ever be one tool for all (these) different datatypes. The different assemblers out there are tailored to different sequencing platforms for good reasons. Short reads cannot be assembled using an OLC-based (overlap-layout-consensus) approach; this was solved by implementing the de Bruijn graph. Now that these short-read technologies reach 100 bases, and 150 on the MiSeq (and GAIIx, apparently), this might change, though.

        So, perhaps using the best assembler for each datatype, and then developing a merging strategy, would be better? Getting the best contigs possible first, then merging them and scaffolding them using the best scaffolder?

        In this respect, the MSR-CA approach is quite interesting.


        • Godevil
          Member
          • Feb 2011
          • 22

          #5
          Originally posted by flxlex View Post
          Getting the best contigs possible first, then merging them and scaffolding them using the best scaffolder?

          In this case, the scaffolds may be better, but not the contigs.

          I'm performing genome assembly with SOAPdenovo. This software can assemble Illumina short reads into contigs and then generate scaffolds with some extra long reads (such as 454 and Sanger), a procedure similar to the one you describe.

          But in my work, the contigs from SOAPdenovo are always very short. That's why I want to find software that can build contigs from all of the data together. Maybe we can get much better contigs that way.
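
Since "short" is easier to compare with a number, contig sets are often summarised by N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, assuming the contig lengths are already in a list (the function name and the example lengths are hypothetical, not output from SOAPdenovo):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths, in bases.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))  # 5
```

Computing the same number for the contigs from each assembler gives a quick, if crude, way to compare them; it says nothing about whether the contigs are correct.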


          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            MIRA supports Illumina, 454, Sanger and Ion Torrent data. And I think Bastien is looking into PacBio as well.


            • SLB
              Member
              • Sep 2010
              • 21

              #7
              Has anyone tried the recent version of Celera (7.0), which allows up to 2 billion reads? I have it running now with 700M reads of ~140 and some 454, and I'm pretty eager to see how it turns out.

              Also, has anyone been able to get MSR-CA running? I downloaded version 4, but it seems to stop during the super-read generation stage.


              • Ole
                Member
                • Oct 2011
                • 17

                #8
                I've started a couple of assemblies of only 454 reads (about 45 million and 85 million, respectively) with CA 7.0, but they are still at the scaffolding step, and I reckon they will run for a week or two more.

                I've gotten MSR-CA 1.4 to run properly, but only on bacterial datasets (the Rhodobacter one from GAGE). I've tried it on our Illumina reads too (we have 200 million reads or so, with more coming in a few weeks), but it spent a really long time on the reduce_sr.pl step (about 2-3 weeks), and I had to stop it before it finished. So it is possible, but I think the implementation of reduce_sr.pl is a bottleneck when using MSR-CA on larger datasets. I'll come back to you when I have some experience with our new Illumina reads (in 6 weeks' time).


                • ians
                  Member
                  • Aug 2011
                  • 53

                  #9
                  Here at Cofactor Genomics, we've seen limited success.
                  We have good results with transcript sequence. We preassembled ILMN and 454 reads separately and then brought them together with an OLC. Here's a case where we didn't even hit the entire genome (2.6 MB) until the hybrid assembly:





                  We are currently working on getting the same type of success with genomic sequence. Come see us at AGBT where we are presenting what does/doesn't work.

                  @Godevil
                  What kind of results are you getting on the Planarian assembly? How much sequence coverage do you have on each platform? We've done this recently and had a difficult time getting results.
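
For anyone wanting to answer the coverage question for their own data, expected coverage is just the total number of sequenced bases divided by the genome size. A rough sketch follows; the ~1.9 Gb genome size is taken from the first post, but the read counts and read lengths are placeholder values, not anyone's actual data.

```python
def expected_coverage(n_reads, mean_read_length, genome_size):
    """Expected depth of coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_length / genome_size


def coverage_report(libraries, genome_size):
    """Map each platform name to its expected coverage."""
    return {
        name: expected_coverage(n_reads, read_len, genome_size)
        for name, (n_reads, read_len) in libraries.items()
    }


GENOME_SIZE = 1_900_000_000  # ~1.9 Gb, as stated in the first post

# Placeholder (read count, mean read length) pairs, for illustration only.
EXAMPLE_LIBRARIES = {
    "Illumina": (200_000_000, 100),
    "454": (5_000_000, 400),
    "Sanger": (500_000, 800),
}

for platform, depth in coverage_report(EXAMPLE_LIBRARIES, GENOME_SIZE).items():
    print(f"{platform}: {depth:.1f}x")
```

This is a nominal figure: in a highly repetitive genome the usable coverage in unique regions can be quite different from the average.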
                  Last edited by ians; 01-31-2012, 08:20 AM.


                  • ians
                    Member
                    • Aug 2011
                    • 53

                    #10
                    AGBT Poster

                    I thought I'd share our AGBT poster with everyone. It outlines the success we had consolidating multi-platform sequence to produce hybrid assemblies.
                    We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

                    AGBT Poster
                    Last edited by ians; 03-29-2012, 06:34 AM.


                    • vadim
                      Member
                      • Sep 2009
                      • 37

                      #11
                      Originally posted by ians View Post
                      I thought I'd share our AGBT poster with everyone. It outlines the success we had consolidating multi-platform sequence to produce hybrid assemblies.
                      We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

                      https://docs.google.com/open?id=0BySV4NmVGJNfZTA4Mjg3MDEtMTAxMi00NGM0LTljOWEtYmM2N2ZjMThiZTNh
                      The link is broken.


                      • ians
                        Member
                        • Aug 2011
                        • 53

                        #12
                        Originally posted by vadim View Post
                        The link is broken.
                        Oops. Fixed!


                        • Godevil
                          Member
                          • Feb 2011
                          • 22

                          #13
                          Originally posted by ians View Post

                          @Godevil
                          What kind of results are you getting on the Planarian assembly? How much sequence coverage do you have on each platform? We've done this recently and had a difficult time getting results.

                          I cannot see your document.

                          Our genome assembly is bad. I think that's because of the low GC content, big genome size and high repetitiveness.
                          I'm now taking a training course at BGI in China. I hope I can get some useful information there.


                          • erhuangzi
                            Junior Member
                            • Feb 2012
                            • 3

                            #14
                            question

                            Originally posted by Ole View Post
                            I've started a couple of assemblies of only 454 reads (about 45 million and 85 million, respectively) with CA 7.0, but they are still at the scaffolding step, and I reckon they will run for a week or two more.

                            I've gotten MSR-CA 1.4 to run properly, but only on bacterial datasets (the Rhodobacter one from GAGE). I've tried it on our Illumina reads too (we have 200 million reads or so, with more coming in a few weeks), but it spent a really long time on the reduce_sr.pl step (about 2-3 weeks), and I had to stop it before it finished. So it is possible, but I think the implementation of reduce_sr.pl is a bottleneck when using MSR-CA on larger datasets. I'll come back to you when I have some experience with our new Illumina reads (in 6 weeks' time).
                            Which step uses the reduce_sr.pl script? There is no information about it in the software's manual.


                            • Ole
                              Member
                              • Oct 2011
                              • 17

                              #15
                              Originally posted by erhuangzi View Post
                              Which step uses the reduce_sr.pl script? There is no information about it in the software's manual.
                              The MSR-CA manual is pretty lacking, but this is the step where the program tries to find redundant super-reads and remove them. That's my guess, at least.
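
If that guess is right, the core idea could be sketched as below: drop any sequence that is an exact substring of a longer sequence that is being kept. This is an assumption about what "redundant" means here, not the actual reduce_sr.pl logic, and the function name is hypothetical. It also ignores reverse complements.

```python
def drop_contained(sequences):
    """Keep only sequences that are not exact substrings of a kept one."""
    kept = []
    # Longest first, so any container is already in `kept` when needed;
    # ties are broken alphabetically to make the output deterministic.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Toy example: two of the short sequences are contained in the long one.
print(drop_contained(["ACGTACGT", "GTAC", "TTTT", "ACGT"]))
# ['ACGTACGT', 'TTTT']
```

The all-against-all check here is quadratic in the number of sequences, so a naive version like this would scale poorly to hundreds of millions of reads.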

