  • The best genome de novo assembly software using hybrid data (Illumina, 454 & Sanger)?

    Hello everyone,

    I want to start a discussion about the best software for de novo assembly using hybrid sequencing data (Sanger, Illumina, 454, PacBio, etc.).

    It is well known that a mixture of insert lengths and read lengths helps assembly, and different sequencing platforms give different read lengths. However, few assemblers support hybrid sequencing data.

    I'm assembling a planarian genome de novo. The genome is big (~1.9 Gb) and highly repetitive (~66% repeats), so it is one of the most difficult genomes to assemble de novo.

    I already have Illumina, 454 and Sanger data, and I want to use all of them in the assembly. So far I have tried Velvet, SOAPdenovo, ABySS and ALLPATHS-LG, and I will try Celera Assembler next. Of these, only ALLPATHS-LG and Celera Assembler seem suitable for hybrid data, and even they are not that good.

    Is anyone else doing similar work and also trying to use hybrid data for assembly? I look forward to discussing it with you!

  • #2
    Other Assembly programs

    Ray can also handle the assembly of multiple formats.

    • #3
      You could try MSR-CA too (manual: http://www.genome.umd.edu/SR_CA_MANUAL.htm, source code: ftp://ftp.genome.umd.edu/pub/MSR-CA/), if you can get it up and running. I haven't managed to get it to run properly on my complete dataset yet; it seems to have a couple of bottlenecks or oddly designed code. I ran into memory problems with 1.3.3, and 1.4b has some Perl scripts that are really slow (reduce_sr.pl has been running for 3-4 days now).

      The premise of MSR-CA is really interesting, though: assemble Illumina reads into high-confidence unitigs/contigs with a de Bruijn graph, then combine them with the other data (454, Sanger) in CA afterwards.
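
A minimal Python sketch of the de Bruijn step described above, assuming error-free, single-stranded reads and a toy k. It is not MSR-CA's actual implementation, and the function names `build_de_bruijn` and `unitigs` are made up for illustration. Each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and unbranched paths are collapsed into unitigs.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unitigs(graph):
    """Collapse unbranched paths of the graph into unitig strings.

    Simplifications (assumptions of this sketch): no reverse complements,
    no error correction, and isolated cycles are skipped.
    """
    indeg = defaultdict(int)
    for successors in graph.values():
        for dst in successors:
            indeg[dst] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # Exactly one way in and one way out: safe to merge through.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    result = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # interior of a path, reached from its start node
        for nxt in sorted(graph.get(node, ())):
            path = node + nxt[-1]
            cur = nxt
            while is_simple(cur):
                cur = next(iter(graph[cur]))
                path += cur[-1]
            result.append(path)
    return result


if __name__ == "__main__":
    # Two overlapping toy reads collapse into a single unitig.
    print(unitigs(build_de_bruijn(["ACGTT", "GTTGCA"], 4)))
```

In a real pipeline the unitigs would then be handed to an overlap-based assembler together with the longer 454 and Sanger reads; that hand-off is not shown here.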

      • #4
        The big question is whether there will ever be one tool for all these different data types. The different assemblers out there are tailored to different sequencing platforms for good reasons. Short reads cannot be assembled using an OLC-based approach; this was solved by implementing the de Bruijn graph. Now that these short-read technologies reach 100 bases, and 150 on the MiSeq (and GAIIx, apparently), this might change, though.
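
To make the OLC point concrete, here is a rough Python sketch of the overlap step only, assuming exact suffix-prefix matches and no sequencing errors. The names `overlap` and `overlap_graph` are hypothetical, and real OLC assemblers use indexing and alignment rather than this brute-force comparison.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Accept only if the rest of a from that point is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_len=3):
    """Compare every ordered pair of reads and keep the pairs that overlap."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges


if __name__ == "__main__":
    print(overlap_graph(["ACGTTGCA", "TGCATTAC", "TTACGG"]))
```

The all-pairs loop grows quadratically with the number of reads, which is one reason this approach becomes impractical for hundreds of millions of short reads without heavy indexing.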

        So, perhaps using the best assembler for each datatype, and then developing a merging strategy would be better? Getting the best contigs possible first, then merge them and scaffold them using the best scaffolder?

        In this respect, the MSR-CA approach is quite interesting.

        • #5
          [QUOTE=flxlex;59886]Getting the best contigs possible first, then merge them and scaffold them using the best scaffolder?[/QUOTE]

          In this case, the scaffolds may be better, but not the contigs.

          I'm performing genome assembly with SOAPdenovo. It can assemble Illumina short reads into contigs and then build scaffolds with additional long reads (such as 454 and Sanger), which is similar to the procedure you describe.

          But in my work, the contigs from SOAPdenovo are always very short. That's why I want to find software that can build contigs from all of the data together. Maybe then we can get much better contigs.
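
One common way to put a number on "very short" contigs when comparing assemblers is the N50 statistic. It is not mentioned in this thread, so treat it as an assumption about how one might measure this. A small Python sketch, where `n50` is a hypothetical helper and the contig lengths are made-up values:

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    half = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    print(n50([100, 50, 30, 20]))
```

Running this on the contigs from a short-read-only assembly and from a hybrid assembly would give a like-for-like comparison of contig contiguity.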

          • #6
            MIRA supports Illumina, 454, Sanger and Ion Torrent data. And I think Bastien is looking into PacBio as well.

            • #7
              Has anyone tried the recent version of Celera Assembler (7.0), which allows up to 2 billion reads? I have it running now with 700 million reads of ~140 bp plus some 454 data, and I'm pretty eager to see how it turns out.

              Also, has anyone been able to get MSR-CA running? I downloaded version 4, but it seems to stop during the super-read generation stage.

              • #8
                I've started a couple of assemblies of only 454 reads (about 45 million and 85 million, respectively) with CA 7.0, but they are still at the scaffolding step, and I reckon they will run for a week or two more.

                I've gotten MSR-CA 1.4 to run properly, but only on bacterial datasets (the Rhodobacter one from GAGE). I've tried it on our Illumina reads too (we have about 200 million reads, with more coming in a few weeks), but it spent a really long time on the reduce_sr.pl step (about 2-3 weeks), and I had to stop it before it finished. So it is possible, but I think the implementation of reduce_sr.pl is a bottleneck when using MSR-CA on larger datasets. I'll get back to you when I have some experience with our new Illumina reads (in six weeks' time).

                • #9
                  Here at Cofactor Genomics, we've seen limited success.
                  We have good results with transcript sequence: we preassembled Illumina and 454 reads separately and then brought them together with an OLC assembler. Here's a case where we didn't even cover the entire genome (2.6 Mb) until the hybrid assembly:

                  We are currently working on getting the same type of success with genomic sequence. Come see us at AGBT where we are presenting what does/doesn't work.

                  @Godevil
                  What kind of results are you getting on the Planarian assembly? How much sequence coverage do you have on each platform? We've done this recently and had a difficult time getting results.
                  Last edited by ians; 01-31-2012, 08:20 AM.
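
For the coverage question above, a back-of-the-envelope estimate is expected coverage = (number of reads × mean read length) / genome size. A small Python sketch follows. Only the ~1.9 Gb genome size comes from the first post; the per-platform read counts and lengths are hypothetical placeholders, and the function names are invented for illustration.

```python
def expected_coverage(n_reads, mean_read_len, genome_size):
    """Average sequencing depth, assuming reads are spread evenly over the genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_len / genome_size


def coverage_by_platform(libraries, genome_size):
    """Compute expected coverage for each (n_reads, mean_read_len) library."""
    return {
        name: expected_coverage(n_reads, read_len, genome_size)
        for name, (n_reads, read_len) in libraries.items()
    }


if __name__ == "__main__":
    GENOME_SIZE = 1_900_000_000  # ~1.9 Gb, from the first post

    # Hypothetical read counts and mean lengths, NOT figures from this thread.
    hypothetical_libraries = {
        "Illumina": (400_000_000, 100),
        "454": (10_000_000, 400),
        "Sanger": (1_000_000, 800),
    }
    for name, cov in coverage_by_platform(hypothetical_libraries, GENOME_SIZE).items():
        print(f"{name}: {cov:.2f}x")
```

With ~66% repeats, the effective coverage of the unique fraction can differ a lot from this average, so the number is only a starting point for comparing platforms.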

                  • #10
                    AGBT Poster

                    I thought I'd share with everyone our AGBT poster, which outlines the success we had consolidating multi-platform sequence data to produce hybrid assemblies.
                    We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

                    AGBT Poster
                    Last edited by ians; 03-29-2012, 06:34 AM.

                    • #11
                      Originally posted by ians
                      I thought I'd share with everyone our AGBT poster, which outlines the success we had consolidating multi-platform sequence data to produce hybrid assemblies.
                      We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

                      https://docs.google.com/open?id=0BySV4NmVGJNfZTA4Mjg3MDEtMTAxMi00NGM0LTljOWEtYmM2N2ZjMThiZTNh
                      The link is broken.

                      • #12
                        Originally posted by vadim
                        The link is broken.
                        oops. fixed!

                        • #13
                          Originally posted by ians

                          @Godevil
                          What kind of results are you getting on the Planarian assembly? How much sequence coverage do you have on each platform? We've done this recently and had a difficult time getting results.

                          I cannot see your document.

                          Our genome assembly is bad. I think that's because of the low GC content, large genome size and high repetitiveness.
                          I'm now taking a training course at BGI in China, where I hope to pick up some useful information.

                          • #14
                            question

                            Originally posted by Ole
                            I've started a couple of assemblies of only 454 reads (about 45 million and 85 million, respectively) with CA 7.0, but they are still at the scaffolding step, and I reckon they will run for a week or two more.

                            I've gotten MSR-CA 1.4 to run properly, but only on bacterial datasets (the Rhodobacter one from GAGE). I've tried it on our Illumina reads too (we have about 200 million reads, with more coming in a few weeks), but it spent a really long time on the reduce_sr.pl step (about 2-3 weeks), and I had to stop it before it finished. So it is possible, but I think the implementation of reduce_sr.pl is a bottleneck when using MSR-CA on larger datasets. I'll get back to you when I have some experience with our new Illumina reads (in six weeks' time).
                            Which step uses the reduce_sr.pl script? There is no information about it in the software's manual.

                            • #15
                              Originally posted by erhuangzi
                              Which step uses the reduce_sr.pl script? There is no information about it in the software's manual.
                              The MSR-CA manual is pretty lacking, but this is the step where the program tries to find redundant super reads, and remove them. That's my guess at least.
