  • help needed for de novo hybrid assembly strategy

    I am working on a de novo sequencing project for a yeast with a genome size of ~20 Mb. I have sequenced a 500 bp paired-end library (2×150 bp, Illumina MiSeq) to 50x coverage, but several portions of the genome are missing. I am now planning PacBio long-read sequencing (2 SMRT cells) to cover the missing regions. My question is: if I use PacBioToCA for assembly, will the long reads that contain the missing regions be filtered out because there are no short reads to correct them?
    Can you suggest any alternate strategy? Assembly with only PacBio reads is an option, but I think it requires very high coverage (~100x), which is out of my budget.

  • #2
    Assembly with only PacBio currently requires around 100x for a single-contig bacterial assembly, but it still works with lower coverage, you'll just get a worse assembly. And chances are good that your genome is actually completely or almost completely covered with Illumina reads, just at some points they are too low coverage or the areas are too repetitive for assembly. So you may still be able to correct almost all of the PacBio data (though you'll end up with a lot less data than you started with because that process is inefficient), and thus, you may be able to get a fairly complete genome that way... though when we correct PacBio data with Illumina data, we almost always start with more than 50x Illumina. You might try estimating the genome size from the kmer frequency distribution of your Illumina data - for example, with BBNorm:
    khist.sh in=reads.fq hist=histogram.txt k=31

    This will give you a 31-mer depth distribution, from which you can manually determine the size of the genome. There are also tools to automate it (for example, AllPathsLG has one), though I don't know how well they work. If the estimated genome size based on kmer frequencies is the same as your expected genome size, then it is probably almost completely covered with Illumina data.
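As a rough illustration of the manual estimate described above, here is a minimal Python sketch. It assumes the histogram is a whitespace-separated text file with the k-mer depth in the first column and the k-mer count in the second, and with header lines starting with `#`; check the real output before relying on that. The function names are hypothetical. The sketch treats everything below the first valley as sequencing-error k-mers, takes the highest point after the valley as the single-copy peak, and divides the total k-mer mass by the peak depth. It ignores heterozygosity, so for a diploid yeast the result should be read with caution.

```python
def parse_histogram(lines):
    """Parse (depth, count) pairs; the column layout is an assumption."""
    hist = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        hist.append((int(fields[0]), int(fields[1])))
    hist.sort()
    return hist


def estimate_genome_size(hist):
    """Return (estimated size in bp, valley depth, peak depth)."""
    depths = [d for d, _ in hist]
    counts = [c for _, c in hist]

    # Walk down the initial error slope until the counts stop falling.
    valley = 0
    while valley + 1 < len(counts) and counts[valley + 1] < counts[valley]:
        valley += 1

    # The main (single-copy) peak is the highest count at or after the valley.
    peak = max(range(valley, len(counts)), key=lambda i: counts[i])

    # Total k-mer occurrences, excluding the presumed error k-mers.
    total_kmers = sum(d * c for d, c in hist[valley:])
    return total_kmers / depths[peak], depths[valley], depths[peak]


if __name__ == "__main__":
    # Toy histogram, not real data: error k-mers at low depth, peak at depth 5.
    toy = ["#Depth\tCount", "1\t1000", "2\t100", "3\t10",
           "4\t50", "5\t100", "6\t50", "7\t10"]
    size, valley_depth, peak_depth = estimate_genome_size(parse_histogram(toy))
    print(size, valley_depth, peak_depth)
```

If the estimate comes out close to the expected ~20 Mb, that supports the idea that the Illumina data covers nearly the whole genome. Real histograms are noisier than the toy one, so it is worth plotting the distribution and checking the valley and peak by eye.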

    When we do hybrid fungal assemblies, we often create contigs from an Illumina fragment library, then use PBJelly to fill captured gaps. This does require captured gaps, though (but I think there may be a recent version of PBJelly that works with uncaptured gaps).

    So, try scaffolding your data with your existing 500bp-insert library, and see how good the scaffolding is; if you end up with ~20 Mb of scaffolds, you should be able to just use PBJelly with PacBio data to fill them in. Otherwise a long mate pair library is useful for scaffolding prior to filling gaps with PBJelly, but that's expensive too.
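One way to check whether the scaffolds add up to roughly the expected genome size is to total the sequence lengths in the scaffold FASTA. The sketch below is only an illustration: the function names and the `scaffolds.fasta` file name are hypothetical, and the 20 Mb expected size comes from the question above. Gap characters (N) inside scaffolds are counted as part of the length.

```python
def fasta_lengths(lines):
    """Return the length of each record in a FASTA given as an iterable of lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0  # start a new record
        elif line and current is not None:
            current += len(line)  # N gap characters are counted too
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that records of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths, expected_size):
    total = sum(lengths)
    return {
        "sequences": len(lengths),
        "total_bp": total,
        "n50": n50(lengths),
        "fraction_of_expected": total / expected_size,
    }


if __name__ == "__main__":
    # Hypothetical file name; 20 Mb is the expected genome size from the thread.
    with open("scaffolds.fasta") as handle:
        print(summarize(fasta_lengths(handle), expected_size=20_000_000))
```

A `fraction_of_expected` near 1.0 would suggest the scaffolds span most of the genome, in which case gap filling is a reasonable next step.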



    • #3
      Thanks for your reply. I have read about another strategy somewhere: assembling the Illumina and (low-coverage) PacBio reads separately, accepting gaps, and then merging the assemblies with minimus2. Would that be a better approach?



      • #4
        Using the latest version of PacBio's HGAP assembler I often see single contig bacterial assemblies at ~50x coverage with a sufficiently good long insert library. The PacBio human assembly (haploid) @54x has a contig N50 of 4.4Mb and a maximum contig of 44Mb.
        Given a 20Mb genome and ~350Mb per cell you should be able to hit this with 3-4 cells.
        For error correction of PacBio with Illumina I would recommend ECTools over PacBioToCA, it is a lot more computationally efficient.
        Separate assembly and merging can work quite well. For the PacBio assembly the latest version of HGAP (.3) allows self correction at lower coverage, but you need to be aware of the possibility of introducing misassemblies.
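The cell-count estimate above is simple arithmetic, sketched here in Python. The genome size (20 Mb) and per-cell yield (~350 Mb) are the figures quoted in this post; the function names are hypothetical, and the calculation uses raw yield, so usable coverage after filtering will be somewhat lower.

```python
import math


def coverage(genome_size_bp, yield_per_cell_bp, cells):
    """Raw fold coverage obtained from a given number of SMRT cells."""
    return cells * yield_per_cell_bp / genome_size_bp


def cells_needed(genome_size_bp, yield_per_cell_bp, target_coverage):
    """Smallest whole number of SMRT cells that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size_bp / yield_per_cell_bp)


if __name__ == "__main__":
    genome = 20_000_000      # 20 Mb yeast genome (from the thread)
    per_cell = 350_000_000   # ~350 Mb per SMRT cell (from the thread)

    for cells in (2, 3, 4):
        print(cells, "cells ->", coverage(genome, per_cell, cells), "x")
    # 2 cells -> 35.0 x, 3 cells -> 52.5 x, 4 cells -> 70.0 x

    print(cells_needed(genome, per_cell, 50))   # 3
    print(cells_needed(genome, per_cell, 100))  # 6
```

With these numbers, the two cells originally planned give about 35x, three to four cells give roughly 50x to 70x, and 100x would take six cells.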



        • #5
          Hi wrch,

          you may want to consider generating PacBio CCS reads, rather than CLR. The CCS reads have a much lower error rate (somewhere between 1 and 3% usually). This comes at the expense of length, but generally they don't need any error correction.

          I am currently working on a MiSeq/PacBio CCS dataset, and I have found out after a lot of experimentation that the best approach is this:

          - run the MiSeq reads through FLASH, a 3' read overlapper (this may not apply to you if your MiSeq reads don't overlap)
          - assemble the overlapped MiSeq reads with MSR-CA
          - do a meta-assembly of the MiSeq MSR-CA contigs and singletons with the unassembled PacBio reads, using CAP3 (old-style Sanger OLC assembler, works really well for this)

          The end result is that the PacBio reads complement the MiSeq MSR-CA contigs very nicely and connect these across gaps in many cases.

          cheers

          Micha



          • #6
            Thanks, Micha, for your suggestion. Can you please tell me how much data is generated per SMRT cell? I have read that ~300 Mb of CLR reads are generated per SMRT cell. Is that true for CCS reads as well?



            • #7
              Micha,
              That's an interesting approach, but isn't the read length limited to such an extent that this approach would never complete even relatively simple bacterial assemblies?
              http://genomebiology.com/2013/14/9/R101
              http://genomebiology.com/content/sup...9-r101-s2.html CCS read length could be approximated to C1 in this plot.
              The whole advantage of PacBio for assembly is long range information, which is lost when using CCS reads.
              Richard.
              Last edited by rhall; 05-19-2014, 09:16 AM.



              • #8
                Originally posted by rhall
                Micha,
                That's an interesting approach, but isn't the read length limited to such an extent that this approach would never complete even relatively simple bacterial assemblies?
                http://genomebiology.com/2013/14/9/R101
                http://genomebiology.com/content/sup...9-r101-s2.html CCS read length could be approximated to C1 in this plot.
                The whole advantage of PacBio for assembly is long range information, which is lost when using CCS reads.
                Richard.
                PacBio now generates "Reads of Insert", which are basically the same as CCS reads; they just name them differently for some reason. Anyway, we recently generated a bunch of these for 16S, averaging ~1500bp and mostly with accuracy of 95%-99%. PacBio reads have been getting longer quite rapidly. So, where that paper suggested shearing to 300bp - 800bp for CCS... I think now it would be better to target ~1500-2500bp if you want fairly high quality individual reads of insert. You could also, of course, target much longer reads and just assume that a lot will come out short.



                • #9
                  Even with an optimal insert size, CCS (Read of Insert) will not give you the long-range information needed for assembly. To maximize throughput and get the best CCS yield at a high number of passes, the resulting read length distribution will be somewhere around the C1 distribution in that plot, at which very few bacterial assemblies can be completed. To complete many relatively simple bacterial assemblies you need long-range information on the order of ~5 kbp, and >5 kbp CCS reads are going to be rare.
                  I don't see a compelling use for CCS / Reads of Insert in assembly.



                  • #10
                    My experience of this has been that the CCS reads - even though they are short by comparison with CLR - complement the Illumina reads nicely and bridge a lot of the gaps between contigs that have low-complexity sequence at the ends (e.g. microsatellites, homopolymer runs), where the de novo assembly of the Illumina reads alone was insufficient. They may not be perfect for completing whole genomes, but they have certainly improved our assemblies substantially (we have been using this approach for assembly of R gene sequences from enrichment sequencing).



                    • #11
                      assembly merge

                      Originally posted by wrch
                      Thanks for your reply. I have read about another strategy somewhere: assembling the Illumina and (low-coverage) PacBio reads separately, accepting gaps, and then merging the assemblies with minimus2. Would that be a better approach?
                      If you still need to do assembly merge, you can use GAM-NGS:

                      Background: In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through reads' alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph that allows an optimal resolution of local problematic regions.

                      Results: GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show how GAM-NGS is a tool able to output an improved reliable set of sequences. GAM-NGS is also a very efficient tool able to merge assemblies using substantially less computational resources than comparable tools. In order to achieve such goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among other assembly reconciliation tools.

                      Conclusions: The difficulty to obtain correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects.
With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generating that is most likely to be correct.


                      Best,
                      Simone



                      • #12
                        Originally posted by wrch
                        Can you suggest any alternate strategy? Assembly with only PacBio reads is an option, but I think it requires very high coverage (~100x), which is out of my budget.
                        You may try to use SPAdes for hybrid Illumina + PacBio assembly. It will happily use your PacBio data both for filling in unrepresented parts in your Illumina data and resolve repeats.



                        • #13
                          Originally posted by moistplus
                          Any advice for bigger genomes (Gb)?
                          Look at PacBio's hybrid assembly page on GitHub.


                          Given pre-existing Illumina-based assemblies and scaffolding with PacBio, I haven't had much luck with PBJelly, but (and this is ongoing work) I have hopes for the AHA program.



                          • #14
                            All.

                            Realistically, it is hard to tell which approach will work best. To some extent it depends on how much time you have and the resources you already have on hand. In my case I will almost always have some sort of Illumina-based assembly first because we are an Illumina shop. Then, if I get PacBio reads, layering them on top of the Illumina assembly makes sense. But other people may get PacBio reads first, find out that they are not assembling 100%, and so go out and get Illumina reads for scaffolding.



                            • #15
                              westerman: What is your issue with PBJelly? It is generally robust, given a good quality draft genome as input for gap filling. AHA is very old and no longer available / supported.

                              moistplus:
                              Hybrid assembly isn't very common, and de novo scaffolding with PacBio reads is not generally recommended. Gap filling with PBJelly can work really well, but the input Illumina assembly must be high quality. Assembling Illumina and PacBio data together can be successful; this is generally what people consider a hybrid assembly. Older methods (PacBioToCA, ECTools) corrected PacBio reads with Illumina data, then assembled them using standard OLC methodologies. More recent implementations (DBG2OLC, MaSuRCA) use a much more efficient approach, generally building the Illumina assembly graph before using PacBio data to resolve repeats in the graph.
                              By far the best results are from PacBio-only de novo assembly.

