  • De novo genome assembly strategy

    Assembling a genome de novo. I have:

    10X coverage with PacBio reads

    100X coverage with Illumina short reads (150 bp paired-end reads)

    20X coverage with long MiSeq reads (max length 800 bp)

    Given what I have to work with, what would be the best strategy to assemble the genome and why?

    Thank you,

    Joe

  • #2
    What's your approximate genome size? How repetitive is the genome? It's quite important to know things like that for trying to work out the best method.

    I'm trying to work this out myself. I've got long reads from MinION sequencing at ~0.3X coverage, a de novo assembled transcriptome that seems to have about the right size and number of genes, and can pull short-read Illumina data from EBI. The estimated genome size is about 200 Mbp, which probably rules out SPAdes as something that could do the assembly in a reasonable time frame.

    In general, this is [still] a fairly difficult problem, and one of the few areas of genetics that can still benefit from a huge computer cluster.
    Last edited by gringer; 03-29-2016, 12:18 PM. Reason: remove repetitive text
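
To put the ~0.3X long-read coverage mentioned above in perspective, here is a minimal sketch of the classic Poisson (Lander-Waterman style) estimate of how much of a genome is touched by at least one read. It assumes reads land uniformly at random, which real data only approximates, and the function name is purely illustrative.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes read starts are uniformly random, so per-base depth is roughly
    Poisson with mean `coverage`; P(depth >= 1) = 1 - exp(-coverage).
    """
    return 1.0 - math.exp(-coverage)


# 0.3X is the MinION figure from the post; 10X is the PacBio figure
# from the opening post, shown for comparison.
for cov in (0.3, 10.0):
    print(f"{cov:>4}X -> {expected_fraction_covered(cov):.1%} of bases expected under >=1 read")
```

Under that idealised model, 0.3X leaves roughly three quarters of the genome without any long read at all, so such reads can help with scaffolding but cannot carry an assembly on their own.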

    • #3
      Thanks Gringer.

      Estimated genome size is quite large, 20Gb

      After reading around, I have decided to try DBG2OLC.

      What led me there:
      [GitHub link; URL not preserved]


      The publication:


      The code:
      DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph. The source code can be found in the code page. To compile, go to the directory with all the code files, and use: g++ -o DBG2OLC -O3 *.cpp


      I'll report back on how it turns out.
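
For a sense of scale, the coverage figures from the opening post can be combined with the 20 Gb estimate to get rough raw data volumes. This is only back-of-the-envelope arithmetic (bases = coverage × genome size); the helper name is illustrative, and the read-count line treats every Illumina read as exactly 150 bp.

```python
GENOME_SIZE_BP = 20e9  # 20 Gb estimate from the post above

# (label, fold coverage) as listed in the opening post
DATASETS = [
    ("PacBio long reads", 10),
    ("Illumina 150 bp paired-end", 100),
    ("MiSeq long reads (max 800 bp)", 20),
]


def bases_needed(coverage: float, genome_size_bp: float) -> float:
    """Total sequenced bases implied by a given fold coverage."""
    return coverage * genome_size_bp


for label, coverage in DATASETS:
    tbp = bases_needed(coverage, GENOME_SIZE_BP) / 1e12
    print(f"{label}: {coverage}X -> {tbp:.1f} Tbp")

# Rough Illumina read count, assuming every read is 150 bp.
illumina_reads = bases_needed(100, GENOME_SIZE_BP) / 150
print(f"Illumina reads: ~{illumina_reads / 1e9:.1f} billion")
```

At this size, memory footprint and run time are likely to matter as much as assembly quality when choosing a tool.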

      • #4
        Hmm, I like the "forget about error correction" approach. For the genome I'm working with there's at least one gene with at least three different copies in the genome that are all expressed, so error correction is likely to result in misassembly.

        Unfortunately, the code's a little green. I'm not sure I trust code where the amount of commented-out code exceeds the amount of actual explanatory comments:

        Code:
        ...
        	bool rc_match = 0;
        	if (align_info_vec[0].ref_idx == -align_info_vec[1].ref_idx)
        	{
        		//if ((align_info_vec[0].max_match_qry < align_info_vec[1].min_match_qry))// || (align_info_vec[1].max_match_qry < align_info_vec[0].min_match_qry))//non overlap match
        		if ((align_info_vec[0].max_match_ref < align_info_vec[1].min_match_ref))// || (align_info_vec[1].max_match_qry < align_info_vec[0].min_match_qry))//non overlap match
        			{
        			map<int, int> local_index_qry;
        ...
        Emotive statements in the paper don't help either, particularly when I don't completely agree with them:

        Similar to Microsoft®Windows software to PC, the indispensability of genome assembly software to DNA sequencers is self-evident.
        ....
        While these algorithms and software packages have indeed achieved significant advancements for the 3rd GS genome assembly, the somewhat ad-hoc and intricate approaches some of the packages use may lead to structural errors since the path may be spurious due to chimeric long reads or may not exist due to limited coverage of the second generation sequencing.
        It looks like it's in a state of heavy development, so it will probably take at least a few months for the dust to settle and for it to become useful.
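
The "more commented-out code than real comments" impression above can be roughly quantified. The sketch below is a crude heuristic, not a parser: it only looks at `//` comments, treats a comment segment as code-like if it contains typical C++ punctuation, and will be fooled by `//` inside string literals or by block comments. The function name and the abbreviated sample (modelled on the excerpt quoted above, with shortened variable names) are illustrative only.

```python
import re

# Punctuation that is common in C++ code but rare in prose comments.
CODE_CHARS = re.compile(r"[;{}()\[\]=<>]")


def classify_comments(source: str) -> dict:
    """Count `//` comment segments that look like code versus prose."""
    counts = {"code_like": 0, "prose": 0}
    for line in source.splitlines():
        # Everything after the first "//" is comment text.
        _, sep, comment = line.partition("//")
        if not sep:
            continue
        # Further "//" markers inside the comment separate stacked segments.
        for segment in comment.split("//"):
            segment = segment.strip()
            if not segment:
                continue
            kind = "code_like" if CODE_CHARS.search(segment) else "prose"
            counts[kind] += 1
    return counts


# Abbreviated stand-in for the quoted excerpt (hypothetical names a, b).
snippet = """\
bool rc_match = 0;
//if ((a.max_match_qry < b.min_match_qry))// || (b.max_match_qry < a.min_match_qry))//non overlap match
if ((a.max_match_ref < b.min_match_ref))// || (b.max_match_qry < a.min_match_qry))//non overlap match
"""
print(classify_comments(snippet))
```

Running something like this over a whole source tree would give a ballpark ratio, which is about all a heuristic of this kind can offer.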

        • #5
          Tried DBG2OLC

          I'm quite pleased with the results of DBG2OLC.

          I corresponded with the authors, managed to closely replicate the results from their paper, and made some pretty decent draft assemblies of my own with minimal data. Fast performance and good results.

          • #6
            I've been getting segmentation faults, unfortunately. I expect that there are some assumptions made by their code that I am violating.
