  • Question about next step in a de novo bacterial genome assembly

    I am trying to de novo assemble a large bacterial genome (~9.2 Mb) with a high GC content (~67%). We have paired-end data from a single MiSeq run. Using a couple of different programs (SPAdes and A5), we have been able to assemble our data into contigs (~700-900). Obviously, we have gaps and are very likely missing some regions of the genome, as our contigs span only ~8.8 Mb. I am new to genome sequencing and do not want to cut corners. At the same time, I would like to avoid unnecessary costs if possible for this assembly. From what I understand, I see two options:

    A. More short-read data. We could do an additional MiSeq run, either starting again at the library prep stage or using excess DNA saved after the first library prep. This would provide more short-read data, but I am unsure whether it would only give reads similar to those we already have. Does anyone have experience with this? Is a second run likely to sequence only the same regions as the first, or can we expect to get data on previously unsequenced regions?

    B. Long reads. These will help join contigs into scaffolds and hopefully a full genome, but we will likely have very low coverage and inaccuracies in the areas of the genome that the MiSeq missed.

    Any recommendations on whether A or B alone should be sufficient for a genome assembly given where we are, or will both be necessary? Thanks for the help!

  • #2
    With low-coverage long read data, you can also correct the long reads with the Illumina reads.

    It's unlikely that you are completely missing coverage of 400 kbp of your genome. Rather, Illumina reads are too short to resolve many types of repeats, so those regions tend to get collapsed or broken into tiny contigs short enough that they were ignored for the purpose of statistics. It is unlikely that additional short-read coverage would help you (though to best determine that, you'd need to post the coverage distribution obtained by mapping the reads back to the assembly).

    Currently, we use either exclusively Illumina or exclusively PacBio for microbe assemblies, so I don't really know much about the current state of the art in hybrid assembly, but assembling a bacterium into 1 perfect contig with pure PacBio is pretty easy. That said, 9.2 Mbp is huge, so maybe it would take ~4 SMRT cells for a pure PacBio assembly...

    P.S. You can often improve a SPAdes assembly by preprocessing the Illumina data in various ways (error-correction, read merging, read extension, duplicate removal, quality-filtering, etc.), which is certainly the cheapest approach, though it won't give you a single-contig assembly.
    Last edited by Brian Bushnell; 07-28-2017, 11:16 AM.

    • #3
      Thanks!

      Our output from A5 says we have 460 scaffolds with a median coverage of 38X. The 10th percentile coverage is 20X. Our SPAdes runs have given a median coverage of 15X when we open the files in Bandage. Perhaps we need to do more preprocessing with SPAdes to get the outputs more consistent between programs. I'm not sure this is the info you asked for about the coverage distribution as a result of mapping.

      Unfortunately, I do not have access to a PacBio system, but there is someone in the department who has done MinION sequencing and could help with that. I agree about the hybrid assembly. I have tried to look for a program to do this, but haven't seen anyone really recommend anything. Since I've been doing Illumina, and more of the exact same doesn't sound like it will help, perhaps a mate-pair library would complement what I already have.

      • #4
        If you only have 15X coverage, more coverage would definitely help. If you have 38X... maybe. But coverage estimates from alignment are generally more trustworthy than what assemblers report. E.g.:

        Code:
        bbmap.sh in=reads.fq ref=assembly.fa covhist=covhist.txt covstats=covstats.txt ambig=all delcov=f
        ...then you can plot the histogram in Excel and see how much low-coverage area you have (that assembled).
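If you'd rather not use Excel, the same question (how much of the assembly sits at low coverage) can be answered with a few lines of Python. This is only a rough sketch: it assumes `covhist.txt` is a two-column, tab-delimited file (depth, number of bases) with `#`-prefixed header lines, and the function name and the threshold of 10 are made up for illustration. Check the columns of your own file before trusting the numbers.

```python
import io


def low_coverage_fraction(covhist_lines, threshold):
    """Sum bases at depth < threshold from a depth/bases histogram.

    Assumes each data line is "<depth>\t<numBases>"; lines starting
    with '#' are treated as headers and skipped (an assumption about
    the file layout, not something guaranteed here).
    Returns (low_bases, total_bases, fraction_low).
    """
    low = 0
    total = 0
    for line in covhist_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        depth = int(fields[0])
        bases = int(fields[1])
        total += bases
        if depth < threshold:
            low += bases
    fraction = low / total if total else 0.0
    return low, total, fraction


# Tiny made-up histogram, just to show the call; not real data.
example = "#Coverage\tnumBases\n0\t1000\n1\t500\n5\t2500\n20\t6000\n"
low, total, frac = low_coverage_fraction(io.StringIO(example), threshold=10)
print(f"{low} of {total} bases below 10X ({frac:.1%})")
```

For a real run you would pass an open file handle instead of the example string, e.g. `with open("covhist.txt") as fh: low_coverage_fraction(fh, 10)`.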

        Long mate libraries are also useful in improving continuity, but can be more expensive and complicated to make. I'm not sure about the details; I've only heard that anecdotally (as in, that's the reason we moved away from long-mate libraries).
