SEQanswers

07-27-2017, 10:35 AM   #1
JodyFranke
Junior Member
 
Location: Nebraska

Join Date: Jul 2017
Posts: 2
Question about next step in a de novo bacterial genome assembly

I am trying to de novo assemble a large bacterial genome (~9.2 Mbp) with a high GC content (~67%). We have paired-end data from a single MiSeq run. Using a couple of different programs (SPAdes and A5), we have been able to assemble our data into contigs (~700-900). Obviously, we have gaps and are very likely missing some regions of the genome, as our contigs span only ~8.8 Mbp. I am new to genome sequencing and do not want to cut corners. At the same time, I would like to avoid unnecessary costs for this assembly if possible. From what I understand, I see two options:

A. More short-read data. We could do an additional MiSeq run, either starting again at the library prep stage or using excess DNA saved after the first library prep. This would provide more short-read data, but I am unsure whether it would only give reads similar to the ones we already have. Does anyone have experience with this? Is a second run likely to sequence only the same regions as the first, or can we expect data on previously unsequenced regions?

B. Long reads. These would help join contigs into scaffolds and hopefully a full genome, but we would likely have very low coverage/inaccuracies in the areas of the genome that the MiSeq missed.

Any recommendations on whether A or B alone would be sufficient for a genome assembly given where we are, or will both be necessary? Thanks for the help!
07-28-2017, 11:10 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668

With low-coverage long-read data, you can also correct the long reads with the Illumina reads.

It's unlikely that you are completely missing coverage of 400 kbp of your genome. Rather, Illumina reads are too short to resolve many types of repeats, so the repeats tend to get collapsed or broken into tiny contigs short enough that they were ignored for the purpose of statistics. It is unlikely that additional short-read coverage would help you (though to determine that properly, you'd need to post the coverage distribution obtained by mapping the reads back to the assembly).
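As a rough check on why a 400 kbp hole from sampling alone is improbable: under the idealized Lander-Waterman (Poisson) model, the fraction of bases with zero coverage at mean depth c is about e^(-c). The sketch below applies that to the 9.2 Mbp genome from the thread. It assumes reads land uniformly at random, which high-GC regions can violate, so treat it as a lower bound on missing sequence rather than a prediction. The function name and the example depths are illustrative only.

```python
import math


def expected_uncovered_bases(genome_size_bp: float, mean_depth: float) -> float:
    """Expected number of zero-coverage bases under a Poisson coverage model.

    Assumes uniformly random read placement (no GC or library bias).
    """
    if genome_size_bp < 0 or mean_depth < 0:
        raise ValueError("genome size and depth must be non-negative")
    return genome_size_bp * math.exp(-mean_depth)


if __name__ == "__main__":
    genome = 9.2e6  # genome size quoted in the thread, in bp
    for depth in (5, 15, 38):  # illustrative mean depths
        missing = expected_uncovered_bases(genome, depth)
        print(f"{depth}X mean depth: ~{missing:.3g} bp expected with zero coverage")
```

Even at modest depth the expected uncovered length is a handful of bases, so a 400 kbp shortfall points to repeats or systematic bias rather than random undersampling.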

Currently, we use either exclusively Illumina or exclusively PacBio for microbe assemblies, so I don't really know much about the current state of the art in hybrid assembly, but assembling a bacterium into 1 perfect contig with pure PacBio is pretty easy. That said, 9.2 Mbp is huge, so maybe it would take ~4 SMRT cells for a pure PacBio assembly...

P.S. You can often improve a SPAdes assembly by preprocessing the Illumina data in various ways (error-correction, read merging, read extension, duplicate removal, quality-filtering, etc.), which is certainly the cheapest approach, though it won't give you a single-contig assembly.

Last edited by Brian Bushnell; 07-28-2017 at 11:16 AM.
07-28-2017, 01:13 PM   #3
JodyFranke
Junior Member
 
Location: Nebraska

Join Date: Jul 2017
Posts: 2

Thanks!

Our output from A5 says we have 460 scaffolds with a median coverage of 38X. The 10th percentile coverage is 20X. Our SPAdes runs show a median coverage of 15X when we open the files in Bandage. Perhaps we need to do more preprocessing with SPAdes to get the outputs more consistent between programs. I’m not sure this is the info you asked for about the coverage distribution as a result of mapping.
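One possible contributor to the 38X versus 15X gap (an assumption, not something confirmed in this thread): SPAdes reports k-mer coverage for its largest k rather than per-base coverage, and Bandage displays the depth stored in the assembly graph. The standard conversion is Ck = C * (L - k + 1) / L, where C is base coverage, L is read length and k is the k-mer size. The read length of 250 and k of 127 used below are hypothetical values chosen for illustration; the actual run settings are not given.

```python
def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """Convert per-base coverage to k-mer coverage: Ck = C * (L - k + 1) / L."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_len - k + 1) / read_len


if __name__ == "__main__":
    # Hypothetical settings: 250 bp reads and a largest k of 127.
    ck = kmer_coverage(base_coverage=38, read_len=250, k=127)
    print(f"38X base coverage corresponds to about {ck:.1f}X k-mer coverage")
```

If the two tools are reporting different kinds of coverage, the numbers would not be expected to match, and the alignment-based estimate is the one to compare against.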

Unfortunately, I do not have access to a PacBio system, but there is someone in the department who has done MinION sequencing and could help with that. I agree about the hybrid assembly. I have tried to look for a program to do this, but haven’t seen anyone really recommend anything. Since I’ve been doing Illumina and more of the same doesn’t sound like it will help, perhaps a mate-pair library would complement what I already have.
07-28-2017, 02:55 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668

If you only have 15X coverage, more coverage would definitely help. If you have 38X... maybe. But coverage estimates from alignment are generally more trustworthy than what assemblers report. E.g.:

Code:
bbmap.sh in=reads.fq ref=assembly.fa covhist=covhist.txt covstats=covstats.txt ambig=all delcov=f
...then you can plot the histogram in Excel and see how much low-coverage area you have (that assembled).
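As an alternative to plotting in Excel, the histogram can be summarized with a few lines of Python. This is a minimal sketch that assumes covhist.txt is a two-column, whitespace-separated table of depth and number of bases, with header lines starting with `#`; check the actual file before relying on it. The function name and the threshold of 10X are illustrative choices.

```python
from typing import Iterable


def low_coverage_fraction(lines: Iterable[str], threshold: int) -> float:
    """Fraction of assembled bases whose depth is below `threshold`.

    Expects rows of "<depth> <num_bases>"; lines starting with '#' are skipped.
    """
    total = 0
    low = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        depth_str, count_str = line.split()[:2]
        depth, count = int(depth_str), int(count_str)
        total += count
        if depth < threshold:
            low += count
    return low / total if total else 0.0


if __name__ == "__main__":
    # Made-up rows in the assumed layout, standing in for a real covhist.txt.
    sample = ["#Coverage\tnumBases", "0\t100", "5\t300", "20\t600"]
    print(f"{low_coverage_fraction(sample, threshold=10):.1%} of bases below 10X")
```

For a real run, pass an open file handle instead of the sample list, e.g. `with open("covhist.txt") as fh: low_coverage_fraction(fh, 10)`.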

Long mate libraries are also useful in improving continuity, but can be more expensive and complicated to make. I'm not sure about the details; I've only heard that anecdotally (as in, that's the reason we moved away from long-mate libraries).

Tags
bacterial assembly, de novo

All times are GMT -8.