SEQanswers

11-25-2015, 07:00 AM   #1
joneill4x
Junior Member
 
Location: Canada

Join Date: Nov 2015
Posts: 3
De novo genome assembly strategy

I am assembling a genome de novo. I have:

10X coverage with PacBio reads

100X coverage with Illumina short reads (150 bp paired-end reads)

20X coverage with long MiSeq reads (max length 800 bp)

Given what I have to work with, what would be the best strategy to assemble the genome and why?

Thank you,

Joe
11-27-2015, 01:34 AM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 833

What's your approximate genome size? How repetitive is the genome? It's quite important to know things like that for trying to work out the best method.

I'm trying to work this out myself. I've got long reads from MinION sequencing at ~0.3X coverage, a de novo assembled transcriptome that seems to have about the right size and number of genes, and I can pull short-read Illumina data from EBI. The estimated genome size is about 200 Mbp, which probably rules out SPAdes as an assembler that could finish in a reasonable time frame.

In general, this is [still] a fairly difficult problem, and one of the few areas of genetics that can still benefit from a huge computer cluster.

Last edited by gringer; 03-29-2016 at 12:18 PM. Reason: remove repetitive text
11-27-2015, 05:44 AM   #3
joneill4x
Junior Member
 
Location: Canada

Join Date: Nov 2015
Posts: 3

Thanks Gringer.

The estimated genome size is quite large: 20 Gb.
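For a sense of scale, here is a rough sketch of how much raw sequence those coverage figures imply at 20 Gb. It only applies the usual definition (coverage = total sequenced bases / genome size); the function names are made up for illustration, and the Illumina read count assumes every read is exactly 150 bp.

```python
# Assumed inputs: the 20 Gb estimate and the coverage figures quoted in this thread.
GENOME_SIZE_BP = 20_000_000_000

COVERAGE = {
    "PacBio": 10,
    "Illumina 150 bp paired-end": 100,
    "MiSeq long reads": 20,
}


def total_bases(coverage, genome_size_bp):
    """Total sequenced bases implied by a coverage depth."""
    return coverage * genome_size_bp


def approx_read_count(coverage, genome_size_bp, read_length_bp):
    """Approximate number of reads, assuming a fixed read length (a simplification)."""
    return total_bases(coverage, genome_size_bp) // read_length_bp


if __name__ == "__main__":
    for name, cov in COVERAGE.items():
        gb = total_bases(cov, GENOME_SIZE_BP) / 1e9
        print(f"{name}: about {gb:.0f} Gb of raw sequence")
    reads = approx_read_count(100, GENOME_SIZE_BP, 150)
    print(f"Illumina reads at 150 bp: about {reads:,}")
```

So the short-read set alone is on the order of 2,000 Gb of raw sequence, which is why memory and run time matter so much when picking an assembler.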

After reading around, I have decided to try DBG2OLC.

What led me there:
https://github.com/PacificBioscience...Bio-Long-Reads

The publication:
http://arxiv.org/ftp/arxiv/papers/1410/1410.2801.pdf

The code:
http://sourceforge.net/projects/dbg2olc/

I'll report back on how it turns out.
11-27-2015, 11:25 AM   #4
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 833

Hmm, I like the "forget about error correction" approach. For the genome I'm working with there's at least one gene with at least three different copies in the genome that are all expressed, so error correction is likely to result in misassembly.
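To illustrate the worry, here is a toy sketch of consensus-style correction. It is not how any particular corrector or DBG2OLC works: it is a plain column-wise majority vote over made-up, pre-aligned reads from two hypothetical gene copies that differ at one base.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def naive_correct(aligned_reads):
    """Replace every read with the consensus (a deliberately crude 'corrector')."""
    consensus = majority_consensus(aligned_reads)
    return [consensus for _ in aligned_reads]


# Hypothetical data: two copies of a gene that differ at a single position.
COPY_A = "ACGTACGT"
COPY_B = "ACGTTCGT"

# Reads from both copies pile up together because the copies are nearly identical.
READS = [COPY_A, COPY_A, COPY_A, COPY_B, COPY_B]

if __name__ == "__main__":
    corrected = naive_correct(READS)
    print("consensus:", majority_consensus(READS))
    print("reads still supporting copy B:", corrected.count(COPY_B))
```

With three reads from one copy and two from the other, the vote sides with the majority and every read supporting the second copy is rewritten, so the assembler never sees that a second copy exists.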

Unfortunately, the code's a little green. I'm not sure I trust code where the amount of commented out function exceeds the amount of non-functional comments:

Code:
...
	bool rc_match = 0;
	if (align_info_vec[0].ref_idx == -align_info_vec[1].ref_idx)
	{
		//if ((align_info_vec[0].max_match_qry < align_info_vec[1].min_match_qry))// || (align_info_vec[1].max_match_qry < align_info_vec[0].min_match_qry))//non overlap match
		if ((align_info_vec[0].max_match_ref < align_info_vec[1].min_match_ref))// || (align_info_vec[1].max_match_qry < align_info_vec[0].min_match_qry))//non overlap match
			{
			map<int, int> local_index_qry;
...
Emotive statements in the paper don't help either, particularly when I don't completely agree with them:

Quote:
Similar to Microsoft®Windows software to PC, the indispensability of genome assembly software to DNA sequencers is self-evident.
....
While these algorithms and software packages have indeed achieved significant advancements for the 3rd GS genome assembly, the somewhat ad-hoc and intricate approaches some of the packages use may lead to structural errors since the path may be spurious due to chimeric long reads or may not exist due to limited coverage of the second generation sequencing.
It looks like it's in a state of heavy development, so it will probably take at least a few months for the dust to settle and for the tool to be useful.
03-29-2016, 06:22 AM   #5
joneill4x
Junior Member
 
Location: Canada

Join Date: Nov 2015
Posts: 3
Tried DBG2OLC

I'm quite pleased with the results of DBG2OLC.

I corresponded with the authors, managed to closely replicate the results from their paper, and made some pretty decent draft assemblies of my own with minimal data. Fast performance and good results.
03-29-2016, 01:01 PM   #6
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 833

I've been getting segmentation faults, unfortunately. I expect that their code makes some assumptions that I am violating.

Tags
assembly, de novo assembly, genome, illumina, pacbio
