SEQanswers

Old 12-16-2011, 06:22 AM   #1
Godevil
Member
 
Location: Japan

Join Date: Feb 2011
Posts: 22
Default The best genome de novo assembly software using hybrid data (Illumina, 454 & Sanger)?

Hello everyone,

I'd like to start a discussion about the best software for de novo assembly using hybrid sequencing data (Sanger, Illumina, 454, PacBio, etc.).

It is well known that a mixture of insert sizes and read lengths helps assembly, and different sequencing platforms give different read lengths. However, few assemblers support hybrid sequencing data.

I'm assembling the planarian genome de novo. The genome is big (~1.9 Gb) and contains a lot of repetitive sequence (~66% repeats), so it is one of the most difficult genomes to assemble de novo.

I already have Illumina, 454 and Sanger data, and I am trying to use all of them in the assembly. So far I have tried Velvet, SOAPdenovo, ABySS and ALLPATHS-LG, and I will try Celera next. However, only ALLPATHS-LG and Celera seem OK for hybrid data, and even their results are not so good.

Is anyone else doing similar work and also trying to use hybrid data for assembly? I look forward to discussing it with you!
Old 12-16-2011, 08:44 AM   #2
severin
Genome Informatics Facility
 
Location: Iowa @isugif

Join Date: Sep 2009
Posts: 105
Default Other Assembly programs

Ray can also handle the assembly of multiple formats.
Old 12-17-2011, 06:08 AM   #3
Ole
Member
 
Location: Oslo, Norway

Join Date: Oct 2011
Posts: 17
Default

You could try MSR-CA (http://www.genome.umd.edu/SR_CA_MANUAL.htm; the source code is here: ftp://ftp.genome.umd.edu/pub/MSR-CA/) too, if you can get it up and running properly. I haven't managed to get it to run properly on my complete dataset yet; it seems to have a couple of bottlenecks or some oddly designed code. I ran into memory problems with 1.3.3, and 1.4b has some Perl scripts that are really slow (reduce_sr.pl has been running for 3-4 days now).

The premise of MSR-CA is really interesting, though: assemble Illumina reads into highly confident unitigs/contigs with a de Bruijn graph, then combine these with other data (454, Sanger) in CA afterwards.
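
As a rough illustration of the first half of that idea (not MSR-CA's actual code), here is a minimal Python sketch that builds a de Bruijn graph from reads and compacts non-branching paths into unitigs. The function names and the toy reads are made up for the example; it ignores reverse complements, sequencing errors and k-mer coverage, all of which a real assembler has to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unitigs(graph):
    """Compact non-branching paths of the graph into unitig strings.

    Paths start at nodes that are not 'simple' (one edge in, one edge out).
    Isolated cycles made only of simple nodes are not reported in this sketch.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    result = []
    for node in sorted(nodes):
        if is_simple(node):
            continue
        for succ in sorted(graph.get(node, ())):
            path = node + succ[-1]
            cur = succ
            # Extend while the path cannot branch or merge.
            while is_simple(cur):
                cur = next(iter(graph[cur]))
                path += cur[-1]
            result.append(path)
    return result


# Toy example: two overlapping reads collapse into a single unitig.
toy_graph = build_de_bruijn(["ATGGCGT", "GGCGTGC"], k=4)
print(unitigs(toy_graph))
```

In the MSR-CA scheme as described above, sequences like these (rather than the raw short reads) would then go into CA together with the 454 and Sanger reads.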
Old 12-18-2011, 11:09 PM   #4
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415
Default

The big question is whether there ever will be one tool for all (these) different datatypes. The different assemblers out there are tailored to different sequencing platforms for good reasons. Short reads cannot practically be assembled using an OLC-based approach; this was solved by using de Bruijn graphs. Now that these short-read technologies reach 100 bases, and 150 on the MiSeq (and GAIIx, apparently), this might change, though.

So, perhaps using the best assembler for each datatype, and then developing a merging strategy would be better? Getting the best contigs possible first, then merge them and scaffold them using the best scaffolder?

In this respect, the MSR-CA approach is quite interesting.
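
To illustrate the OLC point: the overlap step compares reads pairwise, so a naive implementation does work proportional to the square of the number of reads, which is a commonly cited reason it struggles with hundreds of millions of short reads. A minimal Python sketch of that step follows; the function names and toy reads are made up for the example, and real OLC assemblers use indexing and tolerate mismatches.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def all_overlaps(reads, min_len=3):
    """Naive all-vs-all overlap detection.

    Checks len(reads) * (len(reads) - 1) ordered pairs, which is what makes
    this approach expensive for very large numbers of reads.
    """
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    found[(i, j)] = n
    return found


toy_reads = ["ATGGCGT", "GCGTGC", "TGCAAA"]
print(all_overlaps(toy_reads))
```

Longer reads help on both counts: fewer reads are needed for the same coverage, and each overlap is more informative.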
Old 12-19-2011, 12:18 AM   #5
Godevil
Member
 
Location: Japan

Join Date: Feb 2011
Posts: 22
Default

Quote:
Originally Posted by flxlex View Post
Getting the best contigs possible first, then merge them and scaffold them using the best scaffolder?

In this case, the scaffolds may be better, but not the contigs.

I'm performing genome assembly with SOAPdenovo. This software can assemble Illumina short reads into contigs and then generate scaffolds with some additional long reads (such as 454 and Sanger), which is similar to the procedure you describe.

But in my work, the contigs from SOAPdenovo are always very short. That's why I want to find software that can generate contigs from all of the data together. Maybe then we can get much better contigs.
Old 12-19-2011, 03:11 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

MIRA supports Illumina, 454, Sanger and Ion Torrent data. And I think Bastien is looking into PacBio as well.
Old 01-26-2012, 07:13 AM   #7
SLB
Member
 
Location: Ireland

Join Date: Sep 2010
Posts: 21
Default

Has anyone tried the recent version of Celera Assembler (7.0), which allows up to 2 billion reads? I have it running now with 700 million reads of ~140 bp and some 454 data, and I'm pretty eager to see how it turns out.

Also, has anyone been able to get MSR-CA running? I downloaded version 4, but it seems to stop during the super-read generation stage.
Old 01-26-2012, 08:50 AM   #8
Ole
Member
 
Location: Oslo, Norway

Join Date: Oct 2011
Posts: 17
Default

I've started a couple of assemblies of only 454 reads (about 45 million and 85 million, respectively) with CA 7.0, but they are still at the scaffolding step, and I reckon they will run for a week or two more.

I've gotten MSR-CA 1.4 to run properly, but only on bacterial datasets (the Rhodobacter one from GAGE). I've tried it on our Illumina reads too (we have 200 million reads or so, with more coming in a few weeks), but it spent a really long time on the reduce_sr.pl step (about 2-3 weeks), and I had to stop it before it finished. So it is possible, but I think the implementation of reduce_sr.pl is a bottleneck when using MSR-CA on larger datasets. I'll get back to you when I have some experience with our new Illumina reads (in six weeks' time).
Old 01-27-2012, 09:09 AM   #9
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53
Default

Here at Cofactor Genomics, we've seen limited success.
We have had good results with transcript sequence: we preassembled Illumina and 454 reads separately and then brought them together with an OLC assembler. Here's a case where we didn't even cover the entire genome (2.6 Mb) until the hybrid assembly:

https://docs.google.com/open?id=0ByS...RhZWQ2NTM2OGEz



We are currently working on getting the same type of success with genomic sequence. Come see us at AGBT where we are presenting what does/doesn't work.

@Godevil
What kind of results are you getting on the Planarian assembly? How much sequence coverage do you have on each platform? We've done this recently and had a difficult time getting results.
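
For anyone answering the coverage question, expected coverage is just total sequenced bases divided by genome size. A small Python sketch, using the ~1.9 Gb genome size mentioned earlier in the thread; the read counts and read lengths below are hypothetical placeholders, not anyone's real data.

```python
def coverage(num_reads, mean_read_len, genome_size):
    """Expected fold coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_len / genome_size


GENOME_SIZE = 1_900_000_000  # ~1.9 Gb, as stated in the first post

# Hypothetical per-platform inputs: (number of reads, mean read length in bp).
platforms = {
    "Illumina": (190_000_000, 100),
    "454": (9_500_000, 400),
}

for name, (num_reads, read_len) in platforms.items():
    print(f"{name}: {coverage(num_reads, read_len, GENOME_SIZE):.1f}x")
```

Reporting coverage per platform this way makes it easier to compare datasets between groups.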

Last edited by ians; 01-31-2012 at 07:20 AM.
Old 02-21-2012, 12:47 PM   #10
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53
Default AGBT Poster

I thought I'd share with everyone our AGBT poster, which outlines the success we had with consolidating multi-platform sequence to produce hybrid assemblies.
We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

AGBT Poster

Last edited by ians; 03-29-2012 at 06:34 AM.
Old 02-22-2012, 12:33 AM   #11
vadim
Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 37
Default

Quote:
Originally Posted by ians View Post
I thought I'd share with everyone our AGBT poster, which outlines the success we had with consolidating multi-platform sequence to produce hybrid assemblies.
We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

https://docs.google.com/open?id=0BySV4NmVGJNfZTA4Mjg3MDEtMTAxMi00NGM0LTljOWEtYmM2N2ZjMThiZTNh
The link is broken.
Old 02-22-2012, 05:21 AM   #12
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53
Default

Quote:
Originally Posted by vadim View Post
The link is broken.
Oops. Fixed!
Old 03-08-2012, 04:43 PM   #13
Godevil
Member
 
Location: Japan

Join Date: Feb 2011
Posts: 22
Default

Quote:
Originally Posted by ians View Post

@Godevil
What kind of results are you getting on the Planarian assembly? How much sequence coverage do you have on each platform? We've done this recently and had a difficult time getting results.

I cannot see your document.

Our genome assembly is bad. I think that's because of low GC content, big genome size and high repetitiveness.
I'm now taking a training course at BGI in China. I hope I can get some useful information.
Old 03-15-2012, 03:09 AM   #14
erhuangzi
Junior Member
 
Location: bj

Join Date: Feb 2012
Posts: 3
Default question

Quote:
Originally Posted by Ole View Post
I've started a couple of assemblies of only 454 reads (about 45 million and 85 million, respectively) with CA 7.0, but they are still at the scaffolding step, and I reckon they will run for a week or two more.

I've gotten MSR-CA 1.4 to run properly, but only on bacterial datasets (the Rhodobacter one from GAGE). I've tried it on our Illumina reads too (we have 200 million reads or so, with more coming in a few weeks), but it spent a really long time on the reduce_sr.pl step (about 2-3 weeks), and I had to stop it before it finished. So it is possible, but I think the implementation of reduce_sr.pl is a bottleneck when using MSR-CA on larger datasets. I'll get back to you when I have some experience with our new Illumina reads (in six weeks' time).
Which step is the reduce_sr.pl script used in? There is no information about it in the software's manual.
Old 03-19-2012, 01:25 AM   #15
Ole
Member
 
Location: Oslo, Norway

Join Date: Oct 2011
Posts: 17
Default

Quote:
Originally Posted by erhuangzi View Post
Which step is the reduce_sr.pl script used in? There is no information about it in the software's manual.
The MSR-CA manual is pretty lacking, but this is the step where the program tries to find redundant super-reads and remove them. That's my guess, at least.
Old 03-19-2012, 01:48 AM   #16
erhuangzi
Junior Member
 
Location: bj

Join Date: Feb 2012
Posts: 3
Thumbs up

Quote:
Originally Posted by Ole View Post
The MSR-CA manual is pretty lacking, but this is the step where the program tries to find redundant super-reads and remove them. That's my guess, at least.
I haven't been able to get MSR-CA running. Can you run it? I want to use this software; how do I use it? What are the steps? Thanks.
Old 03-19-2012, 02:20 AM   #17
Ole
Member
 
Location: Oslo, Norway

Join Date: Oct 2011
Posts: 17
Default

Quote:
Originally Posted by erhuangzi View Post
I haven't been able to get MSR-CA running. Can you run it? I want to use this software; how do I use it? What are the steps? Thanks.
It's not that hard to get it running: just point it to your FASTQ files and include the expected fragment size and its standard deviation. The manual, though it could be better, covers that part pretty well: http://www.genome.umd.edu/SR_CA_MANUAL.htm

It could be useful to read the GAGE recipes too: http://gage.cbcb.umd.edu/recipes/msrca.html

Ole
Old 03-26-2012, 10:31 AM   #18
Nico55
Junior Member
 
Location: Wa.

Join Date: Dec 2011
Posts: 7
Default cool poster quick question

Quote:
Originally Posted by ians View Post
I thought I'd share with everyone our AGBT poster, which outlines the success we had with consolidating multi-platform sequence to produce hybrid assemblies.
We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

AGBT Poster
Are figures 2 and 3 supposed to start with DNA, not RNA?

Last edited by Nico55; 03-26-2012 at 10:33 AM. Reason: spelling
Old 03-29-2012, 06:36 AM   #19
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53
Default

Quote:
Originally Posted by Nico55 View Post
Are figures 2 and 3 supposed to start with DNA, not RNA?
Yes, sorry. That was an old version. Since then, I've posted it on our site.
Old 03-29-2012, 06:42 AM   #20
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53
Default

Quote:
Originally Posted by Godevil View Post
I cannot see your document.

Our genome assembly is bad. I think that's because of low GC content, big genome size and high repetitiveness.
I'm now taking a training course at BGI in China. I hope I can get some useful information.
Hm, let us know if you learn anything earth-shattering from BGI.

Soon, I'll have two more chances to assemble planarian (both sexual and asexual). Since then, we've uncovered some heavy adapter contamination in our LIMP libraries. After re-sequencing, we'll see if this makes any difference.

The planarian genome remains very difficult to assemble, but we'll see if we can get any closer.

Tags
de novo assembly, hybrid sequencing data, software
