SEQanswers

Old 05-15-2014, 02:59 AM   #1
Retro
Member
 
Location: Pennsylvania

Join Date: Apr 2011
Posts: 27
Default ultrafast de novo assembly?

Is there any way to run a "quick and dirty" de novo assembly of Illumina reads from a genome? All we need is to obtain contigs at least several hundred nucleotides long. Our current runs with SOAPdenovo and Velvet give good results but are far too time-consuming for what we need.
Thank you for any suggestions.
Old 05-15-2014, 04:37 AM   #2
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

Minia is a quick and memory-efficient de novo assembler, but the results are not so accurate.
Old 05-15-2014, 08:15 AM   #3
sujaikumar
Junior Member
 
Location: Edinburgh, UK

Join Date: Jul 2009
Posts: 2
Default

CLC Assembly Cell is probably one of the quickest out there with reasonable results. It's expensive, but they have a 2-week trial version so you can see if it meets your needs.
Old 05-15-2014, 09:48 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by TiborNagy
Minia is a quick and memory-efficient de novo assembler, but the results are not so accurate.
Minia may be memory efficient, but I found it to take orders of magnitude longer than Velvet.

For fast assembly, I suggest subsampling or normalizing the input reads first, to reduce the input volume - that speeds things up a lot with Velvet, at least. Subsampling is faster but normalizing is often better. You can do either with BBTools. I find that a target depth of around 40 works well with Velvet.

subsample:
reformat.sh in=reads.fq out=sampled.fq samplerate=0.1

normalize:
bbnorm.sh in=reads.fq out=normalized.fq target=40

If you have paired reads in 2 files, you can use the in1, in2, out1, and out2 flags, and pairs will stay together.
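
For picking the samplerate value, a rough back-of-the-envelope approach is to divide the target depth by the current depth (number of reads × read length / genome size). The Python sketch below assumes made-up numbers (20 million 100 bp reads from a 5 Mbp genome) and a hypothetical helper name; only the reformat.sh flags are taken from the commands above.

```python
def sample_rate_for_depth(n_reads, read_len, genome_size, target_depth):
    """Fraction of reads to keep so the expected depth is about target_depth."""
    current_depth = n_reads * read_len / genome_size
    # Never ask for more than 100% of the reads.
    return min(1.0, target_depth / current_depth)


# Hypothetical run: 20 million 100 bp reads from a 5 Mbp genome (400x depth).
rate = sample_rate_for_depth(20_000_000, 100, 5_000_000, 40)
print(f"reformat.sh in=reads.fq out=sampled.fq samplerate={rate:.3f}")
```

If the genome size is only roughly known, the resulting rate is only a rough guide as well.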
Old 05-16-2014, 03:02 AM   #5
Retro
Member
 
Location: Pennsylvania

Join Date: Apr 2011
Posts: 27
Default normalization

Thanks for the suggestions.

As for CLC: we do have CLC Genomics Workbench. It works great but is still too slow for what we need, not much different from Velvet.

As for normalizing reads before assembly: I do not understand the methods well enough, but I was told that normalization is not good for assembly methods based on k-mers, possibly because those methods need information about the abundance of reads containing each k-mer, and normalization would discard it. I am not sure whether it counts as normalization, but I wanted to use the USEARCH program to reduce the number of reads (dereplication or UCLUST). USEARCH is fast enough for our planned throughput.
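
To make the abundance concern concrete, here is a toy Python sketch of what a coverage normalizer does to relative abundance. It assumes a simplified digital-normalization-style rule (keep a read only while the median count of its k-mers among already-kept reads is below a target). This is an illustration under that assumption, not the actual algorithm of BBNorm or any other tool, and the sequences, k, and target are made up.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k, target):
    """Keep a read only if the median count of its k-mers among kept reads is below target."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if median(counts[km] for km in read_kmers) < target:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Two hypothetical sequences sharing no 15-mers, at 200x and 20x depth.
abundant = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
rare = "TTGACCATGGCAATCGGTTAAGCCTAGGAC"
reads = [abundant] * 200 + [rare] * 20

normalized = normalize(reads, k=15, target=10)
# Every tenth read: a deterministic stand-in for random subsampling.
subsampled = reads[::10]

print("normalized:", normalized.count(abundant), normalized.count(rare))
print("subsampled:", subsampled.count(abundant), subsampled.count(rare))
```

In this toy case the normalizer flattens the 10:1 abundance ratio, while taking every tenth read preserves it.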
Old 05-16-2014, 09:46 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The effect of normalization really depends on the normalizer and the assembler.

In my testing of BBNorm, normalization universally improves the L50 with Velvet. Some other metrics (total number of errors, total size, longest contig length, total number of contigs) may go up or down slightly, but there is generally a positive trend. There's also typically a positive trend with Soap. AllPathsLG appears to be much more sensitive to read abundance patterns, and normalization seems to have a negative impact just as often as a positive one.

But subsampling does not change the relative read abundance; it just scales it down by a constant factor across the whole genome. So if you are worried about the effects of normalization, subsampling is a better option, and it's extremely fast. Dereplication is not a bad idea, but if you only remove identical read pairs, it won't decrease your data volume much. If you treat the data as single-ended and remove all duplicate individual reads, it will reduce your data much more. However, dereplication DOES increase the error rate, since reads with errors are less likely to be duplicates. You may wish to error-correct first, which BBNorm can also do; that will cause more reads to be removed.
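
The point about dereplication raising the error rate can be seen in a small simulation. The Python sketch below is a toy model under made-up assumptions (a random 200 bp genome, single-ended 50 bp reads, at most one substitution per read, 10% of reads affected); it is not how any particular dereplication tool works internally.

```python
import random


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Return (sequence, has_error) pairs drawn from random genome positions."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        seq = list(genome[start:start + read_len])
        has_error = rng.random() < error_rate
        if has_error:
            pos = rng.randrange(read_len)
            # Substitute a base that differs from the original one.
            seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
        reads.append(("".join(seq), has_error))
    return reads


def error_fraction(reads):
    """Share of reads that carry a simulated sequencing error."""
    return sum(has_error for _, has_error in reads) / len(reads)


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
reads = simulate_reads(genome, 20_000, 50, 0.1, rng)

# Subsampling: keep each read with a fixed probability, regardless of content.
subsampled = [r for r in reads if rng.random() < 0.1]

# Dereplication: keep one copy of each distinct sequence.
dereplicated = list({seq: (seq, err) for seq, err in reads}.values())

for name, subset in [("all", reads), ("subsampled", subsampled),
                     ("dereplicated", dereplicated)]:
    print(f"{name}: n={len(subset)}, error fraction={error_fraction(subset):.2f}")
```

Error-free reads from the same position collapse into one copy, while most erroneous reads are unique and survive, so the error fraction rises after dereplication but stays about the same after subsampling.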
Old 05-27-2014, 05:35 PM   #7
rchikhi
Member
 
Location: France

Join Date: Jan 2013
Posts: 13
Default

Minia dev here. I regret to hear that for some of you Minia has been inaccurate or too slow.

Minia is IO-intensive, so it can be slow if you run it on a network-mounted folder (e.g. your cluster's home directory). On a regular hard drive, or even better an SSD, it will be quick; I stand by the claim that human-sized genomes are assembled in a day on a plain desktop computer.

Regarding the quality of Minia results, in my tests (using QUAST) I never noticed more misassemblies than with other assemblers. TiborNagy, could you elaborate on your comment?

To contribute to the thread: if all you have is a single machine with many CPUs, then SOAPdenovo2/Velvet using all CPU cores are likely to be the fastest assemblers. Minia's pretty fast using just 1 thread. I recall that ABySS was able to assemble a human genome in half a day using a cluster, and it's likely that the Ray assembler will match this performance as well.
Old 05-28-2014, 06:45 AM   #8
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

Quote:
Originally Posted by rchikhi
TiborNagy, could you elaborate on your comment?
Of course I can. I tried Minia 3 years ago, when I was trying to assemble human HLA genes with different assemblers. (Yes, it is a very hard task, I know.) I mapped the contigs back to the human reference and looked at the coverage. Minia was the fastest program, but the contigs were too small. (Sorry, I cannot remember the exact values.)

I have read the Minia article. It is a clever algorithm, but it does not fit every task.
Old 05-28-2014, 08:04 AM   #9
rchikhi
Member
 
Location: France

Join Date: Jan 2013
Posts: 13
Default

Thanks for your comment.

There's a difference between the accuracy of contigs (misassemblies, mismatches) and contiguity (how long the contigs are).

Yes, it makes sense to say that Minia doesn't always produce the most contiguous results, given that it has a very simple contig construction algorithm that doesn't use read information or paired-end data. However, in terms of accuracy (misassemblies, mismatches), it should perform reasonably well.
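
As an illustration of what a contiguity number measures, the Python sketch below computes the contig length at which half of the assembled bases sit in contigs that long or longer. Many tools report this length as N50; earlier posts in this thread call the length-based figure L50. The function name and the contig lengths are hypothetical.

```python
def half_assembly_length(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Two hypothetical assemblies with the same total size (1000 bp).
contiguous = [500, 300, 200]
fragmented = [100] * 10

print("contiguous:", half_assembly_length(contiguous))
print("fragmented:", half_assembly_length(fragmented))
```

The statistic depends only on contig lengths, so it says nothing about misassemblies or mismatches; those need a comparison against a reference, for example with QUAST as mentioned above.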
Old 05-28-2014, 08:36 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I should clarify that I've only tried Minia once, and it was on a metagenome of unknown size and composition (the assemblies came out at 30 to 60 Mbp). I ran Velvet, Spades, Soap, and Minia. Soap crashed; Velvet was the fastest, and Minia took a long time. None of the assemblies were any good (L50 much shorter than read length). Our disk subsystem is very unpredictable and often extremely slow, which could have been the problem.

So, that could be an anomalous result compared to running it on an isolate using local disk.
Old 05-28-2014, 08:52 AM   #11
rchikhi
Member
 
Location: France

Join Date: Jan 2013
Posts: 13
Default

Thanks for the details. A slow disk system is the only reason why Minia can be slow, so this makes sense.