SEQanswers

Old 11-14-2012, 04:15 PM   #1
rayanc
Junior Member
 
Location: FR

Join Date: Sep 2011
Posts: 6
Default Minia: ultra-low memory contigs assembly

Hi Seqanswers,

We would like to announce the open-source release of our de novo assembly software: Minia. Minia is based on a de Bruijn graph, and its main focus is ultra-low memory assembly.

It is capable of assembling a human genome on a desktop computer in a day. Minia produces contigs of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet).

Download link and PDF article: http://minia.genouest.org

Support site (similar to Stack Overflow): any questions, bugs, or feedback can be reported there.

Looking forward to hearing your feedback,

Rayan
Old 11-15-2012, 04:37 AM   #2
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

Interesting. I did not examine details yet, but a quick question - can Minia take advantage of larger memory, when available? I have two machines, for example, one with only 8 GB and another with 96 GB, would this make a difference?
Old 11-15-2012, 08:21 AM   #3
rayanc
Junior Member
 
Location: FR

Join Date: Sep 2011
Posts: 6
Default

Quote:
Originally Posted by yaximik
Interesting. I did not examine details yet, but a quick question - can Minia take advantage of larger memory, when available? I have two machines, for example, one with only 8 GB and another with 96 GB, would this make a difference?
Good question. Minia won't take advantage of larger memory, because the size of the data structure depends solely on the number of distinct solid k-mers.
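To illustrate the point about distinct solid k-mers, here is a minimal sketch. It assumes that "solid" means a k-mer seen at least `min_abundance` times and that a k-mer and its reverse complement count as one; the function names are hypothetical and this is not Minia's code. It only shows that the count depends on the reads, k, and the abundance threshold, not on how much RAM the machine has.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_solid_kmers(reads, k, min_abundance):
    """Count distinct canonical k-mers seen at least min_abundance times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N, etc.
                counts[canonical(kmer)] += 1
    return sum(1 for c in counts.values() if c >= min_abundance)


# Toy reads, purely illustrative.
reads = ["ACGTA", "ACGTA", "ACGTT"]
print(count_solid_kmers(reads, k=3, min_abundance=2))
```

Raising the abundance threshold shrinks the count, while adding more copies of reads that are already solid does not grow it.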

However, if you have plenty of memory, a great thing to do is to run 2-3 Minia instances in parallel to try various k-mer sizes.
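As a rough sketch of trying several k-mer sizes, the snippet below builds one command line per k. The argument order (`reads k min_abundance estimated_genome_size prefix`) is inferred from the command lines posted in this thread; the name `min_abundance` for the third argument, the file name `reads_list.txt`, and the k values 23, 27, 31 are assumptions chosen for illustration. Nothing is executed here; the commands are only printed.

```python
import shlex


def build_minia_commands(reads, kmer_sizes, min_abundance, genome_size, prefix):
    """Build one minia argument list per k-mer size, each with its own prefix."""
    commands = []
    for k in kmer_sizes:
        commands.append([
            "minia",
            reads,
            str(k),
            str(min_abundance),
            str(genome_size),
            f"{prefix}_k{k}",  # separate output prefix so runs do not collide
        ])
    return commands


# Hypothetical inputs; adjust to your own data.
commands = build_minia_commands("reads_list.txt", [23, 27, 31], 3, 3000000000, "try")
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into its own terminal, or the argument lists can be handed to `subprocess.Popen` to start the runs side by side.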
Old 11-24-2012, 05:24 PM   #4
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

All right, on my first try I got the following:
Code:
[yaximik@SciLinux55 minia assembly]$ minia SClist 27 3 3.5 Try1
estimated values: nbits Bloom 6, nb FP 0, max memory 1 MB
taille cell 16 
file SClist is interpreted as a list of file names
Reading 7 read files
Available disk space in /home/yaximik/Data/minia assembly: 13255 MB
Sequentially counting ~23078 MB of kmers with 16486 partition(s) using 1 thread(s), ~1 MB of memory and ~6523 MB of disk space
error during fopen: Too many open files  write 1 
[yaximik@SciLinux55 minia assembly]$
So, what does this mean?
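One possible reading of the transcript above: the run reported 16486 partitions and then failed with "Too many open files". Note that the genome-size argument here was 3.5, whereas the later run in this thread with 3500000000 reported only 4 partitions. The sketch below assumes (this is not confirmed anywhere in the thread) that each partition needs its own open file, and compares the partition count with the per-process open-file limit. The `reserved` margin is a hypothetical allowance for stdin, stdout, and the read files.

```python
import resource


def partitions_fit(n_partitions, soft_limit, reserved=16):
    """Return True if n_partitions files can plausibly be open at once.

    Assumes one open file per partition plus a small reserved margin.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return n_partitions + reserved <= soft_limit


# Query the current soft and hard limits on open files (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
for n in (16486, 4):
    print(f"{n} partitions, soft limit {soft}: fits = {partitions_fit(n, soft)}")
```

On many Linux systems the default soft limit is 1024, which 16486 partitions would exceed by a wide margin under this assumption.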
Old 11-24-2012, 06:38 PM   #5
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

Oops. Second attempt with the same dataset but on a bigger machine, RHEL 58, 96 GB memory:
Code:
[yaximik@G5NNJN1 minia]$ minia SCall.txt 27 3 3500000000 try1
estimated values: nbits Bloom 36, nb FP 50791608, max memory 8192 MB
taille cell 16 
file SCall.txt is interpreted as a list of file names
Reading 7 read files
Available disk space in /home/yaximik/AssRefMap/minia: 516283 MB
Sequentially counting ~23078 MB of kmers with 4 partition(s) using 1 thread(s), ~8192 MB of memory and ~6523 MB of disk space
Segmentation fault
[yaximik@G5NNJN1 minia]$
What could be the problem now?
Old 11-26-2012, 02:17 PM   #6
rayanc
Junior Member
 
Location: FR

Join Date: Sep 2011
Posts: 6
Default

My replies took a long time to be moderated. I communicated via email with yaximik and it turns out that his dataset had reads longer than what Minia expected. A future release of Minia will allow very long reads, and meanwhile, I sent him an updated version. Keep in mind that, due to the nature of Minia's assembly algorithm, very long reads aren't really useful.

Last edited by rayanc; 11-30-2012 at 10:24 AM.
Old 12-01-2012, 08:02 AM   #7
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

Yes, it did not quit on me with this dataset after the fix rayanc provided. However, it ran for 24 hours and I had to kill the process, as it slowed everything else on the smaller machine with 8 GB of memory to a crawl. Disk activity was non-stop all this time, yet all the temporary files generated were still empty. The dataset was not really big, around 8 GB in total. I wonder if the slowdown was due to shuffling data around in the limited amount of memory; at the beginning it said something about 8.1 GB being required (?). I did not try it on the 96 GB machine, though.
Old 12-01-2012, 08:09 AM   #8
rayanc
Junior Member
 
Location: FR

Join Date: Sep 2011
Posts: 6
Default

Quote:
Originally Posted by yaximik
I had to kill the process, as it slowed everything else on the smaller machine with 8 GB of memory to a crawl. [..] at the beginning it said something about 8.1 GB being required
If your machine has 8 GB of memory and Minia starts by saying that 8.1 GB is required, then it is expected that the machine will swap to disk a lot and slow every process down.

The "8.1 GB required" number comes from the estimated genome size (in your command line, it was 3.5 Gbp). If you are assembling a human genome, the assembled genome size will be closer to 2.7 Gbp. If you launch Minia again with the estimated_genome_size parameter equal to 2700000000, it will require less memory (around 6 GB) and I bet that it will complete without problems.
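A small sketch of checking the estimate before committing to a long run: it parses the "estimated values" line that Minia prints in the transcripts in this thread and compares "max memory" with the machine's RAM. The function names are hypothetical. The `bloom_bits_to_mb` helper reflects an observation only: in the two larger runs posted here, the reported max memory equals 2^nbits bits expressed in MB. Whether Minia actually computes it that way is an assumption.

```python
import re

# Pattern matches the line format shown in the transcripts of this thread.
ESTIMATE_RE = re.compile(r"nbits Bloom (\d+), nb FP (\d+), max memory (\d+) MB")


def parse_estimate(line):
    """Extract the numbers from Minia's 'estimated values' line."""
    match = ESTIMATE_RE.search(line)
    if match is None:
        raise ValueError("not an 'estimated values' line")
    nbits, nb_fp, max_mb = (int(g) for g in match.groups())
    return {"nbits_bloom": nbits, "nb_fp": nb_fp, "max_memory_mb": max_mb}


def likely_to_swap(max_memory_mb, physical_ram_mb):
    """Flag runs whose estimate leaves no headroom for the rest of the system."""
    return max_memory_mb >= physical_ram_mb


def bloom_bits_to_mb(nbits):
    """Size in MB of a bit array holding 2**nbits bits (observed pattern only)."""
    return (2 ** nbits) // 8 // (1024 * 1024)


line = "estimated values: nbits Bloom 36, nb FP 50791608, max memory 8192 MB"
estimate = parse_estimate(line)
print(estimate, likely_to_swap(estimate["max_memory_mb"], 8 * 1024))
```

If the check comes back true, lowering the estimated genome size as suggested above, or moving to a larger machine, is probably a better use of time than waiting on a swapping process.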
Old 12-03-2012, 06:29 PM   #9
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

Quote:
The "8.1 GB required" number comes from the estimated genome size (in your command line, it was 3.5 Gbp). If you are assembling a human genome, the assembled genome size will be closer to 2.7 Gbp. If you launch Minia again with the estimated_genome_size parameter equal to 2700000000, it will require less memory (around 6 GB) and I bet that it will complete without problems.
OK, I have now tried the following, but ended up with an error again:
Code:
[yaximik@SciLinux55 minia]$ minia Data_File_list 27 3 3000000000 try1
estimated values: nbits Bloom 35, nb FP 43535664, max memory 4096 MB
taille cell 16 
file Data_File_list is interpreted as a list of file names
Reading 6 read files
Available disk space in /home/yaximik/Data/Assemblies/SC/minia: 37505 MB
Sequentially counting ~19361 MB of kmers with 4 partition(s) using 1 thread(s), ~4096 MB of memory and ~3405 MB of disk space
*** glibc detected *** minia: munmap_chunk(): invalid pointer: 0x00000000020df610 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x166)[0x3f60272886]
minia[0x413479]
minia[0x415681]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3f6021d994]
minia(__gxx_personality_v0+0xd1)[0x401f79]
======= Memory map: ========
00400000-0041c000 r-xp 00000000 fd:00 2524728                            /home/yaximik/Bioinformatics/minia-1.4570/minia
0061c000-0061d000 rw-p 0001c000 fd:00 2524728                            /home/yaximik/Bioinformatics/minia-1.4570/minia
020a7000-02113000 rw-p 020a7000 00:00 0                                  [heap]
3f5f200000-3f5f21c000 r-xp 00000000 fd:00 4915201                        /lib64/ld-2.5.so
3f5f41b000-3f5f41c000 r--p 0001b000 fd:00 4915201                        /lib64/ld-2.5.so
3f5f41c000-3f5f41d000 rw-p 0001c000 fd:00 4915201                        /lib64/ld-2.5.so
3f60200000-3f6034e000 r-xp 00000000 fd:00 4915202                        /lib64/libc-2.5.so
3f6034e000-3f6054d000 ---p 0014e000 fd:00 4915202                        /lib64/libc-2.5.so
3f6054d000-3f60551000 r--p 0014d000 fd:00 4915202                        /lib64/libc-2.5.so
3f60551000-3f60552000 rw-p 00151000 fd:00 4915202                        /lib64/libc-2.5.so
3f60552000-3f60557000 rw-p 3f60552000 00:00 0 
3f60600000-3f60682000 r-xp 00000000 fd:00 4915203                        /lib64/libm-2.5.so
3f60682000-3f60881000 ---p 00082000 fd:00 4915203                        /lib64/libm-2.5.so
3f60881000-3f60882000 r--p 00081000 fd:00 4915203                        /lib64/libm-2.5.so
3f60882000-3f60883000 rw-p 00082000 fd:00 4915203                        /lib64/libm-2.5.so
3f61200000-3f61214000 r-xp 00000000 fd:00 13433875                       /usr/lib64/libz.so.1.2.3
3f61214000-3f61413000 ---p 00014000 fd:00 13433875                       /usr/lib64/libz.so.1.2.3
3f61413000-3f61414000 rw-p 00013000 fd:00 13433875                       /usr/lib64/libz.so.1.2.3
3f71400000-3f7140d000 r-xp 00000000 fd:00 4915222                        /lib64/libgcc_s-4.1.2-20080825.so.1
3f7140d000-3f7160d000 ---p 0000d000 fd:00 4915222                        /lib64/libgcc_s-4.1.2-20080825.so.1
3f7160d000-3f7160e000 rw-p 0000d000 fd:00 4915222                        /lib64/libgcc_s-4.1.2-20080825.so.1
3f72000000-3f720e6000 r-xp 00000000 fd:00 13413044                       /usr/lib64/libstdc++.so.6.0.8
3f720e6000-3f722e5000 ---p 000e6000 fd:00 13413044                       /usr/lib64/libstdc++.so.6.0.8
3f722e5000-3f722eb000 r--p 000e5000 fd:00 13413044                       /usr/lib64/libstdc++.so.6.0.8
3f722eb000-3f722ee000 rw-p 000eb000 fd:00 13413044                       /usr/lib64/libstdc++.so.6.0.8
3f722ee000-3f72300000 rw-p 3f722ee000 00:00 0 
2b2ccef59000-2b2ccef63000 rw-p 2b2ccef59000 00:00 0 
2b2ccef77000-2b2ccef7a000 rw-p 2b2ccef77000 00:00 0 
2b2ccef7b000-2b2ccef7c000 rw-p 2b2ccef7b000 00:00 0 
7fff18267000-7fff18282000 rw-p 7ffffffe3000 00:00 0                      [stack]
7fff18310000-7fff18314000 r-xp 7fff18310000 00:00 0                      [vdso]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vsyscall]
Aborted
[yaximik@SciLinux55 minia]$
Not sure if something is missing on my machine or if it is a bug in the program.
Old 12-08-2012, 02:12 PM   #10
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109
Default

We tested Minia extensively using a pre-release version (minia-1.3842.tar.gz) that Rayan sent us, and we are very happy with the results. We used it to check the assembly of a fish genome that we are currently working on.

Please note that Minia is a contig assembler designed to perform the first (and hardest) stage of assembly, namely assembling the short reads, via their de Bruijn graph, into contigs. Minia does this step very well, although I do not have any performance metrics (Rayan's paper has plenty). After Minia completes its job, you need to assemble the contigs into scaffolds using Velvet, SOAPdenovo2 or AMOS.

The best way to use more RAM than Minia needs is to run assemblies with multiple k-mer sizes in parallel, or to run many different assemblies. However, please keep a reasonable amount of disk space available to hold all the temporary files.

If you want to know how Minia works, Rayan's paper is the best source, but we also wrote a few blog posts:

http://www.homolog.us/blogs/2012/07/...ng-metagenome/

http://www.homolog.us/blogs/2012/10/...ch-revolution/
__________________
http://homolog.us
Old 12-09-2012, 07:25 AM   #11
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

It turned out that my problems were due to attempts to include small earlier datasets of longer 454 and Sanger reads. These were not supposed to be used anyway, but I wanted to see whether any of the earlier data would fit anywhere in the new assemblies, which are largely based on MiSeq data. Rayan was very helpful in addressing this issue in new updates, which I hope will be released. I have not had a chance to try the latest update yet, which should have the issue solved.

Assembly with only Illumina reads worked fine, although Newbler produced much longer contigs with smaller minimum overlap values, which is the closest thing to a k-mer size I could think of. I guess samanta's input gives some clues. It would be interesting to try passing Minia's contigs through Newbler. I am not sure Velvet will be helpful, as it seems designed largely for bacterial-size genomes, but in the initial phase with smaller datasets it may be.
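Comparisons of contig length between assemblers, such as the Newbler versus Minia observation here, are commonly summarized with N50: the contig length at which contigs of that length or longer cover at least half of the total assembled bases. Nobody in this thread reports N50 values; the sketch below is only one conventional way to put a number on "longer contigs", and the contig lengths in it are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths for two assemblies of the same data.
assembly_a = [2, 2, 2, 3, 3, 4, 8, 8]
assembly_b = [5, 5, 5, 5, 6, 6]
print(n50(assembly_a), n50(assembly_b))
```

N50 says nothing about correctness, so a higher value from one assembler is a hint to look closer rather than proof of a better assembly.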

Last edited by yaximik; 12-09-2012 at 07:28 AM.
Old 12-17-2012, 09:23 AM   #12
rayanc
Junior Member
 
Location: FR

Join Date: Sep 2011
Posts: 6
Default

Thank you Manoj (samanta) and yaximik for the positive feedback.

I could add that a good scaffolder to use after Minia is SSPACE. In my experience, it is slightly easier to run than SOAPdenovo's scaffolder.

Also, if you are doing bacterial assembly on projects which include 454 reads, I expect that Newbler would do a better job than Minia in terms of contiguity.

The updated version of Minia that yaximik was talking about has been released. It fixes the bug reported earlier in this thread, which occurred when very long reads were used. If you have enough Illumina coverage, I still don't recommend including 454 reads in a Minia assembly, though.
Old 02-12-2013, 08:12 PM   #13
yaximik
Senior Member
 
Location: Oregon

Join Date: Apr 2011
Posts: 205
Default

I wonder if assembly with minia can benefit from GNU parallel?
Old 02-13-2013, 08:03 AM   #14
rchikhi
Member
 
Location: France

Join Date: Jan 2013
Posts: 13
Default

Yes, you can run a couple of Minia instances at the same time (using `parallel` or just a few terminal windows).

Given that the first step of Minia becomes limited by your hard drive speed, I would not expect significant performance improvement if you run more than 3-5 instances of Minia at the same time on a desktop computer.
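A minimal sketch of capping the number of simultaneous runs, in the same spirit as `parallel -j 3`. It uses a thread pool in which each worker blocks on one external process, so at most `max_parallel` processes run at a time. The demo commands are harmless stand-ins that just exit with a given code; in practice each entry would be a Minia command line. The function name and the default of 3 are illustrative choices.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_parallel=3):
    """Run commands with at most max_parallel at once; return exit codes in order."""

    def run_one(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


# Stand-in commands: each one starts Python and exits with the given code.
demo = [
    [sys.executable, "-c", f"import sys; sys.exit({code})"]
    for code in (0, 0, 1, 0)
]
print(run_all(demo))
```

Keeping the cap low matters more than the mechanism: once the disk is saturated, extra instances mostly wait on each other.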
Old 04-01-2013, 02:29 AM   #15
sivasubramani
Member
 
Location: India

Join Date: Apr 2011
Posts: 14
Default

Hello Rayanc,

I appreciate the work you have done on Minia. Could you mention which platforms Minia can be used with (Illumina, SOLiD, etc.)? Also, can it handle color-space data?

Thanks
Old 04-01-2013, 06:31 AM   #16
rchikhi
Member
 
Location: France

Join Date: Jan 2013
Posts: 13
Default

Hi, Minia has been tested exclusively on Illumina data. It does not handle color-space reads. If you have color-space, you could always convert the reads to FASTA and feed them to Minia (but then the color-space redundancy will not be used).