SEQanswers





Old 08-16-2012, 02:23 AM   #1
kenietz
Member
 
Location: Singapore

Join Date: Nov 2011
Posts: 85
question about de novo assembly

Hi guys,
I have a question regarding de novo assembly.

Firstly, some info on my machine:
64-bit Slackware Linux
64 GB RAM
i7-3930K

For small genomes there is no problem. But what about 3 Gb genomes? How can one handle such a task given the power available on this machine?

I once had an RNA-seq dataset of 70M paired-end Illumina reads which I tried to assemble with SOAPdenovo-Trans, but the program crashed after loading 100M reads. So I am concerned that for a 3 Gb genome at 10x coverage or more I will have a huge number of reads which I won't be able to handle with SOAPdenovo. I saw that the pregraph step has a '-a' option which supposedly restricts memory usage, but I have a feeling I will still run into problems.

Then I had the idea of splitting the raw reads into smaller files, running mini de novo assemblies on the splits, and then merging them somehow. But so far I could not find software that would let me do that.

So I would like to ask: is there software that allows the procedure described above? Or is there some other strategy for tackling this problem? Or do I just need a machine with lots of RAM?

Thank you for your time and any help!
Old 08-16-2012, 03:55 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Try Titus Brown's Diginorm (digital normalization) on your sample before you run SOAPdenovo. It should reduce the number of reads without reducing the complexity of the sample. I do not have a reference handy, but a Google search or a look through this forum should bring up a link.
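
The core idea is easy to sketch: keep a read only if the median abundance of its k-mers, among the reads kept so far, is below a coverage cutoff, so redundant high-coverage reads get discarded while low-coverage regions are retained. A toy illustration of that logic in Python (the real implementation, in the khmer package, uses a fixed-memory counting structure rather than an exact dict; the k and cutoff values here are just placeholders):

Code:
from collections import defaultdict

def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers,
    among reads kept so far, is still below the coverage cutoff."""
    counts = defaultdict(int)   # khmer uses a fixed-memory counting sketch instead
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        med = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if med < cutoff:        # this region is not yet saturated; keep the read
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept

reads = ["ACGTACGTACGTACGTACGTACGT"] * 100   # 100 identical reads
print(len(normalize_by_median(reads)))       # 20: the redundancy is discarded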
Old 08-16-2012, 03:55 AM   #3
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415

An efficient de novo assembly algorithm such as this might be of interest:

http://www.ncbi.nlm.nih.gov/pubmed/22156294

Otherwise, you can't beat more RAM. Perhaps a high-RAM cloud service such as BGI's might be the cost-effective solution, since you're not going to be doing this every day.
Old 08-16-2012, 06:45 AM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Does BGI actually provide large-memory machines? I don't use them, and I just tried looking through their offerings but could not find information on memory limits. I know that Amazon's EC2 cloud, admittedly not bioinformatics-oriented, only goes up to 64 GB.
Old 08-16-2012, 07:50 AM   #5
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415

Re BGI, I'm not sure whether they offer it commercially either, but they do have the resources and plenty of experience assembling 3 Gb genomes from Illumina data.

They do talk about de novo assembly with Hecate here:
https://cloud.genomics.cn/index.php/...duct_introduce
Old 08-16-2012, 07:10 PM   #6
kenietz
Member
 
Location: Singapore

Join Date: Nov 2011
Posts: 85

Hi,
Thank you all for the prompt replies.

I downloaded and compiled SGA and will give it a try. It seems like it might do the job.

I will also try Diginorm. It might help with other de novo projects as well.

The last option is to talk to the boss about buying a computer with 192 GB RAM. That would do the job for sure.

Thanks again,
Cheers
Old 08-16-2012, 07:33 PM   #7
DFJ111
Member
 
Location: Auckland

Join Date: Aug 2012
Posts: 20

Gossamer:

http://www.ncbi.nlm.nih.gov/pubmed/22611131

claims to be close to the theoretical lower limit on memory usage for de Bruijn graph de novo assemblers.
Old 08-17-2012, 02:57 AM   #8
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
Originally Posted by kenietz View Post
Hi,
Thank you all for the prompt replies.

I downloaded and compiled SGA and will give it a try. It seems like it might do the job.

I will also try Diginorm. It might help with other de novo projects as well.

The last option is to talk to the boss about buying a computer with 192 GB RAM. That would do the job for sure.

Thanks again,
Cheers
Quality trimming your data will drastically reduce the memory usage of de Bruijn graph assemblers. Unfortunately, 192 GB of RAM is nowhere close to what you need to assemble a 3 Gb genome unless you use a string graph assembler like SGA or Readjoiner. The trade-off is running time, though Readjoiner is very fast compared to SGA (but it does not produce the same quality assemblies as de Bruijn graph assemblers, in my experience). Regardless, I'm not sure you have the resources to assemble a 3 Gb genome based on your first post. What is your genome coverage?
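
The reason trimming helps so much is that low-quality tails are full of miscalls, and every miscall spawns up to k spurious k-mers in the graph. A minimal 3'-end trimmer just to illustrate the idea (assumes Phred+33 FASTQ qualities; real trimmers also do sliding windows and adapter removal):

Code:
def trim_3prime(seq, qual, min_q=20):
    """Trim the 3' end of a read back to the last base
    whose Phred quality is >= min_q (Phred+33 encoding)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Example: the low-quality tail ('#' = Q2) is removed.
print(trim_3prime("ACGTACGTAC", "IIIIIII###"))  # ('ACGTACG', 'IIIIIII')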
Old 08-20-2012, 06:37 PM   #9
kenietz
Member
 
Location: Singapore

Join Date: Nov 2011
Posts: 85

@SES:
Thank you for the information. The client wants to try 10x at first and then proceed to higher coverage. Yeah, I get that SGA would probably be able to do the job. Now I am reading about Readjoiner. I'm still considering whether to take the job at all.

Btw, what kind of power would I really need to assemble a 3 Gb genome?
Old 08-20-2012, 06:54 PM   #10
DFJ111
Member
 
Location: Auckland

Join Date: Aug 2012
Posts: 20

If by "power" you mean "memory", this thread might be relevant:

http://seqanswers.com/forums/showthread.php?t=2101

Talks about memory requirement for velvet, which is pretty memory-hungry. So if you can do it with velvet you could probably do it with any de-bruijn based assembly program (like gossamer that I mentioned above). Some programs are based on other methods (e.g. overlap-consensus) and I am not sure how to calculate memory requirements, although I know MIRA has a memory-requirement estimation program that comes with it.

If by power you mean processor speed, this is usually not the limiting factor in my experience.
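
On the memory side, a crude way to reason about any de Bruijn graph assembler: it holds roughly one hash entry per distinct k-mer, and sequencing errors add spurious k-mers on top of the genomic ones. A toy estimator along those lines (the per-entry byte cost and error rate below are guesses for illustration, not Velvet's actual numbers):

Code:
def est_debruijn_ram_gb(genome_bp, coverage, read_len=100, k=31,
                        err_rate=0.01, bytes_per_kmer=50):
    """Very rough RAM estimate for a de Bruijn graph assembler.

    Distinct k-mers ~= genomic k-mers + error k-mers:
    each sequencing error creates up to k novel k-mers.
    bytes_per_kmer is a guess at the hash-table cost per node.
    """
    n_reads = genome_bp * coverage / read_len
    genomic_kmers = genome_bp                      # ~1 distinct k-mer per base
    error_kmers = n_reads * read_len * err_rate * k
    return (genomic_kmers + error_kmers) * bytes_per_kmer / 1e9

# 3 Gb genome at 10x: the error k-mers dominate the graph.
print(round(est_debruijn_ram_gb(3e9, 10)))  # ~615 (GB), give or take a lot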
Old 08-20-2012, 07:15 PM   #11
kenietz
Member
 
Location: Singapore

Join Date: Nov 2011
Posts: 85

Hi DFJ111,
thanks for the info. By power I meant mainly memory. Yeah, MIRA is a pretty good program, but it requires a lot of memory when working with Illumina reads: something like 1-1.5 GB per million reads (a quick back-of-the-envelope below).

So for now, if I take the job, I should try SGA or Readjoiner, or find a cluster. Btw, I'm not sure, but do most assemblers run on clusters? I have never used an assembler on a cluster yet.
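
To put that figure in context for this project (assuming 100 bp reads, which is a guess, and the 1-1.5 GB per million reads above):

Code:
genome_bp = 3e9          # 3 Gb genome
coverage = 10            # the client's starting point
read_len = 100           # assumed Illumina read length

millions = genome_bp * coverage / read_len / 1e6
low, high = millions * 1.0, millions * 1.5
print("%dM reads -> %d-%d GB RAM for MIRA" % (millions, low, high))
# 300M reads -> 300-450 GB RAM for MIRA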
Old 08-21-2012, 06:15 AM   #12
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Quote:
Originally Posted by kenietz View Post
Btw, I'm not sure, but do most assemblers run on clusters? I have never used an assembler on a cluster yet.
Most? I am not sure about that. Practically speaking, you only need one or two good cluster-aware assemblers, so who really cares about the others?

Velvet is not, as far as I know, cluster-aware. ABySS is cluster-aware. Not sure about SGA, etc.
Old 08-22-2012, 05:21 AM   #13
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
Originally Posted by kenietz View Post
@SES:
Thank you for the information. The client wants to try 10x at first and then proceed to higher coverage. Yeah, I get that SGA would probably be able to do the job. Now I am reading about Readjoiner. I'm still considering whether to take the job at all.

Btw, what kind of power would I really need to assemble a 3 Gb genome?
With 10x coverage you will likely not get an "assembly." With coverage that low you will just be clustering reads, and then you'll find the "assembly" is far shorter than what you expected. If you already have a reference, this approach makes sense, but not if this will be the reference.

If you have sufficient coverage and a mixture of 454 and Illumina, then you will need as much memory as you can get access to. The Broad reports that AllPaths uses 1.7 bytes of memory per read base, so that can serve as a rough guide. It suggests that 512 GB should be sufficient to assemble a 3 Gb genome, though I don't think that holds for large plant genomes. I have seen a number of talks in the last year where people (colleagues included) are doing assemblies of genomes >3 Gb on machines with 1 TB of memory. Of course, this is all highly dependent on the amount and type of data you have, as well as the unique properties (i.e., repeat structure) of your species.
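
To show where that 512 GB ballpark comes from, a quick check against the 1.7 bytes per read base figure (the ~100x coverage used below is an assumption about what "sufficient coverage" means, not a number from the Broad):

Code:
def allpaths_ram_gb(genome_bp, coverage, bytes_per_base=1.7):
    """RAM estimate from the reported 1.7 bytes per read base."""
    return genome_bp * coverage * bytes_per_base / 1e9

print(allpaths_ram_gb(3e9, 10))    # ~51 GB at the 10x under discussion
print(allpaths_ram_gb(3e9, 100))   # ~510 GB at ~100x, i.e. the 512 GB ballpark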
Old 08-22-2012, 05:41 AM   #14
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275

Quote:
Originally Posted by westerman View Post
Most? I am not sure about that. Practically speaking you only need one or two good cluster-aware assemblers so who really cares about the others?

Velvet is not, as far as I know, cluster-aware. ABySS is cluster-aware. Not sure about SGA, etc.
Ray will use multiple processors, though I could never get Ray to produce assemblies comparable to those from Velvet and SOAP.
Old 08-24-2012, 02:23 PM   #15
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by kenietz View Post
Hi guys,
I have a question regarding de novo assembly.

Firstly, some info on my machine:
64-bit Slackware Linux
64 GB RAM
i7-3930K

For small genomes there is no problem. But what about 3 Gb genomes? How can one handle such a task given the power available on this machine?

I once had an RNA-seq dataset of 70M paired-end Illumina reads which I tried to assemble with SOAPdenovo-Trans, but the program crashed after loading 100M reads. So I am concerned that for a 3 Gb genome at 10x coverage or more I will have a huge number of reads which I won't be able to handle with SOAPdenovo. I saw that the pregraph step has a '-a' option which supposedly restricts memory usage, but I have a feeling I will still run into problems.

Then I had the idea of splitting the raw reads into smaller files, running mini de novo assemblies on the splits, and then merging them somehow. But so far I could not find software that would let me do that.

So I would like to ask: is there software that allows the procedure described above? Or is there some other strategy for tackling this problem? Or do I just need a machine with lots of RAM?

Thank you for your time and any help!

The answer to your question depends on whether you are assembling a genome or a transcriptome. Could you please clarify that?
__________________
http://homolog.us
Old 08-26-2012, 06:27 PM   #16
kenietz
Member
 
Location: Singapore

Join Date: Nov 2011
Posts: 85

Genome assembly.
Old 08-27-2012, 02:14 AM   #17
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by kenietz View Post
Genome assembly.
Please check out the Minia program discussed here. You can assemble a 3 Gb genome using about 6-8 GB of RAM.

http://www.homolog.us/blogs/2012/07/...ng-metagenome/

You can also check the slides posted here:

http://www.homolog.us/blogs/2012/08/...-rayan-chikhi/

If you would like to split the reads into parts, the paper by Titus Brown in the first link should help you.

Please email me (samanta at homolog.us) if you need more explanation of the algorithms, because I do not check the forum frequently. The state of the art is far ahead of Velvet on a 512 GB RAM machine, etc.
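
The trick behind that tiny footprint is that Minia stores the k-mer set in a Bloom filter: a bit array plus a few hash functions, costing a few bits per k-mer instead of a full hash-table entry, at the price of occasional false positives (which Minia handles with a separate structure). A toy version, with made-up sizing parameters:

Code:
import hashlib

class BloomFilter:
    """Toy Bloom filter for k-mer membership: ~bits_per_item bits
    per k-mer instead of a full hash-table entry."""
    def __init__(self, n_items, bits_per_item=16, n_hashes=3):
        self.size = n_items * bits_per_item
        self.n_hashes = n_hashes
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, kmer):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(kmer.encode(), salt=bytes([i]) * 16).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(kmer))

bf = BloomFilter(n_items=1000)
bf.add("ACGTACGTACGTACGTACGTACGTACGTACG")
print("ACGTACGTACGTACGTACGTACGTACGTACG" in bf)  # True
print("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT" in bf)  # False (with high probability)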
__________________
http://homolog.us
Old 08-27-2012, 02:54 AM   #18
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498

If I classify the reads into different chromosomes using BWA, can I assemble the chromosomes de novo on a 64 GB machine?
Old 08-27-2012, 07:58 AM   #19
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by ymc View Post
If I classify the reads into different chromosomes using BWA, can I assemble the chromosomes de novo on a 64 GB machine?
Interesting question.

i) For the kind of de novo assembly we are talking about, the chromosome sequences are not known. If they were known, why would you need de novo assembly in the first place?

ii) Where chromosomes do exist and you are trying to do reassembly, yes, it is possible to reduce the RAM requirement by partitioning the reads. Remember, though, that the RAM requirement for error-free reads is capped no matter how many reads you have, whereas in a world with errors it goes up linearly with the number of reads (see the small simulation at the end of this post).

http://www.homolog.us/blogs/2011/08/...bruijn-graphs/

iii) If you are trying to do reassembly of the human genome using BWA, you are most likely interested in the parts of the chromosomes with indels, etc. Unfortunately, BWA may not be able to capture the reads for those regions and assign them to a reference chromosome.
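
Here is a small simulation of point (ii), under toy assumptions (random 100 kb "genome", 31-mers, 1% uniform substitution errors): with perfect reads the distinct k-mer count plateaus near the genome size, while with errors it keeps climbing as reads are added.

Code:
import random
random.seed(1)

genome = "".join(random.choice("ACGT") for _ in range(100_000))
k, read_len = 31, 100

def sample_read(err_rate):
    start = random.randrange(len(genome) - read_len)
    read = list(genome[start:start + read_len])
    for i in range(read_len):
        if random.random() < err_rate:   # substitution error at this base
            read[i] = random.choice("ACGT".replace(read[i], ""))
    return "".join(read)

for err in (0.0, 0.01):
    kmers, n_reads = set(), 0
    for target in (10_000, 20_000, 40_000):
        while n_reads < target:
            read = sample_read(err)
            kmers.update(read[i:i + k] for i in range(read_len - k + 1))
            n_reads += 1
        print("err=%.0f%% reads=%d distinct k-mers=%d"
              % (err * 100, n_reads, len(kmers)))
# err=0%: the count plateaus near the ~100,000 genomic k-mers
# err=1%: the count keeps climbing with every extra batch of reads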
__________________
http://homolog.us
Old 08-27-2012, 08:00 AM   #20
samanta
Senior Member
 
Location: Seattle

Join Date: Feb 2010
Posts: 109

Quote:
Originally Posted by kenietz View Post
@SES:
Thank you for the information. The client wants to try 10x at first and then proceed to higher coverage. Yeah, I get that SGA would probably be able to do the job. Now I am reading about Readjoiner. I'm still considering whether to take the job at all.

Btw, what kind of power would I really need to assemble a 3 Gb genome?
You can also request SOAPdenovo2 from BGI. Its RAM requirement is much lower than SOAPdenovo's, especially when you use the k-mer skipping option.
__________________
http://homolog.us