Old 10-10-2010, 11:55 PM   #1
Alex8
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 10
Pre-assembly for short reads to minimize RAM usage

Hello everybody!

I'm looking to assemble de novo ~1-5 Gb of short reads from a next-generation sequencer. The data are metagenomic, with hundreds of species present. The amount of RAM required by assembly programs (Velvet, SOAPdenovo, etc.) for such an analysis is a few hundred GB. Is there a known way to cluster the initial reads into related portions, so that assembly can be performed portion by portion and peak RAM usage is reduced?

Thanks ahead,
Alex
Old 11-02-2010, 09:13 PM   #2
LHT
Junior Member
 
Location: Malaysia

Join Date: May 2010
Posts: 3

Yes, this is exactly the question I have in mind.

I have around 400 million 36 bp paired-end reads. I am in the process of trying to assemble them with Velvet, but I was wondering whether the input is too large and a pre-clustering step is needed.

If so, what type of clustering approach?

thanks
Old 11-03-2010, 12:51 AM   #3
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50

Hi.
I think that the clustering would have to be done before sequencing (by selecting specific regions of the genome with enzymes, for example), with each small data set then assembled independently.

The only way to reduce the amount of memory needed is to perform an error-correction step. The problem is that the error-correction step may require more RAM than the de novo assembly itself.
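
To make that memory trade-off concrete, here is a minimal Python sketch of k-mer-spectrum read filtering (a simplification; real correctors such as Quake or SOAPdenovo's corrector actually correct bases rather than just discarding reads, and the file names, k and threshold below are placeholders). Note that the k-mer count table is itself held in RAM, which is exactly the trade-off described above.

[CODE]
#!/usr/bin/env python
# Sketch: drop reads whose k-mers are all rare, on the assumption that such
# reads are error-dominated and only inflate the de Bruijn graph.
import sys
from collections import Counter

K = 21          # assumed k-mer size
MIN_COUNT = 3   # k-mers seen fewer times than this are treated as likely errors

def fastq_records(path):
    """Yield one FASTQ record (4 lines) at a time."""
    with open(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:
                break
            yield rec

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

def main(fastq):
    counts = Counter()
    for rec in fastq_records(fastq):             # pass 1: count every k-mer (RAM-hungry)
        counts.update(kmers(rec[1].strip()))
    kept = 0
    with open("filtered.fastq", "w") as out:     # pass 2: keep reads with a "solid" k-mer
        for rec in fastq_records(fastq):
            if any(counts[k] >= MIN_COUNT for k in kmers(rec[1].strip())):
                out.writelines(rec)
                kept += 1
    sys.stderr.write("kept %d reads\n" % kept)

if __name__ == "__main__":
    main(sys.argv[1])
[/CODE]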

Francesco
Old 11-03-2010, 02:57 AM   #4
Alex8
Member
 
Location: Europe

Join Date: Oct 2010
Posts: 10

I came across the following discussion:
http://listserver.ebi.ac.uk/pipermai...er/001156.html
The idea is to pre-cluster k-mers into non-overlapping de Bruijn subgraphs and assemble each subgraph separately (with lower memory requirements), then combine the results.
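
For what it's worth, here is a very rough Python sketch of that partitioning logic. This is not Curtain itself, just an illustration: reads that share a k-mer end up in the same partition, approximating disconnected de Bruijn subgraphs. The union-find table holds every k-mer in RAM, so it only makes sense on small FASTA input; the k value and output naming are placeholders.

[CODE]
#!/usr/bin/env python
# Sketch: partition reads into groups connected by shared k-mers (union-find).
import sys

K = 21       # k-mer size; would normally match the assembler's hash length
parent = {}  # union-find over k-mers

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

def kmers(seq):
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]

def main(fasta):
    # Pass 1: all k-mers occurring in the same read join one component.
    for _, seq in read_fasta(fasta):
        ks = kmers(seq)
        for a, b in zip(ks, ks[1:]):
            union(a, b)
    # Pass 2: write each read to the file of its component.
    handles = {}
    for header, seq in read_fasta(fasta):
        ks = kmers(seq)
        comp = find(ks[0]) if ks else "unplaced"
        if comp not in handles:
            handles[comp] = open("partition_%d.fa" % len(handles), "w")
        handles[comp].write(">%s\n%s\n" % (header, seq))
    for fh in handles.values():
        fh.close()

if __name__ == "__main__":
    main(sys.argv[1])
[/CODE]

Each partition_*.fa could then be fed to Velvet or SOAPdenovo on its own and the resulting contigs combined afterwards.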
Old 11-03-2010, 06:30 PM   #5
leeht
Junior Member
 
Location: Malaysia

Join Date: Aug 2010
Posts: 2

Thanks Alex8, that is quite an interesting discussion; it seems like Curtain is worth looking at.

Dear Francesco, I came across your post in this discussion thread:

"...The trick usually is to work with a subset of 10% of the reads. Make multiple assemblies of several random subsets and then merge the results together."

Can you please explain more about the "random subsets"? Say we assemble 10% of our reads at a time; am I correct that we will end up with 10 separate sub-assembly results for merging/scaffolding? Or are the subsets supposed to be random, so that the same read can exist in more than one subset?

thanks!
Old 11-04-2010, 12:23 AM   #6
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50

That post is quite old; the approach was useful because at the time there was a lack of software able to assemble more than one lane.

The idea was to PARTITION (here is your point) the data into 10 or fewer independent subsets and to assemble each subset independently. This was, and still is, meaningful when the coverage is very high: if a microbe is sequenced at an expected coverage of 800X, then this approach is useful.
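
As a concrete (if simplistic) illustration, here is a Python sketch that splits an interleaved paired-end FASTQ into N non-overlapping subsets; the round-robin assignment keeps the subsets equal-sized and keeps mates together, and the file naming is just a placeholder. With 800X coverage and N = 10, each subset is still ~80X.

[CODE]
#!/usr/bin/env python
# Sketch: split an interleaved paired-end FASTQ into N disjoint subsets
# so each can be assembled independently with a lower RAM peak.
import sys

def split_fastq(path, n_subsets=10):
    out = [open("subset_%02d.fastq" % i, "w") for i in range(n_subsets)]
    with open(path) as fh:
        pair_index = 0
        while True:
            pair = [fh.readline() for _ in range(8)]  # 2 mates x 4 FASTQ lines
            if not pair[0]:
                break
            out[pair_index % n_subsets].writelines(pair)
            pair_index += 1
    for handle in out:
        handle.close()

if __name__ == "__main__":
    split_fastq(sys.argv[1], int(sys.argv[2]) if len(sys.argv) > 2 else 10)
[/CODE]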

Francesco

Old 11-05-2010, 06:58 AM   #7
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203

I think if yours is a metagenomic sample, your RAM requirement is likely to be large, and I am guessing there will be low coverage per species/contig.

If you can already cluster the reads by k-mers, then you can do mini-assemblies with any program (see the sketch below).

Have a look at SoftGenetics' NextGENe to do the clustering. It looks like something useful, but I can't comment much as I have limited experience with it.
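
Once the reads are clustered (by whatever tool), the mini-assembly step itself is mechanical. Below is a hedged Python sketch that runs Velvet on each cluster in turn, so peak RAM is bounded by the largest cluster; the cluster_*.fastq naming, hash length, and Velvet options are assumptions to be tuned for real data. Merging the per-cluster contigs afterwards (e.g. with minimus2, or a scaffolding pass) is not shown.

[CODE]
#!/usr/bin/env python
# Sketch: assemble each read cluster separately with velveth/velvetg.
import glob
import subprocess

HASH_LENGTH = 31  # assumed k-mer length; pick per read length and coverage

def assemble_cluster(fastq, outdir):
    # Build the hash table for this cluster only, then run the assembler.
    subprocess.check_call(["velveth", outdir, str(HASH_LENGTH),
                           "-fastq", "-short", fastq])
    subprocess.check_call(["velvetg", outdir, "-min_contig_lgth", "100"])

if __name__ == "__main__":
    for i, cluster in enumerate(sorted(glob.glob("cluster_*.fastq"))):
        assemble_cluster(cluster, "asm_cluster_%03d" % i)
[/CODE]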