
**LHT** · 11-02-2010, 08:13 PM

Yes, this is exactly the question I have in mind.

I have around 400 million 36 bp paired-end reads. I am trying to assemble them with Velvet, but I wonder whether the input is too large and a pre-clustering step is needed.

If so, what type of clustering approach?

Thanks

**francesco.vezzi** · 11-02-2010, 11:51 PM

Hi.
I think the clustering must be done before sequencing (for example, by selecting specific regions of the genome with enzymes), with each small data set then assembled independently.

The only way to reduce the amount of memory needed is to perform an error correction step. The problem is that the error correction step may require more RAM than the de novo assembly itself.

Francesco

**Alex8** · 11-03-2010, 01:57 AM

I came upon the following discussion:

EBI-EMBL Mailman list

http://listserver.ebi.ac.uk/pipermail/velvet-users/2010-October/001156.html

The idea is to pre-cluster k-mers into non-overlapping de Bruijn subgraphs, assemble each subgraph separately (so that each run needs less memory), and then combine the results.
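A minimal sketch of the pre-clustering idea, not the method from the linked discussion: reads that share a k-mer, directly or through a chain of other reads, are put in the same group with a union-find structure, which approximates the connected components of the k-mer graph. The names `cluster_reads` and `UnionFind` are made up for illustration. The sketch ignores reverse complements and sequencing errors, and the k-mer table it builds also costs memory, so treat it as an illustration on toy data rather than something to run on hundreds of millions of reads.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class UnionFind:
    """Disjoint-set structure over read indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_reads(reads, k):
    """Group reads linked, directly or transitively, by a shared k-mer."""
    uf = UnionFind(len(reads))
    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                uf.union(first_seen[km], idx)
            else:
                first_seen[km] = idx
    clusters = defaultdict(list)
    for idx, read in enumerate(reads):
        clusters[uf.find(idx)].append(read)
    return list(clusters.values())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA"]
    for group in cluster_reads(toy_reads, k=4):
        print(group)
```

Each resulting group could then be written to its own file and assembled separately. In real data, repeats and errors tend to join most reads into one large component, so the memory saving depends heavily on the sample.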

**leeht** · 11-03-2010, 05:30 PM

Thanks Alex8, that is quite an interesting discussion; it seems Curtain is worth looking at.

Dear Francesco, I came across your post in this discussion thread.

"...The trick usually is to work with a subset of 10% of the reads. Make multiple assemblyes of several random subsets and then merge toghether the results."

Can you please explain more about "random subsets"? Say we assemble 10% of our reads at a time: am I correct that we will end up with 10 separate sub-assemblies to merge and scaffold? Or are the subsets supposed to be random, so that the same read can appear in more than one subset?

thanks!

**francesco.vezzi** · 11-03-2010, 11:23 PM

The post is quite old, and this approach was useful due to the lack of software able to assemble more than one lane.

The idea was to PARTITION (this is your point) the data into 10 or fewer independent subsets and assemble each subset independently. This was, and still is, meaningful when the coverage is very high. If a microbe is sequenced at an expected coverage of 800X, then this approach is useful.
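The partitioning described here can be sketched as follows. This is only an illustration under stated assumptions: `partition_pairs` and `coverage_per_subset` are hypothetical helper names, the read pairs are assumed to already be loaded as tuples, and each pair is assigned to exactly one subset so that no read appears twice and mates stay together. With 800X total coverage and 10 subsets, each subset is expected to keep about 80X.

```python
import random


def partition_pairs(pairs, n_subsets, seed=0):
    """Assign each read pair to exactly one of n_subsets groups at random."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    subsets = [[] for _ in range(n_subsets)]
    for pair in pairs:
        subsets[rng.randrange(n_subsets)].append(pair)
    return subsets


def coverage_per_subset(total_coverage, n_subsets):
    """Expected coverage left in each subset after an even random split."""
    return total_coverage / n_subsets


if __name__ == "__main__":
    # Toy pairs standing in for (mate 1, mate 2) records.
    pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
    subsets = partition_pairs(pairs, 10)
    print([len(s) for s in subsets])
    print(coverage_per_subset(800, 10))
```

For real data the pairs would be streamed from the input files and written straight to one output file per subset instead of being held in lists.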

Francesco


**KevinLam** · 11-05-2010, 05:58 AM

Originally posted by Alex8

Hello everybody!

I'm planning to assemble de novo ~1-5 Gb of short reads from a next-generation sequencer. The data is metagenomic, with hundreds of species. The amount of RAM required by an assembly program (Velvet, SOAPdenovo, etc.) for such an analysis is a few hundred GB. Is there a known way to cluster the initial reads into related groups, so that the assembly is performed group by group and peak RAM usage is decreased?

Thanks ahead,
Alex

I think that if yours is a metagenomic sample, your RAM requirement is likely to be large, and I am guessing there will be low coverage per species/contig.

If you can already cluster the reads by k-mers, then you can do mini-assemblies with any program.

Have a look at SoftGenetics' NextGENe to do the clustering. It looks useful, but I can't comment much as I have limited experience with it.


Pre-assembly for short-reads to minimize RAM usage
