**shal** · 05-02-2012, 02:29 AM

De novo genome assembly: beginner

Hi folks!!

I too am completely new to NGS, and I am struggling with a question:

What decides the amount of data needed for a de novo genome assembly of a particular organism? What decides the insert size during library construction?

How do I decide on important parameters such as coverage, size, accuracy, and sensitivity; library type (fragment or mate-paired?); and read length?

Can anyone please help me?

Many thanks in advance!
shal

**Linnea** · 05-02-2012, 06:38 AM

Hi shal,

It is very hard to say in advance how much data you will need for a de novo assembly of a particular organism. It depends not only on the genomic content (especially repeats, polymorphism, and GC content) and the assembly software you're using, but also on the properties of the reads you get (for example, their quality and distribution).

I have assembled a 1 Gb genome and got the best result when I used just a subset of my data (less than 30X coverage); for others it works better with 50X or 80X. You can always start with a smaller amount (but probably never below 20X) and then sequence more if you are unsatisfied with the results.

Also, once you have decided what coverage you would like to have, I would sequence at least 1.5 times more (or even 2 times more), since you will lose some in the filtering steps (some reads will be duplicates, some will be of too poor quality, etc.).
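To make the arithmetic concrete, here is a rough sketch of how the target coverage and the 1.5x to 2x margin translate into a number of reads to order. It assumes the usual approximation coverage = (number of reads x read length) / genome size; the function name `reads_needed` and its parameters are made up for this illustration and do not come from any particular tool.

```python
import math


def reads_needed(genome_size_bp, target_coverage, read_length_bp,
                 safety_factor=1.5, paired=True):
    """Estimate how many reads (or read pairs) to sequence.

    Hypothetical helper for illustration only. It uses the approximation
    coverage = reads * read_length / genome_size and multiplies the target
    by a safety factor to allow for reads lost in filtering.
    """
    # Total bases to sequence, including the margin for filtered reads.
    total_bases = genome_size_bp * target_coverage * safety_factor
    reads = math.ceil(total_bases / read_length_bp)
    # With paired-end or mate-pair data, each pair contributes two reads.
    return math.ceil(reads / 2) if paired else reads


# Example: 1 Gb genome, 30X wanted after filtering, 100 bp reads, 1.5x margin.
pairs = reads_needed(1_000_000_000, 30, 100, safety_factor=1.5)
print(f"{pairs:,} read pairs")  # 225,000,000 read pairs
```

This only gives a starting estimate; how much is actually lost in filtering depends on the library and the run.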

For the insert size, you should preferably have a mix of short and long libraries. The shorter paired-end libraries (insert size <1000 bp) are used for building contigs, and the longer mate-pair libraries are used for joining the contigs into scaffolds. For mate pairs I would say "the longer the better": a longer insert size in mate-pair libraries will usually give you a larger N50 for the assembly. But it's usually the cost that sets the limit... Note that some assemblers (like Allpaths-LG) have specific recommendations for setting up the libraries.
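In case N50 is unfamiliar: it is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation is below; the function name `n50` and the toy lengths are invented for illustration and are not real assembly results.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Illustrative helper: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths (made up).
contigs = [100, 80, 50, 30, 20, 10]
# Suppose mate pairs join 100+80 and 50+30 into two scaffolds (gaps ignored).
scaffolds = [180, 80, 20, 10]

print(n50(contigs))    # 80
print(n50(scaffolds))  # 180
```

The total number of bases is the same in both lists; only the joining changes, which is why scaffolding with long-insert libraries raises N50.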

Paired-end/mate-pair reads are better than single-end reads for de novo assembly. With Illumina (I suppose you intend to use this since you chose this forum) the read length isn't very variable; reads go up to ~150 bp. Most of our libraries were 100 bp (which worked fine). When we tried longer reads, the read quality seemed much poorer in the last 50 bp, so we ended up trimming them anyway.

I'm not sure if I understood your questions regarding sensitivity and accuracy (in reads or assembly?), but hope this helps a bit!

Good luck!

**shal** · 05-03-2012, 01:13 AM

Dear Linnea,

Thank you so much for sharing your knowledge and experience. Your reply was helpful for me and it answered my queries.

Thanks again
Shal

**Meligethes** · 05-17-2012, 05:31 PM

Hi! Me again.

I was just wondering about all the possible clues that are used for scaffolding. I know we can map contigs against reference genomes or use long paired-end reads, but are there other ways to find things like orientation, order, and distances?
