**jimmybee** · 07-23-2012, 02:44 PM

How repetitive is it? How close is the reference genome? Is it highly polymorphic? Diploid/tetraploid/hexaploid?

**mabentley86** · 07-24-2012, 08:03 AM

Hi Jimmy,

Thanks for replying. I'm working with Lepidopteran genomes (butterflies and moths), which for current ref genomes are reported to have fairly high repetitive content. The closest ref genomes are Manduca, Heliconius, Bombyx and Monarch. When I blastn conserved genes against these, the best similarity hits are ~90%, the worst are ~65%. The genomes are diploid.

My current strategy is to first filter raw paired-end reads for low quality/adapters and then go straight into assembly with SOAPdenovo (k-mer size 47). I'm running this on a cluster, so will hopefully have some indication of its success in the next day or two. Any further comments welcomed.

Best,

Michael
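The read-filtering step described above can be sketched roughly as follows. This is only an illustration, not the poster's actual pipeline: the function names, the thresholds (`min_mean_q`, `min_len`), the Phred+33 quality encoding, and the simple "adapter substring" check are all assumptions. Older Illumina data may use Phred+64, in which case `offset` would need to change.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality_line]


def passes_filter(seq, qual, min_mean_q=20, min_len=50, adapter=None):
    """Return True if a read is long enough, adapter-free and of good mean quality.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if len(seq) < min_len:
        return False
    if adapter and adapter in seq:
        return False
    return mean(phred_scores(qual)) >= min_mean_q


def filter_pairs(pairs, **kwargs):
    """Keep a read pair only when both mates pass, so pairing stays intact."""
    return [
        (r1, r2)
        for r1, r2 in pairs
        if passes_filter(*r1, **kwargs) and passes_filter(*r2, **kwargs)
    ]


if __name__ == "__main__":
    good = ("ACGTACGT", "IIIIIIII")   # 'I' decodes to Q40 under Phred+33
    bad = ("ACGTACGT", "########")    # '#' decodes to Q2 under Phred+33
    kept = filter_pairs([(good, good), (good, bad)], min_len=4)
    print(len(kept), "pair(s) kept")
```

Dropping the whole pair when one mate fails is a design choice made here so that the two read files stay synchronized; dedicated trimming tools usually also offer to keep the surviving mate as a single-end read.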

**darked89** · 07-24-2012, 08:53 AM

A few tips:

1) error correction prior to assembly
I used Coral (5 iterations) but you may check Quake or Reptile

2) depending on insert sizes of your libs, you may try to find overlaps between paired ends prior to assembly with FLASH

http://www.genomics.jhu.edu/software/FLASH/index.shtml

3) try more assemblers (Abyss, SGA, etc.), compare the results

If you feel like experimenting:

http://ged.msu.edu/papers/2012-diginorm/
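For tip 3, one common way to compare the output of several assemblers is to compute a few summary statistics, such as N50, from each set of contig lengths. The sketch below is a minimal example under stated assumptions: the function names, the 200 bp minimum contig length, and the contig lengths in the usage example are all made up for illustration, and N50 alone says nothing about misassemblies.

```python
def n50(lengths):
    """Return N50: the contig length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def summarize(name, lengths, min_len=200):
    """Summarize one assembly, ignoring contigs shorter than min_len.

    min_len=200 is an assumed cutoff, not a value any assembler mandates.
    """
    kept = [n for n in lengths if n >= min_len]
    return {
        "assembly": name,
        "contigs": len(kept),
        "total_bp": sum(kept),
        "n50": n50(kept),
        "longest": max(kept, default=0),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths from two assembler runs.
    runs = {
        "assembler_A": [150, 300, 500, 1200, 2500],
        "assembler_B": [250, 260, 400, 450, 900, 950],
    }
    for name, lengths in runs.items():
        print(summarize(name, lengths))
```

Applying the same `min_len` to every assembly matters, because assemblers differ in how many very short contigs they report and that alone can shift N50.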

**jimmybee** · 07-24-2012, 03:46 PM

Looks ok. I second darked89's point about trying different assemblers. Set up some bash scripts to experiment with different parameters too. I can't comment much on error correction, but considering your low coverage it wouldn't be a bad option.

Have a look at the Assemblathon papers to pick up some good assembly tips too.
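The parameter experiments suggested above could be organized along these lines. This is a sketch in Python rather than bash, and it only builds the command lines. The template `my_assembler --kmer {k} --out {out}` is a hypothetical placeholder, not the real command line of SOAPdenovo or any other assembler; substitute the actual invocation from the tool's documentation. Keeping k odd is the usual convention for de Bruijn graph assemblers.

```python
import shlex


def kmer_values(start, stop, step=4):
    """Return odd k-mer sizes from start to stop (inclusive).

    An even step is required so that every value stays odd.
    """
    if step <= 0 or step % 2:
        raise ValueError("step must be a positive even number")
    first = start if start % 2 else start + 1
    return list(range(first, stop + 1, step))


def build_commands(template, ks, prefix="asm"):
    """Fill a command template once per k, giving each run its own output name.

    The template must contain the placeholders {k} and {out}.
    """
    commands = []
    for k in ks:
        out = f"{prefix}_k{k}"
        commands.append(shlex.split(template.format(k=k, out=out)))
    return commands


if __name__ == "__main__":
    # Hypothetical template; replace with the real assembler invocation.
    template = "my_assembler --kmer {k} --out {out}"
    for cmd in build_commands(template, kmer_values(39, 55, step=8)):
        print(" ".join(cmd))
```

Each returned command is an argument list, so it can be handed to `subprocess.run(cmd, check=True)` or written into a cluster job script once the template matches a real tool.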

**Tong.W** · 07-24-2012, 05:14 PM

You would do better with more data: not just more coverage, but also other library types, such as a 2000 bp insert size or larger. A reference may not be a good guide for your assembly if the species are not closely related; sometimes it can introduce many errors.

**Wallysb01** · 07-24-2012, 05:51 PM

The tips given here are great, but with 8-12x coverage, you're not going to have a usable assembly. With sequencing errors, heterozygosity, and uneven coverage, you might only assemble 1/4 of the genome into contigs >200bp. Maybe you'd be able to get away with it if the genome had few repeats, but it sounds like that's not the case.

What you may find is that you can map single exons of genes to your contigs, but that will make orthology assignment difficult in many cases. So with that kind of coverage, I'd stick to aligning your reads to the most closely related species, and try to allow for some extra sequence divergence. In my experience, 30x coverage is about the minimum for de novo assembly from NGS data. You could probably get away with traditional Illumina libraries if you did 300, 600 and 1000bp inserts, and avoid the mate pair libraries, at least initially. Though if you want as good an assembly as possible, you'll need some 2-10kbp mate pair libraries as well.
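The coverage figures in this post follow from simple arithmetic: expected coverage is the total number of sequenced bases divided by the genome size. The sketch below works that through. The genome size (500 Mbp), read length (100 bp) and read count are hypothetical values picked only to make the numbers round; none of them come from the thread.

```python
import math


def coverage(n_reads, read_len, genome_size):
    """Expected mean coverage: total sequenced bases / genome size.

    For paired-end data, n_reads counts both mates.
    """
    return n_reads * read_len / genome_size


def reads_needed(target_cov, read_len, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_cov * genome_size / read_len)


if __name__ == "__main__":
    genome_size = 500_000_000   # hypothetical 500 Mbp genome
    read_len = 100              # hypothetical read length in bp
    n_reads = 50_000_000        # hypothetical number of reads in hand

    current = coverage(n_reads, read_len, genome_size)
    target = reads_needed(30, read_len, genome_size)
    print(f"current coverage: {current:.1f}x")
    print(f"reads for 30x: {target:,} ({target - n_reads:,} more)")
```

This is mean coverage only; as noted above, uneven coverage and sequencing errors mean the usable depth over any given region can be well below that average.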

**mabentley86** · 07-25-2012, 01:44 AM

Thanks for all the tips everyone, all very helpful.

The aim of gathering this kind of data was to build low quality assemblies that could be queried for coding sequences, without worrying so much about synteny or other genome features. This is why different insert sizes weren't used. I guess the aim was to build the cheapest genome that could still prove to be useful.

As Wallysb01 mentioned, the problem I'm finding is that it is difficult to assign orthology to hits. This is not such a problem for one of my genes of interest, but several of the others appear to have lineage-specific duplicate copies. Here the trouble is first identifying true exon hits, and then piecing them together correctly. The process is also very manual. Writing scripts wouldn't really help for this part, as I need to eyeball everything and tweak parameters just to be sure I'm getting the right things out.

I am going to try some of these suggestions and see if things improve. As ever, any further comments or suggestions much appreciated.

Michael

Initial Assembly Help - Short Reads Large Genome