SEQanswers

mandar.bobade60 09-15-2014 07:29 AM

de novo assembly using velvet and Amos
Hi all,

I am a newbie in NGS data analysis, and the first task that has come to me is de novo assembly. I have a plant mitochondrial genome to assemble, for which I have almost 24GB of data each for R1 and R2.
I am through with the QC analysis and have Velvet output: contigs.fa files for multiple kmer values (55, 95, 10). The read length is 101.

I got the following statistics for the kmer 85 contigs.fa file using QUAST:

Assembly                      contigs
# contigs (>= 0 bp)           2933
# contigs (>= 1000 bp)        274
Total length (>= 0 bp)        2145071
Total length (>= 1000 bp)     1433182
# contigs                     441
Largest contig                62822
Total length                  1548880
GC (%)                        45.35
N50                           7528
N75                           3479
L50                           51
L75                           129
# N's per 100 kbp             0.00

But I am stuck here now, since I have no idea how to judge whether this is good enough to proceed with or whether I should try something else. It would also be of great help if anyone could suggest further steps towards a well-assembled genome.
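
For anyone reading the QUAST report above: N50 is the contig length at which the contigs, sorted from longest to shortest, first cover half of the total assembly length, and L50 is how many contigs that takes. A minimal Python sketch follows. The function name `nx_lx` and the toy lengths are made up for illustration, and QUAST's own defaults (such as ignoring contigs shorter than 500 bp) are not reproduced here.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    fraction=0.5 gives N50/L50, fraction=0.75 gives N75/L75.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # 'length' is the shortest contig needed to reach the threshold,
            # 'count' is how many contigs were needed.
            return length, count
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths, not taken from the assembly in this thread.
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = nx_lx(toy_lengths)
n75, l75 = nx_lx(toy_lengths, fraction=0.75)
print("N50", n50, "L50", l50, "N75", n75, "L75", l75)
```

A higher N50 with a lower L50 means the assembly is less fragmented, but neither number says anything about whether the contigs are correct or even mitochondrial.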


WhatsOEver 09-16-2014 12:13 AM

At first glance, the mitochondrion seems quite enormous in size, with a really low GC content.
I would therefore assume that you have whole-genome sequencing data and that you didn't filter your reads in any way, did you?
Can you tell us which organism this is from?
Does "24GB of data" mean that you have 2x12GB fastq read files or that you have 24Gbp of sequence information?

The following paper on mitochondrial genome assembly from WGS might also be of interest for you:

mandar.bobade60 09-16-2014 02:59 AM

Thank you, WhatsOEver, for the paper link.

It's only mitochondrial data, with 24GB for each end, so 48GB collectively. The coverage is huge, which is why there is so much data. The only filtering was done using FastQC and FastUniq.

Brian Bushnell 09-16-2014 09:48 AM

I highly recommend subsampling that data; you have way too much to get a good assembly. Hard to say how much you need since mito vary in size. I'd start by subsampling by a factor of 200 and assembling again to get a better idea of how big the genome is (or you could estimate the size from a kmer frequency plot). Then, if you want to assemble with Velvet, subsample again or normalize to around 40x coverage.
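
The k-mer frequency estimate mentioned above can be sketched roughly as follows: count every k-mer in the reads, find the multiplicity at which the histogram peaks, and divide the total number of k-mers by that peak depth. This is only a toy illustration in Python. The function names, the simulated genome, and the error-free reads are all assumptions for the demo, it ignores reverse complements, and real data would need a dedicated k-mer counter rather than an in-memory dictionary.

```python
import random
from collections import Counter


def estimate_genome_size(reads, k):
    """Estimate genome size as total k-mers / peak k-mer depth."""
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram: multiplicity -> number of distinct k-mers seen that often.
    histogram = Counter(kmer_counts.values())
    # Skip multiplicity 1, where sequencing errors pile up in real data.
    peak_depth = max((m for m in histogram if m > 1), key=lambda m: histogram[m])
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth, peak_depth


def simulate_reads(genome, read_len, coverage, rng):
    """Draw error-free reads uniformly from one strand (toy model)."""
    n_reads = coverage * len(genome) // read_len
    starts = (rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(2000))  # hypothetical 2 kb genome
toy_reads = simulate_reads(toy_genome, read_len=50, coverage=30, rng=rng)
size_estimate, peak = estimate_genome_size(toy_reads, k=21)
print("peak k-mer depth:", peak, "estimated size:", round(size_estimate))
```

With contaminated or mixed data the histogram can show several peaks, so the plot itself is worth looking at before trusting a single number.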

You can subsample paired reads with my reformat tool, which will keep the pairing intact.
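
To show what "keeping the pairing intact" means, here is a minimal Python sketch of paired subsampling. It is a stand-in for illustration, not the reformat tool itself: one random decision is made per pair and applied to both mates. It assumes plain uncompressed 4-line FASTQ records in the same order in both files, and every file name and function name in it is hypothetical.

```python
import os
import random
import tempfile


def read_fastq(handle):
    """Yield 4-line FASTQ records as lists of lines."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, rate, seed=0):
    """Keep each read pair with probability 'rate'; return pairs kept."""
    rng = random.Random(seed)
    kept = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
            open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            # One draw per pair, so both mates are kept or dropped together.
            if rng.random() < rate:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept


# Demo on made-up data: 1000 tiny read pairs written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = {n: os.path.join(tmp, n + ".fastq") for n in ("r1", "r2", "sub1", "sub2")}
    for mate, key in ((1, "r1"), (2, "r2")):
        with open(paths[key], "w") as out:
            for i in range(1000):
                out.write(f"@read{i}/{mate}\nACGT\n+\nIIII\n")
    kept = subsample_pairs(paths["r1"], paths["r2"], paths["sub1"], paths["sub2"], rate=0.1)
    with open(paths["sub1"]) as s1, open(paths["sub2"]) as s2:
        names1 = [rec[0].split("/")[0] for rec in read_fastq(s1)]
        names2 = [rec[0].split("/")[0] for rec in read_fastq(s2)]

print("pairs kept:", kept, "mates still matched:", names1 == names2)
```

Subsampling R1 and R2 independently would break the pairing, which is why the same draw has to be used for both files.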

mandar.bobade60 09-23-2014 02:54 AM

Dear Brian Bushnell,
I did the subsampling, and after subsampling the N50 value increased substantially.
I have 101300000 reads, with an expected mitochondrial genome size of 715000 base pairs.
But the problem persists: even after picking the contig file with fewer contigs (around 90-100) and a good N50, aligning the raw reads back to that contig file gives horrible results (almost 91% failure).

Can anyone let me know how to proceed further? Since the genome is mitochondrial, I also don't have many options for multiple sequence alignment against related FASTA files.

Brian Bushnell 09-23-2014 11:10 AM

You still have ~14000x coverage, which is way too high. Like I said, you need to target closer to 40x coverage, or at least, no more than 100x.
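
The ~14000x figure can be reproduced with simple arithmetic: coverage is the number of reads times the read length, divided by the genome size. The Python sketch below assumes that 101300000 is the total number of reads (not pairs) and that every read is 101 bp, as stated earlier in the thread; the function names are made up for illustration.

```python
def coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def keep_fraction(target_coverage, current_coverage):
    """Fraction of reads to keep to bring coverage down to the target."""
    return min(1.0, target_coverage / current_coverage)


n_reads = 101_300_000      # reads reported after the first subsampling
read_length = 101          # bp, from the first post
genome_size = 715_000      # bp, expected mitochondrial genome size

current = coverage(n_reads, read_length, genome_size)
fraction_40x = keep_fraction(40, current)
fraction_100x = keep_fraction(100, current)

print(f"current coverage: about {current:.0f}x")
print(f"keep {fraction_40x:.4f} of reads for 40x ({round(fraction_40x * n_reads)} reads)")
print(f"keep {fraction_100x:.4f} of reads for 100x")
```

Under those assumptions the current coverage comes out at about 14310x, so roughly 0.28% of the reads would be enough for 40x.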

BLAST your contigs to see what they are, and BLAST a few unaligned reads to see what those are. You could have massive contamination. And anyway, it seems unlikely that you have 24GB of data on a mitochondrion. Why would anyone do that? It's very wasteful experimental design.

All times are GMT -8.