
**mastal** · 02-14-2014, 02:44 AM

Are you sure it's the contigs.fa file that is so large?

Usually I find it's the .afg file that is very large.

What is the size of the genome you're trying to assemble?

You could reduce the size of the contigs.fa file by setting the -min_contig_lgth parameter so as to remove very short contigs.

Are you using a very new version of velvet? I am not familiar with the 'auto' setting for -ins_length.

**Dagga** · 02-16-2014, 03:11 PM

Thanks!

Yep! I am sure it is the contigs.fa files: they range from 20-50 GB depending on the k-mer setting used for velvet. As far as I know, it is the latest version of velvet (v1.2.10).

All of the .afg files are around 2.5 GB.

There is only one genome of this species that has been sequenced and it is approx 12Mbp and I am expecting a genome length of about 8-12Mbp. However, I am sure there was some contamination in the sample so I am expecting a larger assembly.

Thanks for the min length comment! I'll give that a try now and see if that helps.

Cheers

**ctseto** · 02-16-2014, 04:48 PM

Is it repetitive? Polyploid? Have you done a dotplot of your organism against the other related reference?

**Dagga** · 02-17-2014, 12:55 AM

I don't think it is repetitive, and it is a bacterial genome, so it should be relatively simple. I haven't done a dotplot, but I think there would be a few differences from the other genome, so I don't think it would be close enough to use as a reference.

**Dagga** · 02-17-2014, 04:15 PM

I think I have found the reason why the contigs.fa files are so large - but I am still not sure how to fix it.

I managed to open one of the smaller contigs.fa files (0.5 GB) and found that several contigs contain very large (30-40 Mbp) spans of N's. This has happened in several different contigs, so I think this is why the files are so large.

My question now is - does anyone know why this is happening and how I can fix it?

I know about the -scaffolding no option, which would eliminate the N's completely, but I think that is a bit drastic, as a few N's joining contigs is OK.

Cheers!

**GenoMax** · 02-17-2014, 04:57 PM

Can you find out what fraction of the characters are N's in your files and what fraction are valid bases?

Valid bases:

Code:

$ grep -v '^>' test.fa | tr -dc 'ACGTacgt' | wc -c

The following should tell you how many N's there are:

Code:

$ grep -v '^>' test.fa | tr -dc 'Nn' | wc -c

If the N's are outnumbering valid bases then perhaps the assembly is not right.
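To see which contigs carry the N's, the same count can be broken down per contig. This is only a minimal sketch, assuming a POSIX shell with awk and a plain FASTA file; the function name `n_fraction_per_contig` is made up for illustration and is not part of velvet. It prints the contig name, total length, number of N's, and N fraction, tab-separated.

```shell
# Hypothetical helper (not a velvet tool): per-contig length and N fraction.
n_fraction_per_contig() {
    awk '
        # Print one summary row for the record that just ended.
        function report() {
            if (have) printf "%s\t%d\t%d\t%.3f\n", name, len, n, (len > 0 ? n / len : 0)
        }
        /^>/ {
            report()
            name = substr($1, 2)    # header up to first whitespace, minus ">"
            have = 1; len = 0; n = 0
            next
        }
        {
            len += length($0)       # count every character on a sequence line
            n += gsub(/[Nn]/, "")   # gsub returns how many N characters it matched
        }
        END { report() }
    ' "$1"
}
```

Something like `n_fraction_per_contig contigs.fa | sort -k4,4nr | head` would then list the contigs with the highest N fraction first.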

**GenoMax** · 02-17-2014, 05:01 PM

Mauve is an excellent tool to try to visualize genomes against each other. Pick the closest species available and try your assembly against it. I have a feeling that if you have too many N's this would not work.

**Dagga** · 02-17-2014, 05:19 PM

I have a feeling the insert length could be an issue. We used a Nextera prep, and according to the sequencing centre this gives a variable insert length. I am attempting to reassemble with a fixed velvetg insert length to see if this helps.

I will get back to you about the other questions asap.

**Dagga** · 02-17-2014, 05:21 PM

And it's > 90% N's for about 15 contigs. These are the very large contigs, with lengths of 15-50 Mbp...

The other contigs seem normal.

**GenoMax** · 02-17-2014, 05:26 PM

Put the "normal" contigs in a file and give mauve a try using a closely related species. That will give you some idea about the quality of the assembly.

Those large contigs with N's will hopefully be resolved with newer velvet runs.
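One way to put the "normal" contigs in their own file is to drop every contig whose N fraction is above some cutoff. The sketch below assumes a POSIX shell with awk; the function name `keep_low_n_contigs` and the default cutoff of 0.5 are illustrative choices, not anything velvet or Mauve prescribes. It reads the file twice so that the very long contigs never have to be held in memory.

```shell
# Hypothetical helper: print only contigs whose N fraction is <= a cutoff.
# Usage: keep_low_n_contigs FASTA [MAX_N_FRACTION]   (cutoff defaults to 0.5)
keep_low_n_contigs() {
    awk -v max="${2:-0.5}" '
        # Record whether the contig that just ended in pass 1 is under the cutoff.
        function decide() {
            if (idx > 0) keep[idx] = (len > 0 && n / len <= max)
        }
        # Pass 1 (first read of the file): tally length and N count per contig.
        NR == FNR {
            if ($0 ~ /^>/) { decide(); idx++; len = 0; n = 0 }
            else { len += length($0); n += gsub(/[Nn]/, "") }
            next
        }
        # Pass 2 (second read): print headers and sequence of kept contigs only.
        FNR == 1 { decide(); rec = 0 }
        /^>/ { rec++; printing = keep[rec] }
        printing { print }
    ' "$1" "$1"
}
```

For example, `keep_low_n_contigs contigs.fa 0.5 > normal_contigs.fa` would write the contigs that are at most half N's to a new file, which could then be loaded into Mauve alongside the related genome.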

**Dagga** · 02-17-2014, 05:31 PM

Great!

Thanks GenoMax, I'll give that a try.

Very large contig files using Velvet
