SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Unix one-line command for N50 size seqseq Bioinformatics 22 08-07-2014 05:02 AM
velvet N50 bioenvisage De novo discovery 12 07-19-2013 06:50 AM
Anyone can refer good and not too expensive exome sequencing facility? memento Core Facilities 12 02-24-2012 02:41 PM
N50 less than 2000 sarbashis Illumina/Solexa 4 09-07-2011 04:06 AM
SRMA Problem SAMRecord contig does not match the current reference sequence contig gavin.oliver Bioinformatics 5 07-05-2011 06:28 AM

Reply
 
Thread Tools
Old 10-06-2009, 09:56 PM   #1
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default N50 and N90 contig size refer to?

What is the general explanation of N50 and N90 contig size?
Regarding the "Instructions for scaffolding MIRA 454 contigs & 25KB paired-end data with BAMBUS.
Based on the MIRA Assembly Info/Bambus Scaffold info, can I know what is the N50 & N90 contig size refer to?
How can I obtain this value and how to calculate the N50 & N90 contig size?
Thanks a lot for all of your explanation and suggestion.
edge is offline   Reply With Quote
Old 10-06-2009, 11:14 PM   #2
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 621
Default

The N50 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 50% of the total length.

cheers,
Sven
sklages is offline   Reply With Quote
Old 10-06-2009, 11:31 PM   #3
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi,

Thanks for your info.
Do you have any idea about N90?
That means the N50 contig size, I just choose and calculate the smallest contig S in the sorted list of all contigs?
Thanks again for your explanation

Quote:
Originally Posted by sklages View Post
The N50 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 50% of the total length.

cheers,
Sven
edge is offline   Reply With Quote
Old 10-07-2009, 12:56 AM   #4
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 621
Default

Quote:
Originally Posted by edge View Post
Do you have any idea about N90?
I'd say,

The N90 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 90% of the total length.

:-)

Sven
sklages is offline   Reply With Quote
Old 10-07-2009, 12:59 AM   #5
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Thanks for your suggestion

I found out that sometimes the maximum contig size will exact same with the N50 contig size.
Can I know what is the reason ?
Thanks ya. I'm still new with bioinformatics. Learning process now.thus facing more problem

Quote:
Originally Posted by sklages View Post
I'd say,

The N90 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 90% of the total length.

:-)

Sven
edge is offline   Reply With Quote
Old 10-07-2009, 02:01 AM   #6
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

N50 = length-weighted median.: The size of the smallest contig such that 50% of the length of the genome is contained in contigs of size N50 or greater.
N90 is 90%.
If you have done the assembly work, and you have got the contigs in FASTA format, it is easy to calculate the N50 & N90 contig size, for example:
Code:
perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;[email protected],$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}elsif($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}'  contigs.fa

Last edited by BENM; 10-07-2009 at 02:38 AM.
BENM is offline   Reply With Quote
Old 10-07-2009, 02:12 AM   #7
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi BENM,

I just try the code that you give it to me.
It can't work d.
Do I miss anything or the code got problem?
After I run the code,the output result is empty d
Thanks for your help ^^
Quote:
Originally Posted by BENM View Post
N50 = length-weighted median.: The size of the smallest contig such that 50% of the length of the genome is contained in contigs of size N50 or greater.
N90 is 90%.
If you have done the assembly work, and you have got the contigs in FASTA format, it is easy to calculate the N50 & N90 contig size, for example:
Code:
perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;[email protected],$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if($count>=$total/2){$half=$x[j];print "N50: $x[j]\n" if ($half==0);}elsif($count>=$total*0.9){print "N90: $x[j]\n";exit;}}'  contigs.fa
edge is offline   Reply With Quote
Old 10-07-2009, 02:37 AM   #8
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Quote:
Originally Posted by edge View Post
Hi BENM,

I just try the code that you give it to me.
It can't work d.
Do I miss anything or the code got problem?
After I run the code,the output result is empty d
Thanks for your help ^^
hi edge,

I am sorry for a little mistake, you can type the below code into a perl script:
Code:
#/usr/bin/perl -w
use strict;
my ($len,$total)=(0,0);
my @x;
while(<>){
	if(/^[\>\@]/){
		if($len>0){
			$total+=$len;
			push @x,$len;
		}
		$len=0;
	}
	else{
		s/\s//g;
		$len+=length($_);
	}
}
if ($len>0){
	$total+=$len;
	push @x,$len;
}
@x=sort{$b<=>$a} @x; 
my ($count,$half)=(0,0);
for (my $j=0;$j<@x;$j++){
	$count+=$x[$j];
	if (($count>=$total/2)&&($half==0)){
		print "N50: $x[$j]\n";
		$half=$x[$j]
	}elsif ($count>=$total*0.9){
		print "N90: $x[$j]\n";
		exit;
	}
}
or run this command as before:
Code:
perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;[email protected],$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}elsif($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}' contigs.fa

Last edited by BENM; 10-07-2009 at 02:40 AM.
BENM is offline   Reply With Quote
Old 10-07-2009, 03:01 AM   #9
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Thanks BENM,
It is worked nice now ^^
I very thanks for your help.
edge is offline   Reply With Quote
Old 10-07-2009, 03:03 AM   #10
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi BENM,

Do you have used MIRA software before?
I facing some problem about how they calculate the N50 or N90 about their assembly output result

Quote:
Originally Posted by BENM View Post
hi edge,

I am sorry for a little mistake, you can type the below code into a perl script:
Code:
#/usr/bin/perl -w
use strict;
my ($len,$total)=(0,0);
my @x;
while(<>){
	if(/^[\>\@]/){
		if($len>0){
			$total+=$len;
			push @x,$len;
		}
		$len=0;
	}
	else{
		s/\s//g;
		$len+=length($_);
	}
}
if ($len>0){
	$total+=$len;
	push @x,$len;
}
@x=sort{$b<=>$a} @x; 
my ($count,$half)=(0,0);
for (my $j=0;$j<@x;$j++){
	$count+=$x[$j];
	if (($count>=$total/2)&&($half==0)){
		print "N50: $x[$j]\n";
		$half=$x[$j]
	}elsif ($count>=$total*0.9){
		print "N90: $x[$j]\n";
		exit;
	}
}
or run this command as before:
Code:
perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;[email protected],$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}elsif($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}' contigs.fa
edge is offline   Reply With Quote
Old 10-07-2009, 03:31 AM   #11
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

I am using this software, but not familiar. There are *_out.padded.fasta and *_out.unpadded.fasta in the ouput directory of "projectname_d_result". It defined contigs lenth >=500bp are large contigs. So in "projectname_d_info" directory, you can find the information in the file of *_info_assembly.txt.
BENM is offline   Reply With Quote
Old 10-07-2009, 07:25 AM   #12
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 621
Default

*_out.unpadded.fasta should be your firend when calculating contig sizes.

As BENM mentioned there is a lot of info in the info_assembly.txt

Sven
sklages is offline   Reply With Quote
Old 10-07-2009, 05:32 PM   #13
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi,

Do you know what is the difference of usage of *_out.padded.fasta and *_out.unpadded.fasta?
As I know *_out.padded.fasta all are lower capital and *_out.unpadded.fasta all are upper capital. Both of them are the exactly same content.
According to *_info_assembly.txt, I try to calculate the figure inside like N50,N90,minimum contig size and maximum contig size,etc based on the *.contig file at "projectname_d_result".
Unfortunately, the figure I find out can't match with the *_info_assembly.txt
Thus I feel quite confusing about the way they calculated N50,N90,etc at *_info_assembly.txt

Quote:
Originally Posted by BENM View Post
I am using this software, but not familiar. There are *_out.padded.fasta and *_out.unpadded.fasta in the ouput directory of "projectname_d_result". It defined contigs lenth >=500bp are large contigs. So in "projectname_d_info" directory, you can find the information in the file of *_info_assembly.txt.
edge is offline   Reply With Quote
Old 10-07-2009, 06:02 PM   #14
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi sklages,
Thanks for your suggestion.
I face some problems when try to find out the N50,N90,minimum contig size and maximum contig size,etc based on the *.contig file at "projectname_d_result".
The figure I find out can't match with the *_info_assembly.txt
Do you have any idea to calculate the N50,N90,minimum contig size and maximum contig size at *_info_assembly.txt ?
edge is offline   Reply With Quote
Old 10-07-2009, 06:17 PM   #15
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi BENM,
If I got a long list of contents:
scaff_123 20
scaff_223 60
scaff_122 1000
scaff_125 15
scaff_23 30
scaff_13 26
scaff_230 50
scaff_153 500
scaff_173 200

Based on the column two,
Do you have any idea how to calculate the N50 and N90 from this long list of contents?
I need to do descending order of this long list of contents before I calculate the N50 and N90,right?
Thanks again for your help
edge is offline   Reply With Quote
Old 10-08-2009, 12:17 AM   #16
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 621
Default

yes, sort the lengths and calculate N50, e.g.

the sum of all scaffolds is 1901, the biggest contig is 1000, so, in this case, your N50 is 1000, as this is more than 50% of the total contig size ...

For N90 you need to calculate accordingly ..

cheers,
Sven
sklages is offline   Reply With Quote
Old 10-08-2009, 12:20 AM   #17
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi sklages,

Thanks a lot for your explanation.
I fully understand about N50 and N90 now
Really thanks a lot again ^^

Quote:
Originally Posted by sklages View Post
yes, sort the lengths and calculate N50, e.g.

the sum of all scaffolds is 1901, the biggest contig is 1000, so, in this case, your N50 is 1000, as this is more than 50% of the total contig size ...

For N90 you need to calculate accordingly ..

cheers,
Sven
edge is offline   Reply With Quote
Old 10-08-2009, 12:28 AM   #18
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 621
Default

Quote:
Originally Posted by edge View Post
Hi sklages,
Thanks for your suggestion.
I face some problems when try to find out the N50,N90,minimum contig size and maximum contig size,etc based on the *.contig file at "projectname_d_result".
The figure I find out can't match with the *_info_assembly.txt
Do you have any idea to calculate the N50,N90,minimum contig size and maximum contig size at *_info_assembly.txt ?
I do calculate N50 using *_info_contigstats.txt (which gives the same as results as if I use the contigs.fasta file).
This gives the same as the N50 calculated in *_info_assembly.txt (Section "All Contigs"!).

Btw, .. the padded fasta output contains the sequences with pads (if there are pads). The unpadded sequence has all pads been removed and is usually used for further analysis (but this depends on what you are doing).

cheers,
Sven
sklages is offline   Reply With Quote
Old 10-08-2009, 01:02 AM   #19
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi sklages,

You are right d.
I think I do calculate N50 using *_info_contigstats.txt (which gives the same as results as if I use the contigs.fasta file).
At the above sentence, we should use *_out.unpadded.fasta instead of contigs.fasta file to calculate N50,right?
I try using contigs.fasta file to calculate the N50 but it can't match with the
*_info_contigstats.txt. However, *_out.unpadded.fasta can do it
It seem like contigs.fasta file same with the *_out.padded.fasta file, both just the header a bit different, right?
Besides that, refer to the *_info_assembly.txt at Section "All Contigs". I try to use the *_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta to find out the "Total consensus","Number of contigs",etc at Section "All Contigs".
Unfortunately, it can't match with the figure at Section "All Contigs".
Do you have any idea about this problem facing?
It seem like all the figure that I get from the file (*_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta) is lesser than the figure at Section "All Contigs"


Quote:
Originally Posted by sklages View Post
I do calculate N50 using *_info_contigstats.txt (which gives the same as results as if I use the contigs.fasta file).
cheers,
Sven
edge is offline   Reply With Quote
Old 10-08-2009, 01:10 AM   #20
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 621
Default

Quote:
Originally Posted by edge View Post
Hi sklages,

[...]Besides that, refer to the *_info_assembly.txt at Section "All Contigs". I try to use the *_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta to find out the "Total consensus","Number of contigs",etc at Section "All Contigs".
Unfortunately, it can't match with the figure at Section "All Contigs".
Do you have any idea about this problem facing?
It seem like all the figure that I get from the file (*_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta) is lesser than the figure at Section "All Contigs"
Well, you are right. I just checked for this as well. N50 and "largest contig" are "correct" (in a sense how I calculated it), all other numbers differ slightly from what I was counting ...

No idea why ..
Sven
sklages is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:13 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO