SEQanswers N50 and N90 contig size refer to?
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post seqseq Bioinformatics 22 08-07-2014 04:02 AM bioenvisage De novo discovery 12 07-19-2013 05:50 AM memento Core Facilities 12 02-24-2012 01:41 PM sarbashis Illumina/Solexa 4 09-07-2011 03:06 AM gavin.oliver Bioinformatics 5 07-05-2011 05:28 AM

 10-06-2009, 08:56 PM #1 edge Senior Member   Location: China Join Date: Sep 2009 Posts: 199 N50 and N90 contig size refer to? What is the general explanation of N50 and N90 contig size? Regarding the "Instructions for scaffolding MIRA 454 contigs & 25KB paired-end data with BAMBUS. Based on the MIRA Assembly Info/Bambus Scaffold info, can I know what is the N50 & N90 contig size refer to? How can I obtain this value and how to calculate the N50 & N90 contig size? Thanks a lot for all of your explanation and suggestion.
 10-06-2009, 10:14 PM #2 sklages Senior Member   Location: Berlin, DE Join Date: May 2008 Posts: 628 The N50 contig size is a weighted median value and defined as the length of the smallest contig S in the sorted list of all contigs where the cumulative length from the largest contig to contig S is at least 50% of the total length. cheers, Sven
10-06-2009, 10:31 PM   #3
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi,

Do you have any idea about N90?
That means the N50 contig size, I just choose and calculate the smallest contig S in the sorted list of all contigs?

Quote:
 Originally Posted by sklages The N50 contig size is a weighted median value and defined as the length of the smallest contig S in the sorted list of all contigs where the cumulative length from the largest contig to contig S is at least 50% of the total length. cheers, Sven

10-06-2009, 11:56 PM   #4
sklages
Senior Member

Location: Berlin, DE

Join Date: May 2008
Posts: 628

Quote:
 Originally Posted by edge Do you have any idea about N90?
I'd say,

The N90 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 90% of the total length.

:-)

Sven

10-06-2009, 11:59 PM   #5
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

I found out that sometimes the maximum contig size will exact same with the N50 contig size.
Can I know what is the reason ?
Thanks ya. I'm still new with bioinformatics. Learning process now.thus facing more problem

Quote:
 Originally Posted by sklages I'd say, The N90 contig size is a weighted median value and defined as the length of the smallest contig S in the sorted list of all contigs where the cumulative length from the largest contig to contig S is at least 90% of the total length. :-) Sven

 10-07-2009, 01:01 AM #6 BENM Member   Location: PRC Join Date: May 2009 Posts: 33 N50 = length-weighted median.: The size of the smallest contig such that 50% of the length of the genome is contained in contigs of size N50 or greater. N90 is 90%. If you have done the assembly work, and you have got the contigs in FASTA format, it is easy to calculate the N50 & N90 contig size, for example: Code: `perl -e 'my (\$len,\$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if(\$len>0){\$total+=\$len;push@x,\$len;};\$len=0;}else{s/\s//g;\$len+=length(\$_);}}if (\$len>0){\$total+=\$len;push @x,\$len;}@x=sort{\$b<=>\$a}@x; my (\$count,\$half)=(0,0);for (my \$j=0;\$j<@x;\$j++){\$count+=\$x[\$j];if((\$count>=\$total/2)&&(\$half==0)){print "N50: \$x[\$j]\n";\$half=\$x[\$j]}elsif(\$count>=\$total*0.9){print "N90: \$x[\$j]\n";exit;}}' contigs.fa` Last edited by BENM; 10-07-2009 at 01:38 AM.
10-07-2009, 01:12 AM   #7
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi BENM,

I just try the code that you give it to me.
It can't work d.
Do I miss anything or the code got problem?
After I run the code,the output result is empty d
Quote:
 Originally Posted by BENM N50 = length-weighted median.: The size of the smallest contig such that 50% of the length of the genome is contained in contigs of size N50 or greater. N90 is 90%. If you have done the assembly work, and you have got the contigs in FASTA format, it is easy to calculate the N50 & N90 contig size, for example: Code: `perl -e 'my (\$len,\$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if(\$len>0){\$total+=\$len;push@x,\$len;};\$len=0;}else{s/\s//g;\$len+=length(\$_);}}if (\$len>0){\$total+=\$len;push @x,\$len;}@x=sort{\$b<=>\$a}@x; my (\$count,\$half)=(0,0);for (my \$j=0;\$j<@x;\$j++){\$count+=\$x[\$j];if(\$count>=\$total/2){\$half=\$x[j];print "N50: \$x[j]\n" if (\$half==0);}elsif(\$count>=\$total*0.9){print "N90: \$x[j]\n";exit;}}' contigs.fa`

10-07-2009, 01:37 AM   #8
BENM
Member

Location: PRC

Join Date: May 2009
Posts: 33

Quote:
 Originally Posted by edge Hi BENM, I just try the code that you give it to me. It can't work d. Do I miss anything or the code got problem? After I run the code,the output result is empty d Thanks for your help ^^
hi edge,

I am sorry for a little mistake, you can type the below code into a perl script:
Code:
```#/usr/bin/perl -w
use strict;
my (\$len,\$total)=(0,0);
my @x;
while(<>){
if(/^[\>\@]/){
if(\$len>0){
\$total+=\$len;
push @x,\$len;
}
\$len=0;
}
else{
s/\s//g;
\$len+=length(\$_);
}
}
if (\$len>0){
\$total+=\$len;
push @x,\$len;
}
@x=sort{\$b<=>\$a} @x;
my (\$count,\$half)=(0,0);
for (my \$j=0;\$j<@x;\$j++){
\$count+=\$x[\$j];
if ((\$count>=\$total/2)&&(\$half==0)){
print "N50: \$x[\$j]\n";
\$half=\$x[\$j]
}elsif (\$count>=\$total*0.9){
print "N90: \$x[\$j]\n";
exit;
}
}```
or run this command as before:
Code:
`perl -e 'my (\$len,\$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if(\$len>0){\$total+=\$len;push@x,\$len;};\$len=0;}else{s/\s//g;\$len+=length(\$_);}}if (\$len>0){\$total+=\$len;push @x,\$len;}@x=sort{\$b<=>\$a}@x; my (\$count,\$half)=(0,0);for (my \$j=0;\$j<@x;\$j++){\$count+=\$x[\$j];if((\$count>=\$total/2)&&(\$half==0)){print "N50: \$x[\$j]\n";\$half=\$x[\$j]}elsif(\$count>=\$total*0.9){print "N90: \$x[\$j]\n";exit;}}' contigs.fa`

Last edited by BENM; 10-07-2009 at 01:40 AM.

 10-07-2009, 02:01 AM #9 edge Senior Member   Location: China Join Date: Sep 2009 Posts: 199 Thanks BENM, It is worked nice now ^^ I very thanks for your help.
10-07-2009, 02:03 AM   #10
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi BENM,

Do you have used MIRA software before?
I facing some problem about how they calculate the N50 or N90 about their assembly output result

Quote:
 Originally Posted by BENM hi edge, I am sorry for a little mistake, you can type the below code into a perl script: Code: ```#/usr/bin/perl -w use strict; my (\$len,\$total)=(0,0); my @x; while(<>){ if(/^[\>\@]/){ if(\$len>0){ \$total+=\$len; push @x,\$len; } \$len=0; } else{ s/\s//g; \$len+=length(\$_); } } if (\$len>0){ \$total+=\$len; push @x,\$len; } @x=sort{\$b<=>\$a} @x; my (\$count,\$half)=(0,0); for (my \$j=0;\$j<@x;\$j++){ \$count+=\$x[\$j]; if ((\$count>=\$total/2)&&(\$half==0)){ print "N50: \$x[\$j]\n"; \$half=\$x[\$j] }elsif (\$count>=\$total*0.9){ print "N90: \$x[\$j]\n"; exit; } }``` or run this command as before: Code: `perl -e 'my (\$len,\$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if(\$len>0){\$total+=\$len;push@x,\$len;};\$len=0;}else{s/\s//g;\$len+=length(\$_);}}if (\$len>0){\$total+=\$len;push @x,\$len;}@x=sort{\$b<=>\$a}@x; my (\$count,\$half)=(0,0);for (my \$j=0;\$j<@x;\$j++){\$count+=\$x[\$j];if((\$count>=\$total/2)&&(\$half==0)){print "N50: \$x[\$j]\n";\$half=\$x[\$j]}elsif(\$count>=\$total*0.9){print "N90: \$x[\$j]\n";exit;}}' contigs.fa`

 10-07-2009, 02:31 AM #11 BENM Member   Location: PRC Join Date: May 2009 Posts: 33 I am using this software, but not familiar. There are *_out.padded.fasta and *_out.unpadded.fasta in the ouput directory of "projectname_d_result". It defined contigs lenth >=500bp are large contigs. So in "projectname_d_info" directory, you can find the information in the file of *_info_assembly.txt.
 10-07-2009, 06:25 AM #12 sklages Senior Member   Location: Berlin, DE Join Date: May 2008 Posts: 628 *_out.unpadded.fasta should be your firend when calculating contig sizes. As BENM mentioned there is a lot of info in the info_assembly.txt Sven
10-07-2009, 04:32 PM   #13
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi,

Do you know what is the difference of usage of *_out.padded.fasta and *_out.unpadded.fasta?
As I know *_out.padded.fasta all are lower capital and *_out.unpadded.fasta all are upper capital. Both of them are the exactly same content.
According to *_info_assembly.txt, I try to calculate the figure inside like N50,N90,minimum contig size and maximum contig size,etc based on the *.contig file at "projectname_d_result".
Unfortunately, the figure I find out can't match with the *_info_assembly.txt
Thus I feel quite confusing about the way they calculated N50,N90,etc at *_info_assembly.txt

Quote:
 Originally Posted by BENM I am using this software, but not familiar. There are *_out.padded.fasta and *_out.unpadded.fasta in the ouput directory of "projectname_d_result". It defined contigs lenth >=500bp are large contigs. So in "projectname_d_info" directory, you can find the information in the file of *_info_assembly.txt.

 10-07-2009, 05:02 PM #14 edge Senior Member   Location: China Join Date: Sep 2009 Posts: 199 Hi sklages, Thanks for your suggestion. I face some problems when try to find out the N50,N90,minimum contig size and maximum contig size,etc based on the *.contig file at "projectname_d_result". The figure I find out can't match with the *_info_assembly.txt Do you have any idea to calculate the N50,N90,minimum contig size and maximum contig size at *_info_assembly.txt ?
 10-07-2009, 05:17 PM #15 edge Senior Member   Location: China Join Date: Sep 2009 Posts: 199 Hi BENM, If I got a long list of contents: scaff_123 20 scaff_223 60 scaff_122 1000 scaff_125 15 scaff_23 30 scaff_13 26 scaff_230 50 scaff_153 500 scaff_173 200 Based on the column two, Do you have any idea how to calculate the N50 and N90 from this long list of contents? I need to do descending order of this long list of contents before I calculate the N50 and N90,right? Thanks again for your help
 10-07-2009, 11:17 PM #16 sklages Senior Member   Location: Berlin, DE Join Date: May 2008 Posts: 628 yes, sort the lengths and calculate N50, e.g. the sum of all scaffolds is 1901, the biggest contig is 1000, so, in this case, your N50 is 1000, as this is more than 50% of the total contig size ... For N90 you need to calculate accordingly .. cheers, Sven
10-07-2009, 11:20 PM   #17
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi sklages,

Thanks a lot for your explanation.
I fully understand about N50 and N90 now
Really thanks a lot again ^^

Quote:
 Originally Posted by sklages yes, sort the lengths and calculate N50, e.g. the sum of all scaffolds is 1901, the biggest contig is 1000, so, in this case, your N50 is 1000, as this is more than 50% of the total contig size ... For N90 you need to calculate accordingly .. cheers, Sven

10-07-2009, 11:28 PM   #18
sklages
Senior Member

Location: Berlin, DE

Join Date: May 2008
Posts: 628

Quote:
 Originally Posted by edge Hi sklages, Thanks for your suggestion. I face some problems when try to find out the N50,N90,minimum contig size and maximum contig size,etc based on the *.contig file at "projectname_d_result". The figure I find out can't match with the *_info_assembly.txt Do you have any idea to calculate the N50,N90,minimum contig size and maximum contig size at *_info_assembly.txt ?
I do calculate N50 using *_info_contigstats.txt (which gives the same as results as if I use the contigs.fasta file).
This gives the same as the N50 calculated in *_info_assembly.txt (Section "All Contigs"!).

Btw, .. the padded fasta output contains the sequences with pads (if there are pads). The unpadded sequence has all pads been removed and is usually used for further analysis (but this depends on what you are doing).

cheers,
Sven

10-08-2009, 12:02 AM   #19
edge
Senior Member

Location: China

Join Date: Sep 2009
Posts: 199

Hi sklages,

You are right d.
I think I do calculate N50 using *_info_contigstats.txt (which gives the same as results as if I use the contigs.fasta file).
At the above sentence, we should use *_out.unpadded.fasta instead of contigs.fasta file to calculate N50,right?
I try using contigs.fasta file to calculate the N50 but it can't match with the
*_info_contigstats.txt. However, *_out.unpadded.fasta can do it
It seem like contigs.fasta file same with the *_out.padded.fasta file, both just the header a bit different, right?
Besides that, refer to the *_info_assembly.txt at Section "All Contigs". I try to use the *_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta to find out the "Total consensus","Number of contigs",etc at Section "All Contigs".
Unfortunately, it can't match with the figure at Section "All Contigs".
It seem like all the figure that I get from the file (*_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta) is lesser than the figure at Section "All Contigs"

Quote:
 Originally Posted by sklages I do calculate N50 using *_info_contigstats.txt (which gives the same as results as if I use the contigs.fasta file). cheers, Sven

10-08-2009, 12:10 AM   #20
sklages
Senior Member

Location: Berlin, DE

Join Date: May 2008
Posts: 628

Quote:
 Originally Posted by edge Hi sklages, [...]Besides that, refer to the *_info_assembly.txt at Section "All Contigs". I try to use the *_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta to find out the "Total consensus","Number of contigs",etc at Section "All Contigs". Unfortunately, it can't match with the figure at Section "All Contigs". Do you have any idea about this problem facing? It seem like all the figure that I get from the file (*_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta) is lesser than the figure at Section "All Contigs"
Well, you are right. I just checked for this as well. N50 and "largest contig" are "correct" (in a sense how I calculated it), all other numbers differ slightly from what I was counting ...

No idea why ..
Sven