SEQanswers

edge 10-06-2009 08:56 PM

N50 and N90 contig size refer to?
 
What do the N50 and N90 contig sizes mean in general?
I am following the "Instructions for scaffolding MIRA 454 contigs & 25KB paired-end data with BAMBUS".
In the MIRA assembly info and the Bambus scaffold info, what do the N50 and N90 contig sizes refer to?
How can I obtain these values, and how are the N50 and N90 contig sizes calculated?
Thanks a lot for any explanations and suggestions.

sklages 10-06-2009 10:14 PM

The N50 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 50% of the total length.
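
As a rough illustration of this definition, here is a minimal Python sketch (Python rather than Perl, and not taken from any assembler). The function name nx and the example lengths are made up; passing fraction=0.9 instead of the default would give the 90% counterpart.

```python
def nx(lengths, fraction=0.5):
    """Return the length of the contig at which the running total,
    counted from the largest contig down, first reaches `fraction`
    of the summed length of all contigs."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Made-up lengths for illustration: the total is 100, so the 50% mark is 50.
example = [40, 30, 20, 10]
print("N50:", nx(example))   # 40 + 30 = 70 >= 50, so this prints N50: 30
```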

cheers,
Sven

edge 10-06-2009 10:31 PM

Hi,

Thanks for the info.
Do you have any idea about N90?
So for the N50 contig size, do I just pick the smallest contig S in the sorted list of all contigs?
Thanks again for your explanation :)

sklages 10-06-2009 11:56 PM

Quote:

Originally Posted by edge (Post 9037)
Do you have any idea about N90?

I'd say,

The N90 contig size is a weighted median value and defined as
the length of the smallest contig S in the sorted list of all
contigs where the cumulative length from the largest contig to
contig S is at least 90% of the total length.

:-)

Sven

edge 10-06-2009 11:59 PM

Thanks for your suggestion :)

I found that the maximum contig size is sometimes exactly the same as the N50 contig size.
What is the reason for that?
Thanks. I'm still new to bioinformatics and still learning, so I keep running into problems :(


BENM 10-07-2009 01:01 AM

N50 = length-weighted median: the size of the smallest contig such that 50% of the total length of the assembly is contained in contigs of size N50 or greater.
N90 is the same with 90%.
If you have finished the assembly and have the contigs in FASTA format, it is easy to calculate the N50 and N90 contig sizes, for example:
Code:

perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;push@x,$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}if($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}' contigs.fa

edge 10-07-2009 01:12 AM

Hi BENM,

I just tried the code you gave me.
It doesn't work.
Am I missing something, or is there a problem with the code?
After I run it, the output is empty :confused:
Thanks for your help ^^
Quote:

Originally Posted by BENM (Post 9043)
N50 = length-weighted median: the size of the smallest contig such that 50% of the total length of the assembly is contained in contigs of size N50 or greater.
N90 is the same with 90%.
If you have finished the assembly and have the contigs in FASTA format, it is easy to calculate the N50 and N90 contig sizes, for example:
Code:

perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;push@x,$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if($count>=$total/2){$half=$x[j];print "N50: $x[j]\n" if ($half==0);}elsif($count>=$total*0.9){print "N90: $x[j]\n";exit;}}'  contigs.fa


BENM 10-07-2009 01:37 AM

Quote:

Originally Posted by edge (Post 9045)
Hi BENM,

I just tried the code you gave me.
It doesn't work.
Am I missing something, or is there a problem with the code?
After I run it, the output is empty :confused:
Thanks for your help ^^

Hi edge,

Sorry, I made a small mistake. You can save the code below as a Perl script:
Code:

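#!/usr/bin/perl -w
# Print the N50 and N90 contig sizes of a FASTA file (file argument or STDIN).
use strict;
my ($len,$total)=(0,0);
my @x;                                  # one length per contig
while(<>){
        if(/^[\>\@]/){                  # header line: close the previous record
                if($len>0){
                        $total+=$len;
                        push @x,$len;
                }
                $len=0;
        }
        else{
                s/\s//g;                # strip whitespace, including the newline
                $len+=length($_);
        }
}
if ($len>0){                            # the last record has no following header
        $total+=$len;
        push @x,$len;
}
@x=sort{$b<=>$a} @x;                    # longest contig first
my ($count,$half)=(0,0);
for (my $j=0;$j<@x;$j++){
        $count+=$x[$j];
        if (($count>=$total/2)&&($half==0)){
                print "N50: $x[$j]\n";
                $half=$x[$j];
        }
        # Separate test, so that one contig can be both the N50 and the N90.
        if ($count>=$total*0.9){
                print "N90: $x[$j]\n";
                exit;
        }
}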

or run this command as before:
Code:

perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;push@x,$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}if($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}' contigs.fa
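
For readers who prefer Python, the same idea might be sketched as below. This is not something BENM posted: fasta_lengths, nx and the toy sequences are hypothetical, and the sketch assumes plain FASTA with ">" headers. It tests the 50% and 90% marks independently, so a single dominant contig can be reported for both.

```python
import io


def fasta_lengths(handle):
    """Yield the sequence length of every non-empty FASTA record."""
    length = 0
    for line in handle:
        if line.startswith(">"):        # a header closes the previous record
            if length > 0:
                yield length
            length = 0
        else:
            length += len("".join(line.split()))   # ignore all whitespace
    if length > 0:                      # the final record has no trailing header
        yield length


def nx(lengths, percent):
    """Length of the contig where the running total, largest contig first,
    reaches `percent` percent of the summed length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:   # integer-only comparison
            return length
    return None                         # only reached for an empty list


# Toy input standing in for a real contigs file.
toy_fasta = io.StringIO(">c1\nACGTACGTAC\nACGTA\n>c2\nACGTAC\n>c3\nACG\n")
lengths = list(fasta_lengths(toy_fasta))
print("lengths:", lengths)              # lengths: [15, 6, 3]
print("N50:", nx(lengths, 50))          # N50: 15
print("N90:", nx(lengths, 90))          # N90: 3
```

To try it on a real file, replace the StringIO object with an open file handle, for example open("contigs.fa").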

edge 10-07-2009 02:01 AM

Thanks BENM,
It works nicely now ^^
Thank you very much for your help.

edge 10-07-2009 02:03 AM

Hi BENM,

Have you used the MIRA software before?
I am having trouble understanding how it calculates the N50 and N90 in its assembly output :confused:


BENM 10-07-2009 02:31 AM

I use this software, but I am not very familiar with it. There are *_out.padded.fasta and *_out.unpadded.fasta files in the output directory "projectname_d_result". MIRA defines contigs with a length of >=500 bp as large contigs. In the "projectname_d_info" directory, you can find this information in the *_info_assembly.txt file.
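
To see how such a size cutoff can change the reported statistics, here is a small Python sketch. The 500 bp threshold comes from the post above; the function name summarize, its min_length parameter and the contig lengths are invented for illustration, and nothing here is taken from MIRA's own code.

```python
def summarize(lengths, min_length=0):
    """Basic size statistics over the contigs that are at least
    `min_length` bases long; returns None if no contig qualifies."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return None
    total = sum(kept)
    running = 0
    n50 = None
    for n in kept:
        running += n
        if running * 2 >= total:        # reached 50% of the kept total
            n50 = n
            break
    return {
        "contigs": len(kept),
        "total": total,
        "largest": kept[0],
        "smallest": kept[-1],
        "n50": n50,
    }


# Invented lengths: three large contigs and many short ones.
lengths = [3000, 1000, 800, 450, 450, 400, 400, 300, 300, 300]
print("all contigs:  ", summarize(lengths))        # n50 is 1000
print("large contigs:", summarize(lengths, 500))   # n50 is 3000
```

With many short contigs in the set, dropping them raises the N50 noticeably, so it matters which set of contigs a reported figure was computed from.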

sklages 10-07-2009 06:25 AM

*_out.unpadded.fasta should be your friend when calculating contig sizes.

As BENM mentioned, there is a lot of info in *_info_assembly.txt.

Sven

edge 10-07-2009 04:32 PM

Hi,

Do you know how the uses of *_out.padded.fasta and *_out.unpadded.fasta differ?
As far as I can see, *_out.padded.fasta is all lower case and *_out.unpadded.fasta is all upper case, and otherwise both have exactly the same content.
I tried to reproduce the figures in *_info_assembly.txt, such as N50, N90, minimum contig size and maximum contig size, from the *.contig file in "projectname_d_result".
Unfortunately, the figures I get do not match *_info_assembly.txt :confused:
So I am quite confused about how the N50, N90, etc. in *_info_assembly.txt are calculated :confused:


edge 10-07-2009 05:02 PM

Hi sklages,
Thanks for your suggestion.
I ran into problems when trying to work out the N50, N90, minimum contig size, maximum contig size, etc. from the *.contig file in "projectname_d_result".
The figures I get do not match *_info_assembly.txt.
Do you have any idea how the N50, N90, minimum contig size and maximum contig size in *_info_assembly.txt are calculated?

edge 10-07-2009 05:17 PM

Hi BENM,
Suppose I have a long list like this:
scaff_123 20
scaff_223 60
scaff_122 1000
scaff_125 15
scaff_23 30
scaff_13 26
scaff_230 50
scaff_153 500
scaff_173 200

Based on column two, do you have any idea how to calculate the N50 and N90 from this list?
I need to sort the list in descending order before calculating the N50 and N90, right?
Thanks again for your help :)

sklages 10-07-2009 11:17 PM

Yes, sort the lengths and then calculate N50. For example:

The sum of all scaffolds is 1901 and the biggest contig is 1000, so in this case your N50 is 1000, as this alone is already more than 50% of the total contig size ...

For N90, calculate accordingly ..
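
The arithmetic for this list can be checked with a few lines of Python (a sketch only; the variable names are arbitrary, and integer comparisons are used to avoid rounding at the 90% mark).

```python
listing = """\
scaff_123 20
scaff_223 60
scaff_122 1000
scaff_125 15
scaff_23 30
scaff_13 26
scaff_230 50
scaff_153 500
scaff_173 200
"""

# Column two holds the lengths; sort them in descending order.
lengths = sorted((int(line.split()[1]) for line in listing.splitlines()),
                 reverse=True)
total = sum(lengths)

running = 0
n50 = n90 = None
for length in lengths:
    running += length
    if n50 is None and running * 2 >= total:         # 50% of the total
        n50 = length
    if n90 is None and running * 10 >= total * 9:    # 90% of the total
        n90 = length

print(f"total={total} N50={n50} N90={n90}")   # total=1901 N50=1000 N90=60
```

The running totals are 1000, 1500, 1700 and 1760; 90% of 1901 is 1710.9, so the 90% mark is first reached at the fourth contig, which has length 60.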

cheers,
Sven

edge 10-07-2009 11:20 PM

Hi sklages,

Thanks a lot for your explanation.
I fully understand N50 and N90 now :)
Thanks a lot again ^^


sklages 10-07-2009 11:28 PM

Quote:

Originally Posted by edge (Post 9100)
Hi sklages,
Thanks for your suggestion.
I ran into problems when trying to work out the N50, N90, minimum contig size, maximum contig size, etc. from the *.contig file in "projectname_d_result".
The figures I get do not match *_info_assembly.txt.
Do you have any idea how the N50, N90, minimum contig size and maximum contig size in *_info_assembly.txt are calculated?

I calculate N50 from *_info_contigstats.txt (which gives the same results as using the contigs.fasta file).
This gives the same value as the N50 reported in *_info_assembly.txt (Section "All Contigs"!).

Btw, the padded FASTA output contains the sequences with pads (if there are any). The unpadded sequence has had all pads removed and is usually used for further analysis (but this depends on what you are doing).
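
A tiny Python sketch of why padded and unpadded lengths can differ. It assumes that pads are written as "*" in the padded consensus, which should be checked against your own files; the sequence and the function name unpad are made up.

```python
PAD = "*"   # assumption: the pad (gap) character used in the padded output


def unpad(sequence, pad=PAD):
    """Return the sequence with every pad character removed."""
    return sequence.replace(pad, "")


padded = "ACGT*ACG**TTGA"          # invented consensus with three pads
unpadded = unpad(padded)
print(len(padded), len(unpadded))  # 14 11
```

Under that assumption, contig sizes measured on a padded file come out larger whenever a contig contains pads, which is one possible reason for N50 values that do not agree between files.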

cheers,
Sven

edge 10-08-2009 12:02 AM

Hi sklages,

You are right.
Regarding your sentence about calculating N50 from *_info_contigstats.txt (which gives the same results as the contigs.fasta file): we should use *_out.unpadded.fasta instead of the contigs.fasta file to calculate N50, right?
I tried using the contigs.fasta file to calculate the N50, but the result does not match *_info_contigstats.txt. With *_out.unpadded.fasta it does :)
It seems that the contigs.fasta file is the same as the *_out.padded.fasta file and only the headers differ a bit, right?
Besides that, regarding Section "All Contigs" of *_info_assembly.txt: I tried to use *_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta to reproduce "Total consensus", "Number of contigs", etc. from that section.
Unfortunately, my numbers do not match the figures in Section "All Contigs".
Do you have any idea what causes this?
It seems that all the figures I get from the files (*_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta) are lower than the figures in Section "All Contigs" :(


sklages 10-08-2009 12:10 AM

Quote:

Originally Posted by edge (Post 9109)
Hi sklages,

[...] Besides that, regarding Section "All Contigs" of *_info_assembly.txt: I tried to use *_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta to reproduce "Total consensus", "Number of contigs", etc. from that section.
Unfortunately, my numbers do not match the figures in Section "All Contigs".
Do you have any idea what causes this?
It seems that all the figures I get from the files (*_info_contigstats.txt/contigs.fasta file/*_out.unpadded.fasta) are lower than the figures in Section "All Contigs" :(

Well, you are right. I just checked this as well. N50 and "largest contig" are "correct" (in the sense that they match how I calculated them); all other numbers differ slightly from what I counted ...

No idea why ..
Sven

