SEQanswers

Old 08-17-2009, 06:13 AM   #1
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 150
N50 explained

Could someone please explain the meaning of N50? The values supposedly provide a standard measure of assembly connectivity.


Martin
Old 08-17-2009, 08:54 AM   #2
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 86

from a Broad Institute site:

"N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly.

Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6. "

The original site has a self-signed security certificate, but you can accept it and look here:
https://www.broad.harvard.edu/crd/wiki/index.php/N50
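The Broad recipe quoted above can be sketched in a few lines of Python (the function name is mine; it literally builds the expanded list, so it is only practical for small examples):

```python
import statistics

def n50_broad(lengths):
    """Broad-style N50: replace each length n with n copies of itself,
    then take the median of the expanded list (a weighted median)."""
    expanded = [n for n in lengths for _ in range(n)]
    return statistics.median(expanded)

print(n50_broad([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 6.0, as in the quote
```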
Old 08-18-2009, 07:04 AM   #3
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 901

Hah, I just answered this question yesterday for one of my customers. While the Broad Institute's definition is good, I find it overly mathematical. I sent my customer this quote from a paper on rice by Wing & Jackson (and others -- I am too lazy to look up the original paper):

"Contig or scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value".

Hope that helps.
Old 08-18-2009, 08:32 AM   #4
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 86

yah, I'd have to say that I don't calculate N50 in my own scripts in the way the Broad quote above describes. Rather, I sort the list of lengths, then, starting with the longest sequence, I subtract one sequence length at a time from the total length (total number of bases), until I reach one half of the total length. The sequence length I just subtracted (or the longest remaining length .. one could quibble) is the N50.
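A sketch of that cumulative procedure (equivalently, summing from the longest contig down; the function name is mine):

```python
def n50_cumulative(lengths):
    """Sum lengths from longest to shortest; the length that pushes the
    running total to at least half the assembly size is the N50."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

print(n50_cumulative([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

Note that on the Broad example {2, 2, 2, 3, 3, 4, 8, 8} this returns 8, not 6: the two procedures genuinely differ, which is exactly the quibble discussed further down this thread.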
Old 04-15-2010, 03:32 AM   #5
Lordish
Junior Member
 
Location: Paris

Join Date: Apr 2010
Posts: 1

Hello there

well I know that this might be a late reply; however, I was looking for a definition for N50 and I couldn't really understand what it means until I found a good definition that I felt like sharing with you:

N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value.

This has been copied from an article by Jason Miller et al., published in 2010 and entitled "Assembly algorithms for next-generation sequencing data".
Old 05-11-2011, 08:44 AM   #6
saml
Junior Member
 
Location: Uppsala, Sweden

Join Date: Oct 2010
Posts: 4
50% of total bp length of assembly, or genome?

We had a discussion on this in a uni lab today. There was some confusion about what the 50% is referring to: is it 50% of the (base pair) length of all contigs in the assembly, or of the genome size (when known)?

Also, the Wikipedia page states (as one of two definitions) that it is compared to the genome size.

Is this plain wrong, or is N50 used with this definition (50% referring to genome size, when this is known) in any literature, for example?

Old 05-11-2011, 12:41 PM   #7
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 86

I think there's some diversity in the literature, but I wouldn't claim to have seen all of the places its definition is discussed. The Wikipedia article's 1st and 2nd paragraphs are inconsistent, it seems to me. The first defines N50 solely based on a set of sequences, in which case the weighted median (Broad-style) definition would apply. But the second paragraph defines it with reference to the genome length, outside of the set of sequences. In my group, we've been informally calling this the "N(X) statistic."

However, there's also been a lot of discussion generated by our work (with David Haussler's and Ian Korf's labs) on the Assemblathon competition, associated with the Genome Assembly Workshop (attached to the Genomes 10K meeting this past March). We found mentions of "LG50" and "NG50" statistics on the web (don't recall where), where LG50 referred to the minimum length, and NG50 to the number of contigs, in the set of longest contigs whose lengths sum to at least half the genome length (G for genome). I like the L-for-length, N-for-number part, but that didn't catch on in the Assemblathon discussions.
Old 05-12-2011, 03:45 AM   #8
saml
Junior Member
 
Location: Uppsala, Sweden

Join Date: Oct 2010
Posts: 4

Thanks jnfass, for the information! Indeed, the two definitions in the Wikipedia article seem contradictory, and to me the "Broad-style" definition makes more sense as a quality of the assembly (otherwise you're actually incorporating a measure of overall coverage of the genome as well, which is not the intent, no?).

Cheers,
Samuel
Old 08-05-2011, 02:06 AM   #9
vyahhi
Junior Member
 
Location: St. Petersburg / San Diego

Join Date: Feb 2011
Posts: 4

By the way, the definition and example from the Broad Institute site https://www.broad.harvard.edu/crd/wiki/index.php/N50 say different things.

By definition:
Quote:
Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N.
so N50 of {2, 2, 2, 3, 3, 4, 8, 8} is 5. Well, actually 4 + small epsilon, but suppose N50 is an integer.

But by example:
Quote:
This can be found mathematically as follows: Take a list L of positive integers. Create another list L' , which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L' , which is 6.
N50 is 6.

The problem is that, by the definition, it's the next integer after the length of the sequence that contains the 50%'th base (4 + 1), but by the example it's the midpoint between that length and the next one ((4 + 8) / 2).
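The two readings can be checked numerically (a sketch; the function names are mine):

```python
import statistics

def n50_by_definition(lengths):
    """Literal reading of the Broad definition: the smallest integer N
    such that sequences of length < N hold at least 50% of all bases."""
    half = sum(lengths) / 2
    for n in range(1, max(lengths) + 2):
        if sum(x for x in lengths if x < n) >= half:
            return n

def n50_by_example(lengths):
    """The Broad worked example: median of the list in which each
    length n is repeated n times."""
    return statistics.median(n for n in lengths for _ in range(n))

lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50_by_definition(lengths))  # prints 5
print(n50_by_example(lengths))     # prints 6.0
```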

Old 08-05-2011, 08:27 AM   #10
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116

the N50 (L50) should actually be a value in the list of contig lengths -- the Broad "mathematical definition" is faulty, since the median is not even a set member
Old 11-14-2012, 04:06 AM   #11
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 96

Quote:
Originally Posted by Zigster View Post
the N50 (L50) should actually be a value in the list of contig lengths -- the Broad "mathematical definition" is faulty, since the median is not even a set member
If the N50 has to be the length of an actual contig then in the Wikipedia example

Quote:
Originally Posted by jnfass View Post
L = {2, 2, 2, 3, 3, 4, 8, 8}
it should be 8 and not 6 (the two longest contigs contain half of the total assembly).


Similarly, the definition from the assemblathon paper

Quote:
The N50 of an assembly is a weighted median of the lengths of the sequences it contains, equal to the length of the longest sequence s, such that the sum of the lengths of sequences greater than or equal in length to s is greater than or equal to half the length of the genome being assembled. As the length of the genome being assembled is generally unknown, the normal approximation is to use the total length of all of the sequences in an assembly as a proxy for the denominator.
also sounds quite ambiguous to me because the "weighted median" is not necessarily "the length of the longest sequence".
Old 06-12-2013, 06:34 PM   #12
danwiththeplan
Member
 
Location: Auckland

Join Date: Sep 2011
Posts: 44

I also don't see why the N50 (as defined) is a length and, in a paper I'm doing right now, the L50 (as defined) is a number. It's totally counterintuitive and confusing. The one starting with L is a number and the one starting with N is a length? Whaaa?
Old 06-13-2013, 08:52 AM   #13
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 901

@danwiththeplan: Your question doesn't make much sense, since in this forum thread N50 is also known as L50. Since they are one and the same, there is no 'Whaaa?' involved -- they both represent a length in bases. Now if you have a paper you are reading that uses L50 in a different sense, then give us a reference to the paper so we can see what they mean. OTOH, if you are writing the paper (your words ".. in a paper I'm doing .." are confusing as to whether you are reading, reviewing or writing it), then you can make up a new definition of L50 if you wish, provided it doesn't conflict badly with other definitions.
Old 07-08-2013, 07:58 AM   #14
kbradnam
Member
 
Location: Davis, CA

Join Date: May 2011
Posts: 34

Maybe it is worth remembering why we even have N50 as a statistic. The final assembly for any genome project, i.e. the one that is described in a paper or uploaded to a database, is not necessarily the biggest assembly that was generated.

This is because there will always be contigs that are too short to be of any use to anyone. In the pre-NGS era of assembly, these contigs could potentially be represented by a single Sanger read that didn't align to anything else.

However, one might consider that there is still some utility in including a long Sanger read in an assembly, even if it doesn't overlap with any other reads. After quality trimming, there are many Sanger reads that are over 1,000 bp in length and this is long enough to contain an exon or two, or even an entire gene (depending on the species). But what if a contig was 500 bp? Or 250 bp? Or 100 bp? Where do you draw the line?

Clearly it is not desirable to fill up an assembly with an abundance of overly short sequences, even if they represent accurate, and unique, biological sequences from the genome of interest. So it has always been common to remove very short reads from an assembly. The problem is that different groups might use very different criteria for what to keep and what to exclude. Imagine a simple assembly consisting of the following contigs:

+ 5,000 x 100 bp
+ 100 x 10 Kbp
+ 10 x 1 Mbp
+ 1 x 10 Mbp

Now let's say that Distinguished Bioinformatics Center #1 decides to produce an assembly (DBC1) that includes all of these contigs. However, another DBC decides to make another assembly from the same data, but they remove the 100 bp contigs. A third DBC decides to also remove the 10 Kbp contigs. What does this do to the *mean* contig length of each assembly?

# Mean contig lengths #
+ DBC1 = 4,207 bp
+ DBC2 = 189,189 bp
+ DBC3 = 1,818,182 bp

Hopefully it becomes obvious that if you only considered mean contig length, it would be extremely easy to 'game' the system by deliberately excluding shorter contigs (no matter their utility) just to increase the average contig length. But what do we get if we instead rely on N50?

# N50 contig lengths #
+ DBC1 = 1 Mbp
+ DBC2 = 1 Mbp
+ DBC3 = 10 Mbp

This now reduces the overall discrepancies, and puts DBC1 and DBC2 on an equal footing. But you might still, naively, conclude that DBC3 is the better assembly, and if you were extremely wide-eyed and innocent, then maybe you would conclude that DBC3 was *ten* times better than the other two assemblies.

So N50 does a better, though still imperfect, job of avoiding the dangers inherent in relying on the mean length. In some ways, the actual method you use to calculate N50 does not matter too much, as long as you use the same method when comparing all assemblies. Back in the day of Sanger-sequence-derived assemblies, it was fairly common to see assembly statistics report not just N50, but everything from N10 through to N90. This gives you a much better insight into the overall variation in contig (or scaffold) lengths.

In the Assemblathon and Assemblathon 2 contests, we actually plotted all N(x) values to see the full distribution of contig lengths. Except we didn't use the N50 metric: we used something called NG50. This normalizes for differences in overall assembly sizes by asking 'at what contig/scaffold length — from a set of sorted scaffold lengths — do we see a sum length that is greater than 50% of the **estimated or known genome size**?'

Returning to my earlier fictional example, let's assume that the estimated size of the genome being assembled was 25 Mbp. This means we want to see what contig length takes us past 12.5 Mbp (when summing all contig lengths from longest to shortest):

# NG50 contig lengths #
+ DBC1 = 1 Mbp
+ DBC2 = 1 Mbp
+ DBC3 = 1 Mbp

We now see parity between all assemblies. There are still differences, but these differences just reflect variations in short sequences which may, or may not, be of any utility to the end user of the assembly. In my mind, this gives us a much fairer way of comparing the assemblies, at least in terms of their contig lengths.
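The whole worked example can be reproduced in a short script (a sketch; the helper and variable names are mine). N50 is `nx(lengths, sum(lengths))` and NG50 is `nx(lengths, genome_size)`:

```python
def nx(lengths, target):
    """Return the contig length at which a longest-first running sum of
    `lengths` first reaches half of `target` (None if it never does)."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target / 2:
            return length
    return None

# The three fictional assemblies (contig lengths in bp)
dbc1 = [100] * 5000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000]
dbc2 = dbc1[5000:]   # DBC2 drops the 100 bp contigs
dbc3 = dbc2[100:]    # DBC3 also drops the 10 Kbp contigs
genome = 25_000_000  # estimated genome size for NG50

for name, asm in [("DBC1", dbc1), ("DBC2", dbc2), ("DBC3", dbc3)]:
    mean = round(sum(asm) / len(asm))
    print(name, mean, nx(asm, sum(asm)), nx(asm, genome))
```

The printed means, N50s and NG50s match the three tables above: the mean varies wildly, N50 still favours DBC3, and NG50 shows parity.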

## In summary ##

N50 is just a better, though still flawed, way of calculating an average sequence length from a set of sequences. It can still be biased, and you really should consider what the estimated/known genome size is (when comparing two or more assemblies of the same species), and/or look to see the full distribution of sequence lengths (e.g. with an NG(x) graph).