**sklages** · 10-21-2009, 10:47 AM

The Perl version is faster, at least if you have a few hundred thousand contigs.

Sven

**seb567** · 10-23-2009, 09:26 AM

There is a program called getN50 in the amos package.

http://amos.sf.net/

http://downloads.sourceforge.net/project/amos/amos/2.0.8/amos-2.0.8.tar.gz?use_mirror=iweb

**flxlex** · 10-23-2009, 12:05 PM

A single awk, no pipes (except between sort and awk), and thereby somewhat shorter. Note sort -n, not sort -rn

sort -n contig_lengths.txt | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i+1;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'

**eslondon** · 11-03-2009, 10:42 AM

Perl is clearly about 10 times faster...

macbook:~$ time perl -ne 'chomp(); push(@contigs,$_);$total+=$_;END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50 : $L\n";exit;} ;}}' contigs_100.lines
TOTAL: 59789620
N50 : 212

real 0m0.444s
user 0m0.348s
sys 0m0.038s
macbook:~ $ time sort -n contigs_100.lines | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i+1;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'
212

real 0m5.502s
user 0m4.055s
sys 0m0.560s

**flxlex** · 11-04-2009, 02:25 AM

Cool. Well, consider it an exercise in awk rather than an attempt to beat perl...

**vyahhi** · 08-05-2011, 02:25 AM

maubp's Python code requires a large amount of memory and CPU time if the numbers in the list are huge (it creates X copies of every number X).
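
To illustrate the point: maubp's function expands every length x into x copies and takes the median of the expanded list. A minimal sketch of the same Broad-style weighted median that walks the sorted lengths with a running sum instead of building the expanded list could look like this. The names `broad_n50` and `value_at` are made up for this example, and it assumes positive integer lengths.

```python
def broad_n50(lengths):
    """Median of the list in which each length x appears x times,
    computed without materialising that expanded list."""
    ordered = sorted(lengths)          # ascending copy, input left untouched
    total = sum(ordered)               # size of the (virtual) expanded list

    def value_at(index):
        # Value found at 0-based position `index` of the expanded list.
        seen = 0
        for x in ordered:
            seen += x                  # x occupies positions seen-x .. seen-1
            if seen > index:
                return x
        raise ValueError("no lengths given")

    mid = total // 2
    if total % 2 == 0:
        # Even number of expanded elements: mean of the two middle ones.
        return (value_at(mid - 1) + value_at(mid)) / 2
    return value_at(mid)


print(broad_n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 6.0, same as maubp's assert
```

Memory use here grows with the number of contigs rather than with the sum of their lengths.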

I suggest using this faster Python function for calculating N50, based on this definition http://seqanswers.com/forums/showpos...6&postcount=4:

Code:

def N50(numlist):
  """
  Abstract: Returns the N50 value of the passed list of numbers.
  Usage: N50(numlist)

  Based on the definition from this SEQanswers post
  http://seqanswers.com/forums/showpost.php?p=7496&postcount=4
  (modified Broad Institute's definition
  https://www.broad.harvard.edu/crd/wiki/index.php/N50)

  See SEQanswers threads for details:
  http://seqanswers.com/forums/showthread.php?t=2857
  http://seqanswers.com/forums/showthread.php?t=2332
  """
  numlist.sort(reverse = True)
  s = sum(numlist)
  limit = s * 0.5
  for l in numlist:
    s -= l
    if s <= limit:
      return l

Originally posted by maubp

OK - so the stdin is one integer per line. How about a python script like this,
see also http://seqanswers.com/forums/showthread.php?t=2332

Code:

#!/usr/bin/python
import sys

def N50(numlist):
    """
    Abstract: Returns the N50 value of the passed list of numbers. 
    Usage:    N50(numlist)

    Based on the Broad Institute definition:
    https://www.broad.harvard.edu/crd/wiki/index.php/N50
    """
    numlist.sort()
    newlist = []
    for x in numlist :
        newlist += [x]*x
    # take the mean of the two middle elements if there are an even number
    # of elements.  otherwise, take the middle element
    if len(newlist) % 2 == 0:
        # integer division so the index stays an int on Python 3 as well
        medianpos = len(newlist) // 2
        return float(newlist[medianpos] + newlist[medianpos-1]) / 2
    else:
        medianpos = len(newlist) // 2
        return newlist[medianpos]

assert N50([2, 2, 2, 3, 3, 4, 8, 8]) == 6

lengths = []
for line in sys.stdin :
    lengths.append(int(line))
print(N50(lengths))

Then at the Unix command line, you could use it like this:

Code:

$ grep "^>" 454AllContigs.fna | cut -d"=" -f2 | cut -d" " -f1 | ./stdin_N50.py 
386
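
The same extraction can be sketched in Python instead of grep and cut. This is only an illustration: it assumes header lines of the form `>contig00001  length=386   numreads=12` (the layout the cut commands above rely on), and `header_lengths` is a name invented for the example.

```python
import re

# Assumption: every header carries a "length=<digits>" field.
LENGTH_FIELD = re.compile(r"length=(\d+)")


def header_lengths(lines):
    """Yield the integer after 'length=' for each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            match = LENGTH_FIELD.search(line)
            if match:                      # skip headers without the field
                yield int(match.group(1))


example = [
    ">contig00001  length=386   numreads=12\n",
    "ACGTACGT\n",
    ">contig00002  length=120   numreads=3\n",
]
print(list(header_lengths(example)))  # [386, 120]
```

The resulting list can be passed straight to an N50 function, which avoids the intermediate one-integer-per-line file.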

**ebioman** · 07-30-2014, 06:46 AM

Originally posted by flxlex

A single awk, no pipes (except between sort and awk), and thereby somewhat shorter. Note sort -n, not sort -rn

sort -n contig_lengths.txt | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i+1;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'

So most scripts here actually take the scaffold length at which the cumulative sum is bigger than half the genome size. But should it not be equal AND/OR bigger?

The N50 of an assembly is a weighted median of the lengths of the sequences it contains, equal to the length of the longest sequence s, such that the sum of the lengths of sequences greater than or equal in length to s is greater than or equal to half the length of the genome being assembled.

from the Assemblathon paper
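
Read literally, that definition can be sketched in Python as below. This is a deliberately naive transcription rather than a fast implementation; `n50_by_definition` is a made-up name, and the total of the input lengths stands in for "the length of the genome being assembled", which is an assumption shared with the scripts in this thread.

```python
def n50_by_definition(lengths):
    """Longest length s such that the sequences with length >= s
    together cover at least half of the total length."""
    total = sum(lengths)
    # Try candidate values of s from longest to shortest.
    for s in sorted(set(lengths), reverse=True):
        covered = sum(x for x in lengths if x >= s)
        if 2 * covered >= total:       # integer form of covered >= total / 2
            return s
    return None                        # empty input


print(n50_by_definition([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
print(n50_by_definition([5, 3, 2]))                 # 5
```

The second call is the boundary case: the longest contig alone covers exactly half of the total, so "greater than or equal" makes it the N50.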

**flxlex** · 08-07-2014, 04:02 AM

Originally posted by ebioman

But should it not be equal AND/OR bigger?

Hehe, you're right, that is a mistake. The correct version is

Code:

sort -n contig_lengths.txt | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i+1;j++) {csum+=len[j]; if (csum>=sum/2) {print len[j];break}}}'
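
One thing that may be worth double-checking is how the comparison interacts with the sort direction. The Python sketch below uses a hypothetical helper, `scan`, that mimics the cumulative-sum loop for either direction and either comparison; it assumes positive integer lengths and uses the total input length as the genome size. On an input where a running sum lands exactly on half the total, such as 5, 3, 2, the ascending scan with >= appears to stop one element early, while ascending with > and descending with >= both return the value the Assemblathon wording gives.

```python
def scan(lengths, descending, inclusive):
    """Return the length at which the running sum first reaches
    (inclusive) or first exceeds (strict) half of the total."""
    ordered = sorted(lengths, reverse=descending)
    total = sum(ordered)
    running = 0
    for x in ordered:
        running += x
        if inclusive:
            reached = 2 * running >= total
        else:
            reached = 2 * running > total
        if reached:
            return x
    return None                        # empty input


lengths = [5, 3, 2]                    # total 10, half is exactly 5
print(scan(lengths, descending=True, inclusive=True))    # 5 (sort -rn with >=)
print(scan(lengths, descending=False, inclusive=False))  # 5 (sort -n with >)
print(scan(lengths, descending=False, inclusive=True))   # 3 (sort -n with >=)
```

When no running sum hits the halfway point exactly, all three variants agree, so the difference only shows up in tie cases like this one.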
