SEQanswers Unix one-line command for N50 size


 10-16-2009, 03:40 AM #1 seqseq Junior Member   Location: Spain Join Date: Oct 2009 Posts: 7 Unix one-line command for N50 size Is there a good one-liner for calculating the N50 size of a list of contig sizes? Average size is easy with awk but I didn't manage to find an easy way for the N50 size so far. Thanks!
 10-16-2009, 04:20 AM #2 quinlana Senior Member   Location: Charlottesville Join Date: Sep 2008 Posts: 119 If you have the mean, you can clearly grep/awk your relevant file for the correct data to compute N50. Why not just write a little generic statistics program that computes descriptive stats for a file or for data passed via stdin? Here's a Python function for median (or you could use numpy) that you could dump into a little script. This would compute the N50...
Code:
```
def median(numlist):
    """
    Abstract: Returns the median value of the passed list of numbers.
    Usage:    median(numlist)
    """
    numlist.sort()
    # take the mean of the two middle elements if there are an even number
    # of elements.  otherwise, take the middle element
    if len(numlist) % 2 == 0:
        medianpos = len(numlist)/2
        median = float(numlist[medianpos] + numlist[medianpos-1]) / 2
    else:
        medianpos = len(numlist)/2
        median = numlist[medianpos]
    return median
```
The upside of writing such a program is that, if written correctly, it can be re-used for asking basic questions of your data over and over again. Good luck, Aaron
 10-16-2009, 05:02 AM #3 seqseq Junior Member   Location: Spain Join Date: Oct 2009 Posts: 7 Well, I am looking for a Unix command-line solution... (btw, if I understand your code correctly, it calculates the median, but the median is not the same as the N50 size)
 10-16-2009, 05:07 AM #4 quinlana Senior Member   Location: Charlottesville Join Date: Sep 2008 Posts: 119 My apologies, my recollection was that N50 was merely the median size. By actually looking up the definition, I see it's not.
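To see concretely that the two statistics disagree, here is a small sketch (toy lengths, not from the thread; the helper names `median` and `n50` are made up, and `n50` uses the common cumulative-sum definition: the length of the contig at which the running total over the largest contigs first reaches half of all assembled bases):

```python
# Toy illustration: median and N50 of the same contig-length list differ.
def median(lengths):
    s = sorted(lengths)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2.0

def n50(lengths):
    # Cumulative-sum definition: walk the contigs from largest to
    # smallest and return the length at which the running total
    # first covers half of all bases.
    half = sum(lengths) / 2.0
    running = 0
    for x in sorted(lengths, reverse=True):
        running += x
        if running >= half:
            return x

lengths = [2, 2, 2, 3, 3, 4, 8, 8]   # 32 bases in total
print(median(lengths))  # 3.0
print(n50(lengths))     # 8 (the two 8s alone cover 16 of 32 bases)
```

Note that the weighted-median style definition used later in this thread can give yet another value for the same list.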
 10-16-2009, 05:09 AM #5 maubp Peter (Biopython etc)   Location: Dundee, Scotland, UK Join Date: Jul 2009 Posts: 1,543 You can call Python scripts from the command line, and pipe stdin and stdout too - great for use at the Unix command line. Are you planning to supply the list of lengths as space separated integers? e.g. "120 2500 745 63 ..."? That would be easy to parse in Python from a string into a list of integers.
 10-16-2009, 05:18 AM #6 seqseq Junior Member   Location: Spain Join Date: Oct 2009 Posts: 7 It comes up on several occasions, e.g. a file containing
Code:
```
contig1 length=1000 reads=582
contig2 length=1500 reads=900
contig3 length=1400 reads=766
...
```
I want to do something like:
Code:
```
$ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size
```
The files can have different formats; I just have a sorted list of contig sizes at some point, and it would be useful to have a command to get the N50 size.
 10-16-2009, 05:29 AM #7 maubp Peter (Biopython etc)   Location: Dundee, Scotland, UK Join Date: Jul 2009 Posts: 1,543 OK - so the stdin is one integer per line. How about a python script like this, see also http://seqanswers.com/forums/showthread.php?t=2332
Code:
```
#!/usr/bin/python
import sys

def N50(numlist):
    """
    Abstract: Returns the N50 value of the passed list of numbers.
    Usage:    N50(numlist)
    Based on the Broad Institute definition:
    https://www.broad.harvard.edu/crd/wiki/index.php/N50
    """
    numlist.sort()
    newlist = []
    for x in numlist:
        newlist += [x]*x
    # take the mean of the two middle elements if there are an even number
    # of elements.  otherwise, take the middle element
    if len(newlist) % 2 == 0:
        medianpos = len(newlist)/2
        return float(newlist[medianpos] + newlist[medianpos-1]) / 2
    else:
        medianpos = len(newlist)/2
        return newlist[medianpos]

assert N50([2, 2, 2, 3, 3, 4, 8, 8]) == 6

lengths = []
for line in sys.stdin:
    lengths.append(int(line))
print N50(lengths)
```
Then at the Unix command line, you could use it like this:
Code:
```
$ grep "^>" 454AllContigs.fna | cut -d"=" -f2 | cut -d" " -f1 | ./stdin_N50.py
386
```
Last edited by maubp; 10-16-2009 at 06:23 AM. Reason: Switched from median to N50
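A scaling caveat on the script above: `newlist += [x]*x` materialises one list element per assembled base, which gets expensive for real assemblies. A sketch of an equivalent computation (same Broad-style weighted-median definition, same test case; the function name is made up) that only walks the sorted lengths:

```python
def n50_weighted_median(lengths):
    """N50 as the weighted median of contig lengths (Broad-style),
    computed without expanding one list element per base."""
    lengths = sorted(lengths)
    total = sum(lengths)
    # 0-based positions of the two middle bases of the virtual
    # expanded list [x repeated x times for each length x]
    targets = [(total - 1) // 2, total // 2]
    picked, csum, i = [], 0, 0
    for x in lengths:
        csum += x
        # this contig covers virtual positions [csum - x, csum - 1]
        while i < 2 and csum > targets[i]:
            picked.append(x)
            i += 1
        if i == 2:
            break
    if picked[0] == picked[1]:
        return picked[0]
    return (picked[0] + picked[1]) / 2.0
```

For the thread's test case `[2, 2, 2, 3, 3, 4, 8, 8]` this also returns 6.0.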
 10-16-2009, 06:06 AM #8 seqseq Junior Member   Location: Spain Join Date: Oct 2009 Posts: 7 What about the awk experts out there? Any ideas for a direct command-line version without a script? (You have to store the script somewhere, and if you are using another computer you don't have it at hand... I actually really prefer code on the command line instead of applying a script - if possible.)
 10-16-2009, 06:24 AM #9 maubp Peter (Biopython etc)   Location: Dundee, Scotland, UK Join Date: Jul 2009 Posts: 1,543 I rather suspect that an awk solution would be so long that it too would need to be stored in a script. A one-liner would be impressive.
 10-18-2009, 11:20 PM #10 sklages Senior Member   Location: Berlin, DE Join Date: May 2008 Posts: 628
Quote:
 Originally Posted by seqseq It comes up on several occasions, e.g. a file containing contig1 length=1000 reads=582 contig2 length=1500 reads=900 contig3 length=1400 reads=766 ... I want to do something like: $ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.
Code:
```
perl -lanF'[\s=]' -e 'push(@contigs,$F[2]);$total+=$F[2];END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50  : $L";exit;}}}' file
```
This is a quick shot; there might be a shorter solution. Test it to see whether I calculate correctly...

cheers,
Sven

 10-20-2009, 07:33 AM #11 nhansen Junior Member   Location: Maryland, USA Join Date: Sep 2009 Posts: 6 Here's a stab at a one-liner:
Code:
```
sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
```
 10-20-2009, 07:44 AM #12 maubp Peter (Biopython etc)   Location: Dundee, Scotland, UK Join Date: Jul 2009 Posts: 1,543 Impressive! Assuming no errors, these perl and awk solutions do qualify as "Unix one line" solutions, but they are too long to be typed by hand, and cutting-and-pasting is asking for errors. I personally would go with a short script (in the user's language of choice), probably reading the FASTA file directly.
 10-20-2009, 11:12 AM #13 sklages Senior Member   Location: Berlin, DE Join Date: May 2008 Posts: 628
Quote:
 Originally Posted by maubp I personally would go with a short script (in the user's language of choice), probably reading the FASTA file directly.
I totally agree, that's probably the way we all do it :-)

cheers,
Sven

 10-20-2009, 11:13 AM #14 nhansen Junior Member   Location: Maryland, USA Join Date: Sep 2009 Posts: 6 I actually prefer a script too, both for readability and accuracy, but sometimes it's fun to exercise your shell skills. :-)
 10-21-2009, 10:27 AM #15 seqseq Junior Member   Location: Spain Join Date: Oct 2009 Posts: 7
Quote:
 Originally Posted by nhansen sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
Great! The first part could be shortened to: awk '{sum += $0; print $0, sum}'

So in total:
Code:
```
sort -rn contig_lengths.txt | awk '{sum += $0; print $0, sum}' | tac | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
```

 10-21-2009, 10:47 AM #16 sklages Senior Member   Location: Berlin, DE Join Date: May 2008 Posts: 628 The perl stuff is faster, at least if you have a few hundred thousand contigs. Sven
 10-23-2009, 09:26 AM #17 seb567 Senior Member   Location: Québec, Canada Join Date: Jul 2008 Posts: 260 There is a program called getN50 in the amos package. http://amos.sf.net/ http://downloads.sourceforge.net/pro...se_mirror=iweb
 10-23-2009, 12:05 PM #18 flxlex Moderator   Location: Oslo, Norway Join Date: Nov 2008 Posts: 415 A single awk, no pipes (except between sort and awk), and thereby somewhat shorter. Note sort -n, not sort -rn:
Code:
```
sort -n contig_lengths.txt | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'
```
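A quick sanity check of this single-awk version on toy data (file name and lengths are made up): with 3900 total bases, the ascending cumulative sums are 1000, 2400, 3900, and 2400 is the first to exceed half the total, so it should print 1400:

```shell
# Three toy contig lengths, one per line
printf '1000\n1500\n1400\n' > toy_lengths.txt
# Ascending sort, then accumulate in awk until past half the total
sort -n toy_lengths.txt | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'
# -> 1400
```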
 11-03-2009, 09:42 AM #19 eslondon Member   Location: London, UK Join Date: Jul 2009 Posts: 21 Perl clearly 10 times faster...
Code:
```
macbook:~$ time perl -ne 'chomp(); push(@contigs,$_);$total+=$_;END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50 : $L\n";exit;}}}' contigs_100.lines
TOTAL: 59789620
N50 : 212

real    0m0.444s
user    0m0.348s
sys     0m0.038s

macbook:~$ time sort -n contigs_100.lines | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'
212

real    0m5.502s
user    0m4.055s
sys     0m0.560s
```
 11-04-2009, 01:25 AM #20 flxlex Moderator   Location: Oslo, Norway Join Date: Nov 2008 Posts: 415 Cool. Well, consider it an exercise in awk rather than an attempt to beat perl...