SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
N50 and N90 contig size refer to? edge Bioinformatics 27 07-13-2015 03:13 PM
Want to use extract_genomic_dna in command line louis7781x Bioinformatics 2 12-04-2011 05:51 AM
SAMtools command line ??? Pawan Noel Bioinformatics 6 11-16-2010 10:42 AM
Tophat command line options ice RNA Sequencing 6 09-02-2010 03:25 PM
SIFT on the command line lamasmi Bioinformatics 2 08-17-2010 09:32 AM

Reply
 
Thread Tools
Old 10-16-2009, 03:40 AM   #1
seqseq
Junior Member
 
Location: Spain

Join Date: Oct 2009
Posts: 7
Default Unix one-line command for N50 size

Is there a good one-liner for calculating the N50 size of a list of contig sizes? Average size is easy with awk but I didn't manage to find an easy way for the N50 size so far. Thanks!
seqseq is offline   Reply With Quote
Old 10-16-2009, 04:20 AM   #2
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

If you have the mean, you can clearly grep/awk your relevant file for the correct data to compute N50. Why not just write a little generic statistics program that computes descriptive stats for a file or for data passed via stdin? Here's a Python function for median (or you could use numpy) that you could dump into a little script. This would compute the N50...

PHP Code:
def median (numlist):
    
"""
    Abstract: Returns the median value of the passed list of numbers. 
    Usage:    median(numlist)
    """

    
numlist.sort()
    
# take the mean of the two middle elements if there are an even number
    # of elements.  otherwise, take the middle element
    
if len(numlist) % == 0:
        
medianpos len(numlist)/2  
        median 
float(numlist[medianpos] + numlist[medianpos-1]) /2
    
else:
        
medianpos len(numlist)/2
        median 
numlist[medianpos]
    return 
median 
The upside of writing such a program is that, if written correctly, it can be re-used for asking basic questions of your data over and over again.

Good luck,
Aaron
quinlana is offline   Reply With Quote
Old 10-16-2009, 05:02 AM   #3
seqseq
Junior Member
 
Location: Spain

Join Date: Oct 2009
Posts: 7
Default

Well, I am looking for a Unix command-line solution.. (btw if I understand your code correctly it is calculating the median, but median is not the same as N50 size)
seqseq is offline   Reply With Quote
Old 10-16-2009, 05:07 AM   #4
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

My apologies, my recollection was that N50 was merely the median size. By actually looking up the definition, I see it's not.
quinlana is offline   Reply With Quote
Old 10-16-2009, 05:09 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

You can call Python scripts from the command line, and pipe stdin and stout too - great for use at the Unix command line.

Are you planning to supply the list of lengths as space separated integers? e.g. "120 2500 745 63 ..."? That would be easy to parse in Python from a string into a list of integers.
maubp is offline   Reply With Quote
Old 10-16-2009, 05:18 AM   #6
seqseq
Junior Member
 
Location: Spain

Join Date: Oct 2009
Posts: 7
Default

It comes up in several occasions, e.g. a file containing

contig1 length=1000 reads=582
contig2 length=1500 reads=900
contig3 length=1400 reads=766
...

I want to do something like:
$ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size

The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.
seqseq is offline   Reply With Quote
Old 10-16-2009, 05:29 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

OK - so the stdin is one integer per line. How about a python script like this,
see also http://seqanswers.com/forums/showthread.php?t=2332

Code:
#!/usr/bin/python
import sys

def N50(numlist):
    """
    Abstract: Returns the N50 value of the passed list of numbers. 
    Usage:    N50(numlist)

    Based on the Broad Institute definition:
    https://www.broad.harvard.edu/crd/wiki/index.php/N50
    """
    numlist.sort()
    newlist = []
    for x in numlist :
        newlist += [x]*x
    # take the mean of the two middle elements if there are an even number
    # of elements.  otherwise, take the middle element
    if len(newlist) % 2 == 0:
        medianpos = len(newlist)/2  
        return float(newlist[medianpos] + newlist[medianpos-1]) /2
    else:
        medianpos = len(newlist)/2
        return newlist[medianpos]

assert N50([2, 2, 2, 3, 3, 4, 8, 8]) == 6

lengths = []
for line in sys.stdin :
    lengths.append(int(line))
print N50(lengths)
Then at the Unix command line, you could use it like this:
Code:
$ grep "^>" 454AllContigs.fna | cut -d"=" -f2 | cut -d" " -f1 | ./stdin_N50.py 
386

Last edited by maubp; 10-16-2009 at 06:23 AM. Reason: Switched from median to N50
maubp is offline   Reply With Quote
Old 10-16-2009, 06:06 AM   #8
seqseq
Junior Member
 
Location: Spain

Join Date: Oct 2009
Posts: 7
Default

What about the awk experts out there? Any idea for the direct command-line version without script? (you have to store the script somewhere, and if you are using another computer you don't have it at hand... I actually really prefer code in the command line instead of applying a script - if possible..)
seqseq is offline   Reply With Quote
Old 10-16-2009, 06:24 AM   #9
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

I rather suspect that an awk solution would be so long that it too would need to be stored in a script. A one liner would be impressive.
maubp is offline   Reply With Quote
Old 10-18-2009, 11:20 PM   #10
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Quote:
Originally Posted by seqseq View Post
It comes up in several occasions, e.g. a file containing

contig1 length=1000 reads=582
contig2 length=1500 reads=900
contig3 length=1400 reads=766
...

I want to do something like:
$ cut -f2 file | cut -d"=" -f2 | sort -nr | awk-or-whatever-for-the-N50-size

The files can have different formats, I just have a sorted list of contig sizes at some point and it would be useful to have a command to get the N50 size.

Code:
perl -lanF'[\s=]' -e 'push(@contigs,$F[2]);$total+=$F[2];END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50  : $L";exit;}  ;}}' file
This is a quick shot. There might be shorter solution. Test it to see wether I calculate correctly ..

cheers,
Sven
sklages is offline   Reply With Quote
Old 10-20-2009, 07:33 AM   #11
nhansen
Junior Member
 
Location: Maryland, USA

Join Date: Sep 2009
Posts: 6
Default

Here's a stab at a one-liner :

sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'

For a file "contig_lengths.txt" with one contig size per line, this prints out the length of the N50 contig, along with the number of bases in contigs this size or greater.
nhansen is offline   Reply With Quote
Old 10-20-2009, 07:44 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Impressive

Assuming no errors, these perl and awk solutions do qualify as "Unix one line" solution, but they are too long to be typed by hand, and cut-and-pasting is asking for errors.

I personally would go a short script (in user's language of choice), probably reading the FASTA file directly.
maubp is offline   Reply With Quote
Old 10-20-2009, 11:12 AM   #13
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Quote:
Originally Posted by maubp View Post
I personally would go a short script (in user's language of choice), probably reading the FASTA file directly.
I do totally agree, that's probably the way we all do it :-)

cheers,
Sven
sklages is offline   Reply With Quote
Old 10-20-2009, 11:13 AM   #14
nhansen
Junior Member
 
Location: Maryland, USA

Join Date: Sep 2009
Posts: 6
Default

I actually prefer a script too, both for readability and accuracy, but sometimes it's fun to exercise your shell skills. :-)
nhansen is offline   Reply With Quote
Old 10-21-2009, 10:27 AM   #15
seqseq
Junior Member
 
Location: Spain

Join Date: Oct 2009
Posts: 7
Default

Quote:
Originally Posted by nhansen View Post
sort -rn contig_lengths.txt | awk 'BEGIN {sum=0} {sum += $1; print $1, sum}' | tac - | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
Great! The first part could be shortened into: awk '{sum += $0; print $0, sum}'

So in total:
sort -rn contig_lengths.txt | awk '{sum += $0; print $0, sum}' | tac | awk 'NR==1 {halftot=$2/2} lastsize>halftot && $2<halftot {print} {lastsize=$2}'
seqseq is offline   Reply With Quote
Old 10-21-2009, 10:47 AM   #16
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

The perl stuff is faster, at least if you have a few 100,000 of contigs

Sven
sklages is offline   Reply With Quote
Old 10-23-2009, 09:26 AM   #17
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

There is a program called getN50 in the amos package.

http://amos.sf.net/

http://downloads.sourceforge.net/pro...se_mirror=iweb
seb567 is offline   Reply With Quote
Old 10-23-2009, 12:05 PM   #18
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415
Default

A single awk, no pipes (except between sort and awk), and thereby somewhat shorter. Note sort -n, not sort -rn

sort -n contig_lengths.txt | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i+1;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'
flxlex is offline   Reply With Quote
Old 11-03-2009, 09:42 AM   #19
eslondon
Member
 
Location: London, UK

Join Date: Jul 2009
Posts: 21
Default

Perl clearly 10 times faster...

macbook:~$ time perl -ne 'chomp(); push(@contigs,$_);$total+=$_;END{foreach(sort{$b<=>$a}@contigs){$sum+=$_;$L=$_;if($sum>=$total*0.5){print "TOTAL: $total\nN50 : $L\n";exit;} ;}}' contigs_100.lines
TOTAL: 59789620
N50 : 212

real 0m0.444s
user 0m0.348s
sys 0m0.038s
macbook:~ $ time sort -n contigs_100.lines | awk '{len[i++]=$1;sum+=$1} END {for (j=0;j<i+1;j++) {csum+=len[j]; if (csum>sum/2) {print len[j];break}}}'
212

real 0m5.502s
user 0m4.055s
sys 0m0.560s
eslondon is offline   Reply With Quote
Old 11-04-2009, 01:25 AM   #20
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415
Default

Cool. Well, consider it an exercise in awk rather than an attempt to beat perl...
flxlex is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:29 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO