SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Geneid can predict differnet genes when the length of input sequence are different? bio_tt Bioinformatics 0 01-06-2012 05:47 PM
Illumina adaptor sequence filter slny Bioinformatics 1 04-21-2011 01:25 PM
Roche gsMapper output exon contigs rather than full-length sequence? sulicon Bioinformatics 0 02-28-2011 05:51 PM
samtools view error: CIGAR and sequence length are inconsistent (tophat/bowtie) glacierbird Bioinformatics 2 06-29-2010 02:58 AM
Help:primer and low complexity sequence filter alvin1982 Illumina/Solexa 0 04-21-2010 08:05 PM

Reply
 
Thread Tools
Old 09-09-2011, 01:13 AM   #1
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 264
Default filter sequence by length

Hi,

To filter sequencesby length from a fasta file, I'm using now a bioperl script. But is there an another method (faster) ?

here's my little script


Code:
#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;


my $file = $ARGV[0]; 
my $min = $ARGV[1];
my $max = $ARGV[2];
my $out = $ARGV[3];

open (FILE, ">>$out") or die ("Error : Cannot open file $out for writing..!\n");

my $seq_in  = Bio::SeqIO->new( -format => 'fasta',-file => $file);

while( my $seq1 = $seq_in->next_seq() ) {	
	
	my $id  = $seq1->primary_id;
	chomp $id;
	my $seq = $seq1->seq;
	chomp $seq;
	my $lseq = length($seq);
	if($lseq>=$min && $lseq <=$max){
		print FILE ">",$id,"\n",$seq,"\n";	
	}
}
NicoBxl is offline   Reply With Quote
Old 09-09-2011, 01:22 AM   #2
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default

You could try Galaxy:

http://main.g2.bx.psu.edu/root?tool_...lter_by_length

It's very fast and you can do different type of analysis.
Giorgio C is offline   Reply With Quote
Old 09-09-2011, 01:23 AM   #3
NicoBxl
not just another member
 
Location: Belgium

Join Date: Aug 2010
Posts: 264
Default

Thanks,

is it possible to have this tool in command line ?
NicoBxl is offline   Reply With Quote
Old 09-09-2011, 01:57 AM   #4
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default

From Galaxy i don't think is possible. But probably if you search you may find a similar script.
Giorgio C is offline   Reply With Quote
Old 09-09-2011, 02:27 AM   #5
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

With Biopieces (www.biopieces.org) you can do:

Code:
read_fasta -i test.fna | grab -e 'SEQ_LEN > 10' | grab -e 'SEQ_LEN <= 100' | write_fasta -x
Cheers,


Martin
maasha is offline   Reply With Quote
Old 09-09-2011, 10:08 AM   #6
tnabtaf
Member
 
Location: Oregon

Join Date: Jan 2011
Posts: 53
Default

Quote:
Originally Posted by Giorgio C View Post
From Galaxy i don't think is possible. But probably if you search you may find a similar script.
This is a command line tool that is included in the Galaxy distribution. I've included the code below.

To get any tool that is included with Galaxy, follow the instructions at http://getgalaxy.org/. Tools are defined in the tools directory. Many will have dependencies (although the one below does not).

Code:
#!/usr/bin/env python
"""
Input: fasta, minimal length, maximal length
Output: fasta
Return sequences whose lengths are within the range.
"""

import sys, os

assert sys.version_info[:2] >= ( 2, 4 )

def stop_err( msg ):
    sys.stderr.write( msg )
    sys.exit()

def __main__():
    input_filename = sys.argv[1]
    try:
        min_length = int( sys.argv[2] )
    except:
        stop_err( "Minimal length of the return sequence requires a numerical value." )
    try:
        max_length = int( sys.argv[3] )
    except:
        stop_err( "Maximum length of the return sequence requires a numerical value." )
    output_filename = sys.argv[4]
    output_handle = open( output_filename, 'w' )
    tmp_size = 0 #-1
    tmp_buf = ''
    at_least_one = 0
    for line in file(input_filename):
        if not line or line.startswith('#'):
            continue
        if line[0] == '>':
            if min_length <= tmp_size <= max_length or (min_length <= tmp_size and max_length == 0):
                output_handle.write(tmp_buf)
                at_least_one = 1
            tmp_buf = line
            tmp_size = 0                                                       
        else:
            if max_length == 0 or tmp_size < max_length:
                tmp_size += len(line.rstrip('\r\n'))
                tmp_buf += line
    # final flush of buffer
    if min_length <= tmp_size <= max_length or (min_length <= tmp_size and max_length == 0):
        output_handle.write(tmp_buf.rstrip('\r\n'))
        at_least_one = 1
    output_handle.close()
    if at_least_one == 0:
        print "There is no sequence that falls within your range."

if __name__ == "__main__" : __main__()
tnabtaf is offline   Reply With Quote
Old 09-09-2011, 12:00 PM   #7
Giorgio C
Member
 
Location: ITALY

Join Date: Oct 2010
Posts: 89
Default

Very useful. Thanks
Giorgio C is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:55 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO