SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Extract sequence from multi fasta file with PERL andreitudor Bioinformatics 27 07-07-2019 08:45 AM
extract data from fasta-files with perl?? anna_ Bioinformatics 20 02-17-2016 08:29 AM
Perl: get specific base from FASTA file. njh_TO Bioinformatics 6 02-02-2012 06:34 AM
Redundant(?) report problem in tophat .sam file? Gangcai Bioinformatics 2 03-16-2010 01:05 AM
Script to remove gap-only sites from fasta alignment? kmkocot Bioinformatics 4 02-23-2010 10:50 AM

Reply
 
Thread Tools
Old 02-14-2012, 08:48 AM   #1
StephaniePi83
Member
 
Location: France

Join Date: Sep 2011
Posts: 52
Default perl : Remove redundant feature in fasta file

Hi,

I have some trouble with removing redundant feature in a fasta file. I want to create an indexe for Bowtie.
I have this file :
Code:
>pi1
ATGCGTGAAATGCAT
>pi2
TGCCCTGATAGGGACCAGTAGAC
>pi3
ATGCGTGAAATGCATA
>pi4
TGCATGACTA
>pi5
ATGCGTGAAATGCATAT
But some feature have the same sequence but differ in length. I want to keep only the longer sequence. Finally, i would like to have a file like :
Code:
>pi2
TGCCCTGATAGGGACCAGTAGAC
>pi4
TGCATGACTA
>pi5
ATGCGTGAAATGCATAT
I don't know how i can do this ... maybe someone can help me ?

Thanks in advance for your reply !
StephaniePi83 is offline   Reply With Quote
Old 02-15-2012, 08:24 AM   #2
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

I think PRINSEQ will do exactly what you are wanting. Take a look at the manual under the section about sequence duplications.
SES is offline   Reply With Quote
Old 02-16-2012, 12:13 PM   #3
weasteam
Member
 
Location: FL

Join Date: Dec 2008
Posts: 26
Default

here you go:
http://www.oceanridgebio.com/

Code:
#!/usr/bin/perl
#	by Yonggan Wu
#	yongganw@oceanridgebio.com
#	Ocean Ridge Bioscience LLC
#	Version 01 Date: 2012-02-16 15:01:08 
#	Version 01 Updates: 
#	Input file: a fasta file
#	Output file: a unique fasta file
#	System Requirements: linux, perl
#	Usage: perl test.pl infile.fasta
################################################################################
use strict;
use warnings;
#read the file into a hash
my %seq;
my $title;
my $infile=shift or die "give me a infile\n"
open (IN,"$infile");
while (<IN>){
	$_=~s/\n//;
	$_=~s/\r//;
	if ($_=~/>/){
		$title=$_;
		$title=~s/>//;
	}
	else{
		$seq{$_}=$title;
	}
}
close IN;
#remove the abundant sequences
my @seq=keys (%seq);
my @uniqueseq;
my $find=0;
foreach (@seq){
	$find=0;
	my $seq=uc($_);
	foreach (@uniqueseq){
		if ($seq=~/$_/){
			$_=$seq;#replace with longer seq
			$find=1;
		}
		if ($_=~/$seq/){
			$find=1;
		}
	}
	if ($find==0){
		push @uniqueseq,$seq;
	}
}
#outout the final result
open (OUT,">output.fasta");
foreach (@uniqueseq){
	print OUT ">$seq{$_}\n$_\n";
}
close OUT;
weasteam is offline   Reply With Quote
Old 02-23-2012, 12:31 AM   #4
StephaniePi83
Member
 
Location: France

Join Date: Sep 2011
Posts: 52
Default

thanks a lot, it's exactly what i need !
StephaniePi83 is offline   Reply With Quote
Old 09-04-2012, 04:27 AM   #5
StephaniePi83
Member
 
Location: France

Join Date: Sep 2011
Posts: 52
Default

I re open this thread because i solved the problem with the script bellow, but now, i have bigger file and it's time consuming.
I try to solve my problem with PRINSEQ, with the following comand line, but it did'nt work, it only remove reads that have the exact same sequence
Quote:
perl prinseq-lite.pl -verbose -fasta tmp1.fa -derep 123 -out_format 1
Someone is familiar with this tool ?
Here is an example of what i would like to do :
INPUT :
Quote:
>pi1
AAAAAAAAAATTAAGGGCCAGCTGA
>pi12
AAAAAAAAAATTAAGGGCCAGCTGAA
>pi13
AAAAAAAAACTTGAACTCTACTGC
>pi14
AAAAAAAAATTAAGGGCCAGCTGAA
>pi15
AAAAAAAAATTTTGGATGATCTTAAT
>pi16
AAAAAAAAATTTTGGATGATCTTAATT
>pi17
AAAAAAAACAAGGTCGGCATAAAG
>pi18
AAAAAAAACGAACATGAGAGGATGGA
OUTPUT :
Quote:
>pi12
AAAAAAAAAATTAAGGGCCAGCTGAA
>pi13
AAAAAAAAACTTGAACTCTACTGC
>pi16
AAAAAAAAATTTTGGATGATCTTAATT
>pi17
AAAAAAAACAAGGTCGGCATAAAG
>pi18
AAAAAAAACGAACATGAGAGGATGGA
Thanks in advance for your help !
StephaniePi83 is offline   Reply With Quote
Old 09-04-2012, 07:12 AM   #6
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

The "1" in the argument
Code:
-derep 123
to prinseq-lite.pl says to remove exact duplicates only. As I understand your question now, I don't think prinseq will actually do what you want. Have you tried the perl script posted by weastem above? I have not tested it but that may be a solution for you.
SES is offline   Reply With Quote
Old 09-04-2012, 07:19 AM   #7
StephaniePi83
Member
 
Location: France

Join Date: Sep 2011
Posts: 52
Default

Yes i tested the script but it's really time consuming for my dataset; it is running since more than 24 hours ...
StephaniePi83 is offline   Reply With Quote
Old 12-14-2012, 07:47 AM   #8
StephaniePi83
Member
 
Location: France

Join Date: Sep 2011
Posts: 52
Default

I re opened this thread because i have a similar problem.
I have family of sequence sharing the same "motif" as in my example below :
Quote:
>a
ACTTTCCACAACGATGGAAGATGATGA
>b
ACTTTCCACAACGATGGAAGATGATGAA
>c
ATTCCACAACGATGGAAGATGATGAA
>d
CTTTCCACAACGATGGAAGATGATGAA
>e
NTCCACAACGATGGAAGATGATGAAGA
>f
TACTTTCCACAACGATGGAAGATGATGA
>g
TACTTTCCACAACGATGGAAGATGATGAA
>h
TCCACAACGATGGAAGATGATGA
I would like to get the smallest sequence sharing the motif, in my example :
>h
TCCACAACGATGGAAGATGATGA
How can i do this ? is there any tools able to do this ?
StephaniePi83 is offline   Reply With Quote
Old 12-15-2012, 01:29 AM   #9
tomc
Member
 
Location: Oregon

Join Date: Feb 2011
Posts: 29
Default

grep your sequence & sort results by length is a start
tomc is offline   Reply With Quote
Old 12-15-2012, 07:01 PM   #10
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

Given a file with unaligned sequences s1...sm of different lengths.
You want to quickly identify sequences that are substrings of another one.
----------------------------------------------------------
Compute and store sequence lengths.

Pick some random substrings t1,...,tk of suitable length l ,
k and l to be calculated from m and the lengths for optimal speed and memory usage.

For each (i,j) compute and store whether sequence si contains substring tj.
This can be calculated in one step, by going through all the sequences and letters.

Walk through the sequence pairs (i,j) and consider pairs only where each
substring covered by i is also covered by j and length(j)>=length(i)

These have then to be checked in detail then whether i is a substring of j,
but most pairs are quickly discarded and needn't be considered,
so I think this would be pretty fast.


{for the definitions of subsequence and substring see wikipedia}
gsgs is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:32 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO