#1 |
Member
Location: France Join Date: Sep 2011
Posts: 52
|
Hi,
I am having trouble removing redundant sequences from a FASTA file. I want to create an index for Bowtie. I have this file:
Code:
>pi1
ATGCGTGAAATGCAT
>pi2
TGCCCTGATAGGGACCAGTAGAC
>pi3
ATGCGTGAAATGCATA
>pi4
TGCATGACTA
>pi5
ATGCGTGAAATGCATAT
and I would like to obtain:
Code:
>pi2
TGCCCTGATAGGGACCAGTAGAC
>pi4
TGCATGACTA
>pi5
ATGCGTGAAATGCATAT
Thanks in advance for your reply!
#3 |
Member
Location: FL Join Date: Dec 2008
Posts: 26
|
Here you go:
http://www.oceanridgebio.com/
Code:
#!/usr/bin/perl
# by Yonggan Wu
# yongganw@oceanridgebio.com
# Ocean Ridge Bioscience LLC
# Version 01 Date: 2012-02-16 15:01:08
# Version 01 Updates:
# Input file: a fasta file
# Output file: a unique fasta file
# System Requirements: linux, perl
# Usage: perl test.pl infile.fasta
################################################################################
use strict;
use warnings;

# read the file into a hash (sequence => title);
# assumes each sequence sits on a single line
my %seq;
my $title;
my $infile = shift or die "give me an infile\n";
open(IN, "<", $infile) or die "cannot open $infile: $!\n";
while (<IN>) {
    s/[\r\n]+$//;
    next if $_ eq '';
    if (/^>/) {
        $title = $_;
        $title =~ s/^>//;
    }
    else {
        # store upper-cased so the lookup at output time matches
        $seq{uc $_} = $title;
    }
}
close IN;

# remove the redundant sequences: go through them longest first, so a
# sequence is kept only if no already-kept sequence contains it
my @seq = sort { length($b) <=> length($a) } keys %seq;
my @uniqueseq;
foreach my $seq (@seq) {
    my $find = 0;
    foreach (@uniqueseq) {
        # index() does a plain substring search (no regex interpretation)
        if (index($_, $seq) >= 0) {
            $find = 1;
            last;
        }
    }
    push @uniqueseq, $seq if $find == 0;
}

# output the final result
open(OUT, ">", "output.fasta") or die "cannot write output.fasta: $!\n";
foreach (@uniqueseq) {
    print OUT ">$seq{$_}\n$_\n";
}
close OUT;
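To check the logic against the example from post #1, here is a minimal Python sketch of the same idea: drop every sequence that is a substring of another one. It is not the script above, only an illustration; the function names `read_fasta` and `drop_contained` are made up for this sketch, and sequences are processed longest first so each one only has to be compared against sequences already kept.

```python
def read_fasta(text):
    """Parse FASTA text into (title, sequence) pairs; multi-line sequences are joined."""
    records, title, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks).upper()))
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks).upper()))
    return records


def drop_contained(records):
    """Keep only records whose sequence is not a substring of a kept sequence."""
    kept = []
    # Longest first: a sequence can only be contained in one at least as long.
    for title, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in other for _, other in kept):
            kept.append((title, seq))
    return kept


# The input file from post #1.
example = """>pi1
ATGCGTGAAATGCAT
>pi2
TGCCCTGATAGGGACCAGTAGAC
>pi3
ATGCGTGAAATGCATA
>pi4
TGCATGACTA
>pi5
ATGCGTGAAATGCATAT
"""

for title, seq in drop_contained(read_fasta(example)):
    print(f">{title}\n{seq}")
```

On this input the sketch keeps pi2, pi5 and pi4 (in length order), the same set as the requested output. The pairwise comparison is still quadratic in the number of kept sequences, so it will not be fast on very large files.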
#4 |
Member
Location: France Join Date: Sep 2011
Posts: 52
|
Thanks a lot, it's exactly what I need!
#5 |
Member
Location: France Join Date: Sep 2011
Posts: 52
|
I am reopening this thread: I solved the problem with the script posted above, but now I have a bigger file and it is too time-consuming.
I tried to solve my problem with PRINSEQ, using the following command line, but it didn't work; it only removes reads that have exactly the same sequence. Quote:
Here is an example of what I would like to do: INPUT: Quote:
Quote:
#6 |
Senior Member
Location: Vancouver, BC Join Date: Mar 2010
Posts: 275
|
The "1" in the argument
Code:
-derep 123
#7 |
Member
Location: France Join Date: Sep 2011
Posts: 52
|
Yes, I tested the script, but it is really time-consuming for my dataset; it has been running for more than 24 hours ...
#8 |
Member
Location: France Join Date: Sep 2011
Posts: 52
|
I am reopening this thread because I have a similar problem.
I have a family of sequences sharing the same "motif", as in my example below: Quote:
>h
TCCACAACGATGGAAGATGATGA
How can I do this? Are there any tools able to do this?
#9 |
Member
Location: Oregon Join Date: Feb 2011
Posts: 29
|
Grepping for your sequence and sorting the results by length is a start.
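A rough sketch of that suggestion in Python, in case it helps: select the records that contain a given motif, then sort the hits by sequence length. The function name `grep_and_sort`, the motif, and the records `x1` and `x2` are invented for illustration; only the sequence `h` comes from the post above.

```python
def grep_and_sort(records, motif):
    """Return the (title, seq) pairs whose sequence contains motif, longest first."""
    motif = motif.upper()
    hits = [(t, s) for t, s in records if motif in s.upper()]
    return sorted(hits, key=lambda r: len(r[1]), reverse=True)


# Made-up records for illustration; only "h" is taken from the thread.
records = [
    ("h", "TCCACAACGATGGAAGATGATGA"),
    ("x1", "ACAACGATGGAAG"),
    ("x2", "GGGTTTCCCAAA"),
]
motif = "ACGATGG"  # hypothetical motif

for title, seq in grep_and_sort(records, motif):
    print(len(seq), title, seq)
```

This only groups sequences by a motif you already know; it does not discover the motif for you.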
#10 |
Senior Member
Location: germany Join Date: Oct 2009
Posts: 140
|
Given a file with unaligned sequences s1...sm of different lengths, you want to quickly identify the sequences that are substrings of another one.
----------------------------------------------------------
Compute and store the sequence lengths.
Pick some random substrings t1,...,tk of suitable length l, with k and l calculated from m and the sequence lengths for optimal speed and memory usage.
For each (i,j), compute and store whether sequence si contains substring tj. This can be calculated in a single pass through all the sequences and letters.
Walk through the sequence pairs (si,sj) and consider only the pairs where every substring t contained in si is also contained in sj and length(sj) >= length(si).
These pairs then have to be checked in detail to see whether si is a substring of sj, but most pairs are quickly discarded and need not be considered, so I think this would be pretty fast.
{For the definitions of subsequence and substring, see Wikipedia.}
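The steps above could be sketched roughly as follows in Python. This is only one possible reading of the idea, not a tuned implementation: the names `probe_mask` and `drop_contained_with_probes` are hypothetical, the defaults `k=16` and `l=6` are arbitrary rather than calculated from the data, and the probes are drawn as random windows of the input sequences themselves. Each sequence gets a bitmask recording which probes it contains, and the full substring check runs only for pairs that pass the mask filter.

```python
import random


def probe_mask(seq, probes):
    """Bitmask with bit j set when seq contains probes[j]."""
    mask = 0
    for j, probe in enumerate(probes):
        if probe in seq:
            mask |= 1 << j
    return mask


def drop_contained_with_probes(seqs, k=16, l=6, seed=0):
    """Return the sequences that are not substrings of another sequence.

    k (number of probes) and l (probe length) are arbitrary defaults here.
    """
    rng = random.Random(seed)
    # Exact duplicates collapse; longest first, ties broken alphabetically.
    seqs = sorted(set(seqs), key=lambda s: (-len(s), s))
    # Draw probes as random length-l windows of the input sequences.
    long_enough = [s for s in seqs if len(s) >= l]
    probes = []
    for _ in range(k if long_enough else 0):
        s = rng.choice(long_enough)
        start = rng.randrange(len(s) - l + 1)
        probes.append(s[start:start + l])
    masks = {s: probe_mask(s, probes) for s in seqs}

    kept = []
    for s in seqs:  # any sequence containing s was visited before s
        contained = False
        for other in kept:
            # Cheap filter: every probe found in s must also occur in other.
            if masks[s] & ~masks[other]:
                continue
            if s in other:  # full check only for the surviving pairs
                contained = True
                break
        if not contained:
            kept.append(s)
    return kept


# Sequences pi1..pi5 from the first post of the thread.
example_seqs = [
    "ATGCGTGAAATGCAT",
    "TGCCCTGATAGGGACCAGTAGAC",
    "ATGCGTGAAATGCATA",
    "TGCATGACTA",
    "ATGCGTGAAATGCATAT",
]

for s in drop_contained_with_probes(example_seqs):
    print(s)
```

The filter can never discard a true pair, because a substring of a sequence contains no probe that the longer sequence lacks; it only saves time when most pairs differ in at least one probe.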