Seqanswers Leaderboard Ad

**GenoMax** · 05-30-2014, 08:35 AM

You could do a reciprocal blat and select those that do not show a hit against the other set over the full length, if "unique" is exactly what you want.

If parts of the proteins are going to be common (domains) then you may have to do some additional parsing/work.

CD-HIT may also work: http://weizhong-lab.ucsd.edu/cd-hit/

**gringer** · 05-31-2014, 04:47 AM

Code:

cat file1 file2 | sort | uniq -u

will give you exactly unique lines from a file, and it's about as fast as you can get. But you probably don't want to do that because a naive sort won't preserve other information like sequence names, etc..

**JamieHeather** · 05-31-2014, 05:58 AM

Woa, do you want proteins that only appear once in a given file, or proteins that only appear in one file or the other?

Both are easily doable in bash. If it's the first one, then gringer's approach with uniq -u will work.

If you want the latter, you'll need to do a bit more work. First, you need to sort both files and (after ensuring that they consist only of one entry for a given protein, which can also be done with uniq) then use comm with the -3 flag: this will output two columns, with lines unique to the first and second file (with shared lines suppressed by -3).

**woa** · 06-01-2014, 08:14 PM

Thanks for your answers.
I've two sequences sets derived from two databases with different fasta headers, and I wish to find unique sequences belonging to each of these databases.I wish to retain the Fasta header information for the unique sequences as well.
I can write a perl hash based script for such a comparison but that'll be quite slow I beleive.
Any other options?

**JamieHeather** · 06-02-2014, 03:19 AM

Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.

**woa** · 06-03-2014, 08:00 AM

Originally posted by JamieHeather View Post

Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.

Thanks for your reply .

I'll consider only exact matches and hence I'll consider ProteinA and ProteinB unique to their corresponding databases.

and plan to write something like this in perl:

use List::Util qw(first);

if( !defined ( first {$_ eq $seqA} @sorted_seqB ) ){.....}

#$seqA is a sequence in DatabaseA and $_ is one of the sequences of DatbaseB

To speed up I might try the MCE module

use MCE::Grep;

if(! mce_grep {$_ eq $seqA} @sorted_seqB){....}

**JamieHeather** · 06-03-2014, 01:36 PM

I'm not very fluent in Perl, but I think that might do the trick.

For posterity, my approach in python would just be to make a defaultdict for each database, with the sequence strings as keys (and probably the SeqIO fasta or fasta ID instance for the value), then take do db1.keys().difference(db2.keys()) to find those sequences unique to each database.

**gringer** · 06-03-2014, 04:05 PM

I just cooked up some rough code. It needs to store sequence and ID in memory for all sequences in the first file, but should be otherwise fairly space/time efficient. Here's an overview of how it does it:

Read all lines from file1, store sequence, id as hash (hash to sequence)
Read all lines from file2, print out non-matching sequence/id and delete any matching sequence/id
Print remaining non-matching sequence/id from file1

Code:

#!/usr/bin/perl
use warnings;
use strict;

open(my $f1, "< file1.fasta") or die("Cannot open file1");
open(my $f2, "< file2.fasta") or die("Cannot open file2");

my %seenSequences = ();
my $sequence = "";
my $seqID = "";

while(<$f1>){
  chomp;
  if(/^>(.*)$/){
    $seenSequences{$sequence} = $seqID if $seqID ne "";
    $sequence = "";
    $seqID = $1;
  } else {
    $sequence .= $_;
  }
}
$seenSequences{$sequence} = $seqID if $seqID ne "";
close($f1);

$sequence = "";
$seqID = "";
while(<$f2>){
  chomp;
  if(/^>(.*)$/){
    if(($seqID ne "") && !exists($seenSequences{$sequence})){
      printf(">%s [2]\n%s\n", $seqID, $sequence);
    } else {
      delete($seenSequences{$sequence});
    }
    $seqID = $1;
    $sequence = "";
  } else {
    $sequence .= $_;
  }
}
if(($seqID ne "") && !exists($seenSequences{$sequence})){
  printf(">%s [2]\n%s\n", $seqID, $sequence);
} else {
  delete($seenSequences{$sequence});
}
close($f1);

while(my ($seq, $id) = each(%seenSequences)){
  printf(">%s [1]\n%s\n", $id, $seq);
}

Here's an example run:

Code:

 ./141964.pl > out.fasta && head *.fasta
==> file1.fasta <==                                                                                                                                                                                                               
>1                                                                                                                                                                                                                                
PRTEINEIN                                                                                                                                                                                                                         
>3
PRTEINTHREE

==> file2.fasta <==
>1
PRTEINEIN
>2
PRTEINNI

==> out.fasta <==
>2 [2]
PRTEINNI
>3 [1]
PRTEINTHREE

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Tool for finding unique sets of sequences

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News