Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tool for finding unique sets of sequences

    I've two sets of large number of proteins( in the order 100K) , and wish to find out unique proteins belonging to each set.

    Is there any tool for doing it fast?

    Thanks

  • #2
    You could do a reciprocal blat and select those that do not show a hit against the other set over the full length, if "unique" is exactly what you want.

    If parts of the proteins are going to be common (domains) then you may have to do some additional parsing/work.

    CD-HIT may also work: http://weizhong-lab.ucsd.edu/cd-hit/
    Last edited by GenoMax; 05-30-2014, 08:38 AM.

    Comment


    • #3
      Code:
      cat file1 file2 | sort | uniq -u
      will give you exactly unique lines from a file, and it's about as fast as you can get. But you probably don't want to do that because a naive sort won't preserve other information like sequence names, etc..

      Comment


      • #4
        Woa, do you want proteins that only appear once in a given file, or proteins that only appear in one file or the other?

        Both are easily doable in bash. If it's the first one, then gringer's approach with uniq -u will work.

        If you want the latter, you'll need to do a bit more work. First, you need to sort both files and (after ensuring that they consist only of one entry for a given protein, which can also be done with uniq) then use comm with the -3 flag: this will output two columns, with lines unique to the first and second file (with shared lines suppressed by -3).

        Comment


        • #5
          Thanks for your answers.
          I've two sequences sets derived from two databases with different fasta headers, and I wish to find unique sequences belonging to each of these databases.I wish to retain the Fasta header information for the unique sequences as well.
          I can write a perl hash based script for such a comparison but that'll be quite slow I beleive.
          Any other options?

          Comment


          • #6
            Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

            For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

            Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.

            Comment


            • #7
              Originally posted by JamieHeather View Post
              Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

              For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

              Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.
              Thanks for your reply .

              I'll consider only exact matches and hence I'll consider ProteinA and ProteinB unique to their corresponding databases.

              and plan to write something like this in perl:

              use List::Util qw(first);

              if( !defined ( first {$_ eq $seqA} @sorted_seqB ) ){.....}

              #$seqA is a sequence in DatabaseA and $_ is one of the sequences of DatbaseB

              To speed up I might try the MCE module

              use MCE::Grep;

              if(! mce_grep {$_ eq $seqA} @sorted_seqB){....}
              Last edited by woa; 06-03-2014, 08:13 AM.

              Comment


              • #8
                I'm not very fluent in Perl, but I think that might do the trick.

                For posterity, my approach in python would just be to make a defaultdict for each database, with the sequence strings as keys (and probably the SeqIO fasta or fasta ID instance for the value), then take do db1.keys().difference(db2.keys()) to find those sequences unique to each database.

                Comment


                • #9
                  I just cooked up some rough code. It needs to store sequence and ID in memory for all sequences in the first file, but should be otherwise fairly space/time efficient. Here's an overview of how it does it:
                  1. Read all lines from file1, store sequence, id as hash (hash to sequence)
                  2. Read all lines from file2, print out non-matching sequence/id and delete any matching sequence/id
                  3. Print remaining non-matching sequence/id from file1


                  Code:
                  #!/usr/bin/perl
                  use warnings;
                  use strict;
                  
                  open(my $f1, "< file1.fasta") or die("Cannot open file1");
                  open(my $f2, "< file2.fasta") or die("Cannot open file2");
                  
                  my %seenSequences = ();
                  my $sequence = "";
                  my $seqID = "";
                  
                  while(<$f1>){
                    chomp;
                    if(/^>(.*)$/){
                      $seenSequences{$sequence} = $seqID if $seqID ne "";
                      $sequence = "";
                      $seqID = $1;
                    } else {
                      $sequence .= $_;
                    }
                  }
                  $seenSequences{$sequence} = $seqID if $seqID ne "";
                  close($f1);
                  
                  $sequence = "";
                  $seqID = "";
                  while(<$f2>){
                    chomp;
                    if(/^>(.*)$/){
                      if(($seqID ne "") && !exists($seenSequences{$sequence})){
                        printf(">%s [2]\n%s\n", $seqID, $sequence);
                      } else {
                        delete($seenSequences{$sequence});
                      }
                      $seqID = $1;
                      $sequence = "";
                    } else {
                      $sequence .= $_;
                    }
                  }
                  if(($seqID ne "") && !exists($seenSequences{$sequence})){
                    printf(">%s [2]\n%s\n", $seqID, $sequence);
                  } else {
                    delete($seenSequences{$sequence});
                  }
                  close($f1);
                  
                  while(my ($seq, $id) = each(%seenSequences)){
                    printf(">%s [1]\n%s\n", $id, $seq);
                  }
                  Here's an example run:
                  Code:
                   ./141964.pl > out.fasta && head *.fasta
                  ==> file1.fasta <==                                                                                                                                                                                                               
                  >1                                                                                                                                                                                                                                
                  PRTEINEIN                                                                                                                                                                                                                         
                  >3
                  PRTEINTHREE
                  
                  ==> file2.fasta <==
                  >1
                  PRTEINEIN
                  >2
                  PRTEINNI
                  
                  ==> out.fasta <==
                  >2 [2]
                  PRTEINNI
                  >3 [1]
                  PRTEINTHREE
                  Last edited by gringer; 06-03-2014, 04:08 PM.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  18 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  22 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  16 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  47 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X