Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • perl : Remove redundant feature in fasta file

    Hi,

    I have some trouble with removing redundant feature in a fasta file. I want to create an indexe for Bowtie.
    I have this file :
    Code:
    >pi1
    [B]ATGCGTGAAATGCAT[/B]
    >pi2
    TGCCCTGATAGGGACCAGTAGAC
    >pi3
    [B]ATGCGTGAAATGCAT[/B]A
    >pi4
    TGCATGACTA
    >pi5
    [B]ATGCGTGAAATGCAT[/B]AT
    But some feature have the same sequence but differ in length. I want to keep only the longer sequence. Finally, i would like to have a file like :
    Code:
    >pi2
    TGCCCTGATAGGGACCAGTAGAC
    >pi4
    TGCATGACTA
    >pi5
    [B]ATGCGTGAAATGCAT[/B]AT
    I don't know how i can do this ... maybe someone can help me ?

    Thanks in advance for your reply !

  • #2
    I think PRINSEQ will do exactly what you are wanting. Take a look at the manual under the section about sequence duplications.

    Comment


    • #3
      here you go:
      Frontage is a Pharmaceutical Research & Development Organization focused on advancing drug discovery and development through integrated laboratory, analytical, and clinical services


      Code:
      #!/usr/bin/perl
      #	by Yonggan Wu
      #	[email][email protected][/email]
      #	Ocean Ridge Bioscience LLC
      #	Version 01 Date: 2012-02-16 15:01:08 
      #	Version 01 Updates: 
      #	Input file: a fasta file
      #	Output file: a unique fasta file
      #	System Requirements: linux, perl
      #	Usage: perl test.pl infile.fasta
      ################################################################################
      use strict;
      use warnings;
      #read the file into a hash
      my %seq;
      my $title;
      my $infile=shift or die "give me a infile\n"
      open (IN,"$infile");
      while (<IN>){
      	$_=~s/\n//;
      	$_=~s/\r//;
      	if ($_=~/>/){
      		$title=$_;
      		$title=~s/>//;
      	}
      	else{
      		$seq{$_}=$title;
      	}
      }
      close IN;
      #remove the abundant sequences
      my @seq=keys (%seq);
      my @uniqueseq;
      my $find=0;
      foreach (@seq){
      	$find=0;
      	my $seq=uc($_);
      	foreach (@uniqueseq){
      		if ($seq=~/$_/){
      			$_=$seq;#replace with longer seq
      			$find=1;
      		}
      		if ($_=~/$seq/){
      			$find=1;
      		}
      	}
      	if ($find==0){
      		push @uniqueseq,$seq;
      	}
      }
      #outout the final result
      open (OUT,">output.fasta");
      foreach (@uniqueseq){
      	print OUT ">$seq{$_}\n$_\n";
      }
      close OUT;

      Comment


      • #4
        thanks a lot, it's exactly what i need !

        Comment


        • #5
          I re open this thread because i solved the problem with the script bellow, but now, i have bigger file and it's time consuming.
          I try to solve my problem with PRINSEQ, with the following comand line, but it did'nt work, it only remove reads that have the exact same sequence
          perl prinseq-lite.pl -verbose -fasta tmp1.fa -derep 123 -out_format 1
          Someone is familiar with this tool ?
          Here is an example of what i would like to do :
          INPUT :
          >pi1
          AAAAAAAAAATTAAGGGCCAGCTGA
          >pi12
          AAAAAAAAAATTAAGGGCCAGCTGAA
          >pi13
          AAAAAAAAACTTGAACTCTACTGC
          >pi14
          AAAAAAAAATTAAGGGCCAGCTGAA
          >pi15
          AAAAAAAAATTTTGGATGATCTTAAT
          >pi16
          AAAAAAAAATTTTGGATGATCTTAATT
          >pi17
          AAAAAAAACAAGGTCGGCATAAAG
          >pi18
          AAAAAAAACGAACATGAGAGGATGGA
          OUTPUT :
          >pi12
          AAAAAAAAAATTAAGGGCCAGCTGAA
          >pi13
          AAAAAAAAACTTGAACTCTACTGC
          >pi16
          AAAAAAAAATTTTGGATGATCTTAATT
          >pi17
          AAAAAAAACAAGGTCGGCATAAAG
          >pi18
          AAAAAAAACGAACATGAGAGGATGGA
          Thanks in advance for your help !

          Comment


          • #6
            The "1" in the argument
            Code:
            -derep 123
            to prinseq-lite.pl says to remove exact duplicates only. As I understand your question now, I don't think prinseq will actually do what you want. Have you tried the perl script posted by weastem above? I have not tested it but that may be a solution for you.

            Comment


            • #7
              Yes i tested the script but it's really time consuming for my dataset; it is running since more than 24 hours ...

              Comment


              • #8
                I re opened this thread because i have a similar problem.
                I have family of sequence sharing the same "motif" as in my example below :
                >a
                ACTTTCCACAACGATGGAAGATGATGA
                >b
                ACTTTCCACAACGATGGAAGATGATGAA
                >c
                ATTCCACAACGATGGAAGATGATGAA
                >d
                CTTTCCACAACGATGGAAGATGATGAA
                >e
                NTCCACAACGATGGAAGATGATGAAGA
                >f
                TACTTTCCACAACGATGGAAGATGATGA
                >g
                TACTTTCCACAACGATGGAAGATGATGAA
                >h
                TCCACAACGATGGAAGATGATGA
                I would like to get the smallest sequence sharing the motif, in my example :
                >h
                TCCACAACGATGGAAGATGATGA
                How can i do this ? is there any tools able to do this ?

                Comment


                • #9
                  grep your sequence & sort results by length is a start

                  Comment


                  • #10
                    Given a file with unaligned sequences s1...sm of different lengths.
                    You want to quickly identify sequences that are substrings of another one.
                    ----------------------------------------------------------
                    Compute and store sequence lengths.

                    Pick some random substrings t1,...,tk of suitable length l ,
                    k and l to be calculated from m and the lengths for optimal speed and memory usage.

                    For each (i,j) compute and store whether sequence si contains substring tj.
                    This can be calculated in one step, by going through all the sequences and letters.

                    Walk through the sequence pairs (i,j) and consider pairs only where each
                    substring covered by i is also covered by j and length(j)>=length(i)

                    These have then to be checked in detail then whether i is a substring of j,
                    but most pairs are quickly discarded and needn't be considered,
                    so I think this would be pretty fast.


                    {for the definitions of subsequence and substring see wikipedia}

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 03-27-2024, 06:37 PM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-27-2024, 06:07 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    53 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    69 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X