Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    I usually write a small basic program for such problems.

    post/send the file, I send the result ?

    Comment


    • #17
      Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

      Comment


      • #18
        @entomology: Try the following

        Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

        Code:
        $  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

        Comment


        • #19
          Here is a simple script that uses an iterator to fetch records by sequence. This would be likely faster and less error-prone than grep:

          Code:
          #!/usr/bin/env perl
          
          use strict;
          use warnings;
          use File::Basename;
          
          my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
          my $infilei = shift or die $usage;
          my $infilej = shift or die $usage;
          
          my %hash;
          open my $ini, '<', $infilei or die $!;
          while (my ($id, $seq) = fasta_it(\*$ini)) {
              $hash{$seq} = $id;
          }
          close $ini;
          
          open my $inj, '<', $infilej or die $!;
          while (my ($id, $seq) = fasta_it(\*$inj)) {
              if (exists $hash{$seq}) {
          	print join "\n", ">".$hash{$seq}, "$seq\n";
              }
          }
          close $inj;
          
          sub fasta_it {
              my ($fh) = @_;
              
              local $/ = "\n>";
              return unless my $entry = $fh->getline;
              chomp $entry;
          
              my ($id, $seq) = split /\n/, $entry, 2;
              defined $id && $id =~ s/>//g;
              return ($id, $seq);
          }
          Here is the gist for easier download: https://gist.github.com/sestaton/889cba88b5279a58d997

          The output:

          Code:
          perl fetch_by_seq.pl i.fas j.fas 
          >1123-11234
          aaaaaa
          >232-23424
          tttttt
          >416-2
          gggggg
          >13424241234-23423
          cccccc
          Depending on the size of the original file you may want to think about using an SQLite database but this should work fine for most uses.

          Comment


          • #20
            I've upload the two file, the expected output is like this:

            >1123-11234
            aaaaaa
            >232-23424
            tttttt
            >416-2
            gggggg
            >13424241234-23423
            cccccc

            Thanks!

            Originally posted by gsgs View Post
            I usually write a small basic program for such problems.

            post/send the file, I send the result ?
            Attached Files

            Comment


            • #21
              Actually, I've deal with some small rna sequence which is with length of 18-30. Anyway, thank you for your kindness.


              Originally posted by Brian Bushnell View Post
              Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

              Comment


              • #22
                Originally posted by entomology View Post
                I've upload the two file, the expected output is like this:

                >1123-11234
                aaaaaa
                >232-23424
                tttttt
                >416-2
                gggggg
                >13424241234-23423
                cccccc

                Thanks!
                That is what the post above (#19) produces. The question and expected results seem pretty simple, maybe you missed the previous post or are looking for another way?

                Comment


                • #23
                  Forgive my poor programming skill, still I got some error message as below

                  -bash: syntax error near unexpected token `do'



                  Originally posted by GenoMax View Post
                  @entomology: Try the following

                  Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

                  Code:
                  $  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

                  Comment


                  • #24
                    I've got below message when I running the script, do i miss some module?

                    Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

                    Originally posted by SES View Post
                    That is what the post above (#19) produces. The question and expected results seem pretty simple, maybe you missed the previous post or are looking for another way?

                    Comment


                    • #25
                      Originally posted by entomology View Post
                      I've got below message when I running the script, do i miss some module?

                      Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30
                      Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

                      Code:
                      return unless my $entry = $fh->getline;
                      with

                      Code:
                      return unless my $entry = <$fh>;
                      should do the trick.

                      Comment


                      • #26
                        I'm using centos

                        2.6.32-573.7.1.el6.centos.plus.x86_64

                        when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

                        ./fetch_fasta.test seq.test seq.test.original.fas
                        Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
                        >1
                        aaaaaa
                        >2
                        tttttt
                        >3
                        gggggg

                        I expected the result like this:

                        >1123-11234
                        aaaaaa
                        >232-23424
                        tttttt
                        >416-2
                        gggggg
                        >13424241234-23423
                        cccccc



                        Originally posted by SES View Post
                        Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

                        Code:
                        return unless my $entry = $fh->getline;
                        with

                        Code:
                        return unless my $entry = <$fh>;
                        should do the trick.

                        Comment


                        • #27
                          Originally posted by entomology View Post
                          I'm using centos

                          2.6.32-573.7.1.el6.centos.plus.x86_64

                          when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

                          ./fetch_fasta.test seq.test seq.test.original.fas
                          Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
                          >1
                          aaaaaa
                          >2
                          tttttt
                          >3
                          gggggg

                          I expected the result like this:

                          >1123-11234
                          aaaaaa
                          >232-23424
                          tttttt
                          >416-2
                          gggggg
                          >13424241234-23423
                          cccccc
                          You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

                          Code:
                          #!/usr/bin/env perl
                          
                          use strict;
                          use warnings;
                          use File::Basename;
                          
                          my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
                          my $infilei = shift or die $usage;
                          my $infilej = shift or die $usage;
                          
                          my $hash = index_seq($infilei);
                          open my $inj, '<', $infilej or die $!;
                          {
                              local $/ = '>';
                          
                              while (my $line = <$inj>) {
                          	chomp $line;
                          	my ($seqid, @seqparts) = split /\n/, $line;
                          	my $seq = join '', @seqparts;
                          	next unless defined $seqid && defined $seq;
                          	if (exists $hash->{$seq}) {
                          	    print join "\n", ">".$hash->{$seq}, "$seq\n";
                          	}
                              }
                          }
                          close $inj;
                          
                          sub index_seq {
                              my ($file) = @_;
                              open my $in, '<', $file or die $!;
                          
                              my %hash;
                              {
                          	local $/ = '>';
                          
                          	while (my $line = <$in>) {
                          	    chomp $line;
                          	    my ($seqid, @seqparts) = split /\n/, $line;
                          	    my $seq = join '', @seqparts;
                          	    next unless defined $seqid && defined $seq;
                          	    $hash{$seq} = $seqid;
                          	}
                              }
                              close $in;
                          
                              return \%hash;
                          }
                          By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.
                          Last edited by SES; 12-18-2015, 02:14 PM.

                          Comment


                          • #28
                            Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

                            the command test gives the result as below:

                            perl -v

                            This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

                            perl -MIO::Handle -e 'print IO::Handle->VERSION'
                            1.28

                            thanks again for your patience and useful help!!!

                            Originally posted by SES View Post
                            You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

                            Code:
                            #!/usr/bin/env perl
                            
                            use strict;
                            use warnings;
                            use File::Basename;
                            
                            my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
                            my $infilei = shift or die $usage;
                            my $infilej = shift or die $usage;
                            
                            my $hash = index_seq($infilei);
                            open my $inj, '<', $infilej or die $!;
                            {
                                local $/ = '>';
                            
                                while (my $line = <$inj>) {
                            	chomp $line;
                            	my ($seqid, @seqparts) = split /\n/, $line;
                            	my $seq = join '', @seqparts;
                            	next unless defined $seqid && defined $seq;
                            	if (exists $hash->{$seq}) {
                            	    print join "\n", ">".$hash->{$seq}, "$seq\n";
                            	}
                                }
                            }
                            close $inj;
                            
                            sub index_seq {
                                my ($file) = @_;
                                open my $in, '<', $file or die $!;
                            
                                my %hash;
                                {
                            	local $/ = '>';
                            
                            	while (my $line = <$in>) {
                            	    chomp $line;
                            	    my ($seqid, @seqparts) = split /\n/, $line;
                            	    my $seq = join '', @seqparts;
                            	    next unless defined $seqid && defined $seq;
                            	    $hash{$seq} = $seqid;
                            	}
                                }
                                close $in;
                            
                                return \%hash;
                            }
                            By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.

                            Comment


                            • #29
                              Originally posted by entomology View Post
                              Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

                              the command test gives the result as below:

                              perl -v

                              This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

                              perl -MIO::Handle -e 'print IO::Handle->VERSION'
                              1.28

                              thanks again for your patience and useful help!!!
                              Great, thanks for the info. I'm creating a CentOS 6 image now to test your errors above.

                              Comment


                              • #30
                                Originally posted by entomology View Post
                                Forgive my poor programming skill, still I got some error message as below

                                -bash: syntax error near unexpected token `do'
                                Are you running bash shell? If you are not then try explicitly going into bash like this

                                Code:
                                $ /bin/bash
                                $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X