Seqanswers Leaderboard Ad

**gsgs** · 12-18-2015, 10:14 AM

I usually write a small basic program for such problems.

post/send the file, I send the result ?

**Brian Bushnell** · 12-18-2015, 10:27 AM

Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

**GenoMax** · 12-18-2015, 10:29 AM

@entomology: Try the following

Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

Code:

$  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

**SES** · 12-18-2015, 12:05 PM

Here is a simple script that uses an iterator to fetch records by sequence. This would be likely faster and less error-prone than grep:

Code:

#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;

my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
my $infilei = shift or die $usage;
my $infilej = shift or die $usage;

my %hash;
open my $ini, '<', $infilei or die $!;
while (my ($id, $seq) = fasta_it(\*$ini)) {
    $hash{$seq} = $id;
}
close $ini;

open my $inj, '<', $infilej or die $!;
while (my ($id, $seq) = fasta_it(\*$inj)) {
    if (exists $hash{$seq}) {
	print join "\n", ">".$hash{$seq}, "$seq\n";
    }
}
close $inj;

sub fasta_it {
    my ($fh) = @_;
    
    local $/ = "\n>";
    return unless my $entry = $fh->getline;
    chomp $entry;

    my ($id, $seq) = split /\n/, $entry, 2;
    defined $id && $id =~ s/>//g;
    return ($id, $seq);
}

Here is the gist for easier download: https://gist.github.com/sestaton/889cba88b5279a58d997

The output:

Code:

perl fetch_by_seq.pl i.fas j.fas 
>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Depending on the size of the original file you may want to think about using an SQLite database but this should work fine for most uses.

**entomology** · 12-18-2015, 01:25 PM

I've upload the two file, the expected output is like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Thanks!

Originally posted by gsgs View Post

I usually write a small basic program for such problems.

post/send the file, I send the result ?

Attached Files

**entomology** · 12-18-2015, 01:31 PM

Actually, I've deal with some small rna sequence which is with length of 18-30. Anyway, thank you for your kindness.

Originally posted by Brian Bushnell View Post

Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

**SES** · 12-18-2015, 01:36 PM

Originally posted by entomology View Post

I've upload the two file, the expected output is like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Thanks!

That is what the post above (#19) produces. The question and expected results seem pretty simple, maybe you missed the previous post or are looking for another way?

**entomology** · 12-18-2015, 01:37 PM

Forgive my poor programming skill, still I got some error message as below

-bash: syntax error near unexpected token `do'

Originally posted by GenoMax View Post

@entomology: Try the following

Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

Code:

$  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

**entomology** · 12-18-2015, 01:42 PM

I've got below message when I running the script, do i miss some module?

Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

Originally posted by SES View Post

That is what the post above (#19) produces. The question and expected results seem pretty simple, maybe you missed the previous post or are looking for another way?

**SES** · 12-18-2015, 01:49 PM

Originally posted by entomology View Post

I've got below message when I running the script, do i miss some module?

Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

Code:

return unless my $entry = $fh->getline;

with

Code:

return unless my $entry = <$fh>;

should do the trick.

**entomology** · 12-18-2015, 01:58 PM

I'm using centos

2.6.32-573.7.1.el6.centos.plus.x86_64

when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

./fetch_fasta.test seq.test seq.test.original.fas
Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
>1
aaaaaa
>2
tttttt
>3
gggggg

I expected the result like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Originally posted by SES View Post

Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

Code:

return unless my $entry = $fh->getline;

with

Code:

return unless my $entry = <$fh>;

should do the trick.

**SES** · 12-18-2015, 02:12 PM

Originally posted by entomology View Post

I'm using centos

2.6.32-573.7.1.el6.centos.plus.x86_64

when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

./fetch_fasta.test seq.test seq.test.original.fas
Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
>1
aaaaaa
>2
tttttt
>3
gggggg

I expected the result like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

Code:

#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;

my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
my $infilei = shift or die $usage;
my $infilej = shift or die $usage;

my $hash = index_seq($infilei);
open my $inj, '<', $infilej or die $!;
{
    local $/ = '>';

    while (my $line = <$inj>) {
	chomp $line;
	my ($seqid, @seqparts) = split /\n/, $line;
	my $seq = join '', @seqparts;
	next unless defined $seqid && defined $seq;
	if (exists $hash->{$seq}) {
	    print join "\n", ">".$hash->{$seq}, "$seq\n";
	}
    }
}
close $inj;

sub index_seq {
    my ($file) = @_;
    open my $in, '<', $file or die $!;

    my %hash;
    {
	local $/ = '>';

	while (my $line = <$in>) {
	    chomp $line;
	    my ($seqid, @seqparts) = split /\n/, $line;
	    my $seq = join '', @seqparts;
	    next unless defined $seqid && defined $seq;
	    $hash{$seq} = $seqid;
	}
    }
    close $in;

    return \%hash;
}

By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.

**entomology** · 12-18-2015, 02:23 PM

Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

the command test gives the result as below:

perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

perl -MIO::Handle -e 'print IO::Handle->VERSION'
1.28

thanks again for your patience and useful help!!!

Originally posted by SES View Post

You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

Code:

#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;

my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
my $infilei = shift or die $usage;
my $infilej = shift or die $usage;

my $hash = index_seq($infilei);
open my $inj, '<', $infilej or die $!;
{
    local $/ = '>';

    while (my $line = <$inj>) {
	chomp $line;
	my ($seqid, @seqparts) = split /\n/, $line;
	my $seq = join '', @seqparts;
	next unless defined $seqid && defined $seq;
	if (exists $hash->{$seq}) {
	    print join "\n", ">".$hash->{$seq}, "$seq\n";
	}
    }
}
close $inj;

sub index_seq {
    my ($file) = @_;
    open my $in, '<', $file or die $!;

    my %hash;
    {
	local $/ = '>';

	while (my $line = <$in>) {
	    chomp $line;
	    my ($seqid, @seqparts) = split /\n/, $line;
	    my $seq = join '', @seqparts;
	    next unless defined $seqid && defined $seq;
	    $hash{$seq} = $seqid;
	}
    }
    close $in;

    return \%hash;
}

By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.

**SES** · 12-18-2015, 02:40 PM

Originally posted by entomology View Post

Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

the command test gives the result as below:

perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

perl -MIO::Handle -e 'print IO::Handle->VERSION'
1.28

thanks again for your patience and useful help!!!

Great, thanks for the info. I'm creating a CentOS 6 image now to test your errors above.

**GenoMax** · 12-18-2015, 03:22 PM

Originally posted by entomology View Post

Forgive my poor programming skill, still I got some error message as below

-bash: syntax error near unexpected token `do'

Are you running bash shell? If you are not then try explicitly going into bash like this

Code:

$ /bin/bash
$ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News