SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extract a fasta sequence based on Id AND length Gabriel_ Bioinformatics 4 11-10-2015 01:20 PM
How to Extract Multiple Sequence from Multi Fasta File by ID list Anti Bioinformatics 4 02-02-2015 05:02 AM
Extract sequences from fasta in Windows niwa Bioinformatics 0 08-08-2014 01:49 AM
Extract gene sequences from gff3 file and reference fasta JonB Bioinformatics 1 07-15-2014 01:13 AM
How to use coordinates in order to extract sequences in FASTA file? prs321 Bioinformatics 1 09-14-2013 10:07 AM

Reply
 
Thread Tools
Old 12-16-2015, 04:44 PM   #1
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default Extract multiple fasta sequences from a fasta file based on sequenes

I've searched in the forum and google as well, but most of cases are extract sequences from fasta file based on id list. but now I only have several sequences without ids and i want to extract from a fasta file based on these sequences, does anyone have clues for it? Actually, that means I need the sequences identifier (or id) from the fasta file.

Thank you very much for your help.
entomology is offline   Reply With Quote
Old 12-16-2015, 05:35 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can extract sequences that share kmers with your sequences with BBDuk:

Code:
bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.
Brian Bushnell is offline   Reply With Quote
Old 12-16-2015, 05:41 PM   #3
cmbetts
Member
 
Location: Bay Area

Join Date: Jun 2012
Posts: 96
Default

There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

I'm sure some folks might have some grep based methods for the commandline as well
cmbetts is offline   Reply With Quote
Old 12-17-2015, 02:20 AM   #4
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

3xs for the info.


Quote:
Originally Posted by Brian Bushnell View Post
You can extract sequences that share kmers with your sequences with BBDuk:

Code:
bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.
entomology is offline   Reply With Quote
Old 12-17-2015, 02:27 AM   #5
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

Actually, I do prefer command based under shell environment. grep seems a great way to do it, thanks for tips.


Quote:
Originally Posted by cmbetts View Post
There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

I'm sure some folks might have some grep based methods for the commandline as well
entomology is offline   Reply With Quote
Old 12-17-2015, 07:15 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Here are my Python scripts to do this, with Galaxy wrappers:

This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
https://github.com/peterjc/pico_gala...q_filter_by_id

This indexes the FASTA file, then goes through the IDs one by one:
https://github.com/peterjc/pico_gala...q_select_by_id

The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).
maubp is offline   Reply With Quote
Old 12-17-2015, 08:49 AM   #7
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

Seems these script also deal with a list of ids, then using these ids to fetch sequences in a fasta file. I wanna use sequences directly and get a subset of fasta from a big fasta file base on provided sequences. Thank you for your information as well.


Quote:
Originally Posted by maubp View Post
Here are my Python scripts to do this, with Galaxy wrappers:

This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
https://github.com/peterjc/pico_gala...q_filter_by_id

This indexes the FASTA file, then goes through the IDs one by one:
https://github.com/peterjc/pico_gala...q_select_by_id

The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).
entomology is offline   Reply With Quote
Old 12-17-2015, 09:03 AM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

What scripting/programming language(s) are you learning?
maubp is offline   Reply With Quote
Old 12-17-2015, 11:48 AM   #9
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

Actually, I do more wet lab most of the time. Sometimes, I'll also do simple small rna analysis with ready-to-use perl script and simple shell script. It's enough most of the time with my work, but sometime it's not easy for me.

Quote:
Originally Posted by maubp View Post
What scripting/programming language(s) are you learning?
entomology is offline   Reply With Quote
Old 12-17-2015, 12:28 PM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.
maubp is offline   Reply With Quote
Old 12-17-2015, 12:39 PM   #11
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

No worry, I'll try to use grep to deal with the problem .
Quote:
Originally Posted by maubp View Post
Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.
entomology is offline   Reply With Quote
Old 12-17-2015, 01:08 PM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,856
Default

Brian's solution should work. Did you try it?

While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.
GenoMax is offline   Reply With Quote
Old 12-17-2015, 02:00 PM   #13
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

Yes, I've tried bbduk.sh.

bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

But my situation is that b.fa is not fasta file, it contain one sequence per line. I just want the sequence in b from a.fa, than make a new fasta file (c.fa).

since my b.fa is not a fasta file, so bbduk.sh give some error:

Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.


Quote:
Originally Posted by GenoMax View Post
Brian's solution should work. Did you try it?

While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.
entomology is offline   Reply With Quote
Old 12-17-2015, 02:23 PM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,856
Default

If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

Code:
$ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
Then use the file with BBDuk.
GenoMax is offline   Reply With Quote
Old 12-17-2015, 03:03 PM   #15
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

Thank you for the code. It can change my sequences to fasta file. And I try bbduk.fas again, but the result is not as expected. An example will be more easier to understand. there are two fasta

original.fas
>1123-11234
aaaaaa
>wer
atgcca
>ad
ctaacg
>232-23424
tttttt
>323-342
cacaaa
>416-2
gggggg
>13424241234-23423
cccccc
>5-234
cggcgtcacgttggttgttga


ref.fas(after I make fasta using your awk script)
>1
aaaaaa
>2
tttttt
>3
gggggg
>4
cccccc

I use "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

out.fas is like this
>5-234
cggcgtcacgttggttgttga

actually, I want a fasta like this

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Just like fetch the id from the original.fas


Quote:
Originally Posted by GenoMax View Post
If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

Code:
$ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
Then use the file with BBDuk.
entomology is offline   Reply With Quote
Old 12-18-2015, 10:14 AM   #16
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140
Default

I usually write a small basic program for such problems.

post/send the file, I send the result ?
gsgs is offline   Reply With Quote
Old 12-18-2015, 10:27 AM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.
Brian Bushnell is offline   Reply With Quote
Old 12-18-2015, 10:29 AM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,856
Default

@entomology: Try the following

Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

Code:
$  while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas
GenoMax is offline   Reply With Quote
Old 12-18-2015, 12:05 PM   #19
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Here is a simple script that uses an iterator to fetch records by sequence. This would be likely faster and less error-prone than grep:

Code:
#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;

my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
my $infilei = shift or die $usage;
my $infilej = shift or die $usage;

my %hash;
open my $ini, '<', $infilei or die $!;
while (my ($id, $seq) = fasta_it(\*$ini)) {
    $hash{$seq} = $id;
}
close $ini;

open my $inj, '<', $infilej or die $!;
while (my ($id, $seq) = fasta_it(\*$inj)) {
    if (exists $hash{$seq}) {
	print join "\n", ">".$hash{$seq}, "$seq\n";
    }
}
close $inj;

sub fasta_it {
    my ($fh) = @_;
    
    local $/ = "\n>";
    return unless my $entry = $fh->getline;
    chomp $entry;

    my ($id, $seq) = split /\n/, $entry, 2;
    defined $id && $id =~ s/>//g;
    return ($id, $seq);
}
Here is the gist for easier download: https://gist.github.com/sestaton/889cba88b5279a58d997

The output:

Code:
perl fetch_by_seq.pl i.fas j.fas 
>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc
Depending on the size of the original file you may want to think about using an SQLite database but this should work fine for most uses.
SES is offline   Reply With Quote
Old 12-18-2015, 01:25 PM   #20
entomology
Member
 
Location: zhejiang, china

Join Date: May 2011
Posts: 34
Default

I've upload the two file, the expected output is like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Thanks!

Quote:
Originally Posted by gsgs View Post
I usually write a small basic program for such problems.

post/send the file, I send the result ?
Attached Files
File Type: txt original.txt (159 Bytes, 6 views)
File Type: txt seq.txt (30 Bytes, 3 views)
entomology is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:08 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO