![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Extract a fasta sequence based on Id AND length | Gabriel_ | Bioinformatics | 4 | 11-10-2015 01:20 PM |
How to Extract Multiple Sequence from Multi Fasta File by ID list | Anti | Bioinformatics | 4 | 02-02-2015 05:02 AM |
Extract sequences from fasta in Windows | niwa | Bioinformatics | 0 | 08-08-2014 01:49 AM |
Extract gene sequences from gff3 file and reference fasta | JonB | Bioinformatics | 1 | 07-15-2014 01:13 AM |
How to use coordinates in order to extract sequences in FASTA file? | prs321 | Bioinformatics | 1 | 09-14-2013 10:07 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
I've searched in the forum and google as well, but most of cases are extract sequences from fasta file based on id list. but now I only have several sequences without ids and i want to extract from a fasta file based on these sequences, does anyone have clues for it? Actually, that means I need the sequences identifier (or id) from the fasta file.
Thank you very much for your help. |
![]() |
![]() |
![]() |
#2 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
You can extract sequences that share kmers with your sequences with BBDuk:
Code:
bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31 You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient. |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Bay Area Join Date: Jun 2012
Posts: 120
|
![]()
There's almost certainly lots of ways to do this depending on what tools you're comfortable with.
In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file. It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record. I'm sure some folks might have some grep based methods for the commandline as well |
![]() |
![]() |
![]() |
#4 | |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
3xs for the info.
Quote:
|
|
![]() |
![]() |
![]() |
#5 | |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
Actually, I do prefer command based under shell environment. grep seems a great way to do it, thanks for tips.
Quote:
|
|
![]() |
![]() |
![]() |
#6 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Here are my Python scripts to do this, with Galaxy wrappers:
This filters the FASTA file (loads all the IDs, then goes through the FASTA file once): https://github.com/peterjc/pico_gala...q_filter_by_id This indexes the FASTA file, then goes through the IDs one by one: https://github.com/peterjc/pico_gala...q_select_by_id The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower). |
![]() |
![]() |
![]() |
#7 | |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
Seems these script also deal with a list of ids, then using these ids to fetch sequences in a fasta file. I wanna use sequences directly and get a subset of fasta from a big fasta file base on provided sequences. Thank you for your information as well.
Quote:
|
|
![]() |
![]() |
![]() |
#8 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
What scripting/programming language(s) are you learning?
|
![]() |
![]() |
![]() |
#9 |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
Actually, I do more wet lab most of the time. Sometimes, I'll also do simple small rna analysis with ready-to-use perl script and simple shell script. It's enough most of the time with my work, but sometime it's not easy for me.
|
![]() |
![]() |
![]() |
#10 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.
|
![]() |
![]() |
![]() |
#11 |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]() |
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,089
|
![]()
Brian's solution should work. Did you try it?
While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines. |
![]() |
![]() |
![]() |
#13 |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
Yes, I've tried bbduk.sh.
bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31 But my situation is that b.fa is not fasta file, it contain one sequence per line. I just want the sequence in b from a.fa, than make a new fasta file (c.fa). since my b.fa is not a fasta file, so bbduk.sh give some error: Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text. |
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,089
|
![]()
If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)
Code:
$ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta |
![]() |
![]() |
![]() |
#15 | |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
Thank you for the code. It can change my sequences to fasta file. And I try bbduk.fas again, but the result is not as expected. An example will be more easier to understand. there are two fasta
original.fas >1123-11234 aaaaaa >wer atgcca >ad ctaacg >232-23424 tttttt >323-342 cacaaa >416-2 gggggg >13424241234-23423 cccccc >5-234 cggcgtcacgttggttgttga ref.fas(after I make fasta using your awk script) >1 aaaaaa >2 tttttt >3 gggggg >4 cccccc I use "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21" out.fas is like this >5-234 cggcgtcacgttggttgttga actually, I want a fasta like this >1123-11234 aaaaaa >232-23424 tttttt >416-2 gggggg >13424241234-23423 cccccc Just like fetch the id from the original.fas Quote:
|
|
![]() |
![]() |
![]() |
#16 |
Senior Member
Location: germany Join Date: Oct 2009
Posts: 140
|
![]()
I usually write a small basic program for such problems.
post/send the file, I send the result ? |
![]() |
![]() |
![]() |
#17 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.
|
![]() |
![]() |
![]() |
#18 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,089
|
![]()
@entomology: Try the following
Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line). Code:
$ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas |
![]() |
![]() |
![]() |
#19 |
Senior Member
Location: Vancouver, BC Join Date: Mar 2010
Posts: 275
|
![]()
Here is a simple script that uses an iterator to fetch records by sequence. This would be likely faster and less error-prone than grep:
Code:
#!/usr/bin/env perl use strict; use warnings; use File::Basename; my $usage = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas"; my $infilei = shift or die $usage; my $infilej = shift or die $usage; my %hash; open my $ini, '<', $infilei or die $!; while (my ($id, $seq) = fasta_it(\*$ini)) { $hash{$seq} = $id; } close $ini; open my $inj, '<', $infilej or die $!; while (my ($id, $seq) = fasta_it(\*$inj)) { if (exists $hash{$seq}) { print join "\n", ">".$hash{$seq}, "$seq\n"; } } close $inj; sub fasta_it { my ($fh) = @_; local $/ = "\n>"; return unless my $entry = $fh->getline; chomp $entry; my ($id, $seq) = split /\n/, $entry, 2; defined $id && $id =~ s/>//g; return ($id, $seq); } The output: Code:
perl fetch_by_seq.pl i.fas j.fas >1123-11234 aaaaaa >232-23424 tttttt >416-2 gggggg >13424241234-23423 cccccc |
![]() |
![]() |
![]() |
#20 |
Member
Location: zhejiang, china Join Date: May 2011
Posts: 34
|
![]()
I've upload the two file, the expected output is like this:
>1123-11234 aaaaaa >232-23424 tttttt >416-2 gggggg >13424241234-23423 cccccc Thanks! |
![]() |
![]() |
![]() |
Thread Tools | |
|
|