**ShellfishGene** · 06-04-2010, 10:35 PM

Personally I would find your program even more useful if there was an option to pipe in sequence IDs and get the output on stdout!
I've been using BioPerl in a simple script to do that, but a C++ program with more features would be nice, too!

Code:

#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::Fasta;

my $file = shift;

# Print usage and exit with a non-zero status if no readable FASTA file was given.
unless ( $file && -e $file ) {
  print "Usage: echo 'seq1:5..15' | get_seq.pl sequences.fasta\n",
        "       echo 'seq1' | get_seq.pl sequences.fasta\n";
  exit 1;
}

my $db = Bio::DB::Fasta->new( $file );

while (<>){
  my $query = $_;
  chomp $query;

  my $sequence;
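  if ( $query =~ /:/ ) {
    # Range request "id:start..end"; coordinates are 1-based and inclusive.
    # Check the match result directly and anchor the end of the pattern,
    # so that trailing junk such as "seq1:5..15abc" is rejected.
    my ( $id, $start, $end ) = $query =~ /^(.+):(\d+)\.\.(\d+)$/
      or die "problem parsing request string '$query'.\n";

    $sequence = $db->seq($id, $start => $end);
  }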
  else {
    $sequence = $db->seq($query);
  }

  unless ( $sequence ) { die "Sequence $query not found. \n" }
  print ">$query\n", "$sequence\n";

}

**ekg** · 06-05-2010, 06:12 AM

Originally posted by ShellfishGene

Personally I would find your program even more useful if there was an option to pipe in sequence IDs and get the output on stdout!
I've been using BioPerl in a simple script to do that, but a C++ program with more features would be nice, too!

That's a great idea. Presently you can get the same effect by using xargs, but it would be quite a bit more cumbersome. I've thought of using a BED file as input to similar effect.

I like the sequence and subsequence specification syntax that the utility you posted uses. I will probably adapt that into FastaHack. Does it use 0 or 1-based coordinates?

**ShellfishGene** · 06-05-2010, 06:26 AM

Originally posted by ekg

I like the sequence and subsequence specification syntax that the utility you posted uses. I will probably adapt that into FastaHack. Does it use 0 or 1-based coordinates?

I think the syntax is actually quite widespread; Ensembl uses it, for example. It is 1-based. One variation, used on the UCSC browser pages, is to use '-' instead of '..'. You might want to support both.
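For anyone who wants to support both separators, here is a minimal C++ sketch of the parsing. It assumes 1-based, inclusive coordinates as described above. `Region`, `parse_region` and `extract` are made-up names for illustration only; they are not part of FastaHack or BioPerl, and very large numbers would make `std::stol` throw.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper type for a parsed request such as "seq1:5..15".
struct Region {
    std::string name;  // sequence ID
    long start = 0;    // 1-based, inclusive; 0 means "whole sequence"
    long end = 0;      // 1-based, inclusive; 0 means "whole sequence"
};

// Accepts "name", "name:start..end" (Ensembl style) or "name:start-end" (UCSC style).
// Returns std::nullopt when a ':' is present but the range cannot be parsed.
std::optional<Region> parse_region(const std::string& query) {
    static const std::regex range_re(R"(^(.+):(\d+)(?:\.\.|-)(\d+)$)");
    std::smatch m;
    if (std::regex_match(query, m, range_re)) {
        Region r{m[1].str(), std::stol(m[2].str()), std::stol(m[3].str())};
        if (r.start < 1 || r.end < r.start) return std::nullopt;
        return r;
    }
    if (query.empty() || query.find(':') != std::string::npos) return std::nullopt;
    return Region{query, 0, 0};
}

// Converts the 1-based inclusive range to the 0-based offset and length
// that std::string::substr expects.
std::string extract(const std::string& seq, const Region& r) {
    if (r.start == 0) return seq;                 // no range requested
    assert(r.end >= r.start);                     // guaranteed by parse_region
    const std::size_t offset = static_cast<std::size_t>(r.start - 1);
    if (offset >= seq.size()) return "";          // range starts past the end
    return seq.substr(offset, static_cast<std::size_t>(r.end - r.start + 1));
}
```

With this sketch, `parse_region("seq1:5..15")` and `parse_region("seq1:5-15")` give the same region, and a bare `"seq1"` requests the whole sequence.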

**mslider** · 09-02-2010, 12:06 PM

another option

It may be useful to add a parameter that only counts the number of sequences.
When you have millions of sequences, grep -c "^>" is very slow!

**SES** · 09-03-2010, 05:46 AM

Originally posted by mslider

It may be useful to add a parameter that only counts the number of sequences.
When you have millions of sequences, grep -c "^>" is very slow!

I disagree with part of this statement. There are myriad ways to index a fasta and these usually take a few seconds to a few minutes for millions of sequences. Then counting can take seconds. I just used grep to count 2.2 million 454 sequences and it took 13 seconds and did not create any huge index files. I would argue that grep is probably faster than creating an index then counting (in terms of overall time spent), but others may not agree. I agree that returning sequence stats from an index seems natural if you already have the index and it looks like this is on the author's to do list.
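For comparison, the index-free count that `grep -c "^>"` performs can be sketched in a few lines of C++. This is only an illustration, assuming every record header starts with '>' in the first column; the function names are hypothetical and nothing here is taken from FastaHack.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts FASTA records by counting lines that begin with '>'.
// A single pass over the stream; no index file is written.
std::size_t count_fasta_records(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') ++n;
    }
    return n;
}

// Convenience wrapper for in-memory text, handy for quick checks.
std::size_t count_fasta_records_in_text(const std::string& text) {
    std::istringstream in(text);
    return count_fasta_records(in);
}
```

To count a file, open it with `std::ifstream` and pass the stream to `count_fasta_records`.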

**Lee Sam** · 09-03-2010, 08:57 AM

Thanks for contributing this! It's literally exactly what I need.

**avilella** · 09-05-2010, 08:26 AM

There are two utilities that I am missing in current methods: I am using cdbfasta/cdbyank to index fastq files, but I would like to be able to compress the fastq file so that it takes up less space, even if it means a slower retrieval time. I would also like to be able to send a large number of ids, and retrieve the complement from them: the list of ids in the fastq file but not in the id list.
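The "complement" retrieval described here can be sketched without an index by streaming the file once. This is a rough illustration, not a cdbfasta/cdbyank feature: it assumes strict 4-line FASTQ records (no wrapped sequence lines) and takes the ID to be the header text after '@' up to the first whitespace. `fastq_id` and `write_complement` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_set>

// ID = header text after the leading '@', up to the first space or tab.
// Precondition: header is non-empty (checked by the caller below).
std::string fastq_id(const std::string& header) {
    std::size_t end = header.find_first_of(" \t");
    if (end == std::string::npos) end = header.size();
    return header.substr(1, end - 1);
}

// Writes every record whose ID is NOT in `exclude`; returns how many were written.
std::size_t write_complement(std::istream& fastq,
                             const std::unordered_set<std::string>& exclude,
                             std::ostream& out) {
    std::size_t written = 0;
    std::string header, seq, plus, qual;
    while (std::getline(fastq, header) && std::getline(fastq, seq) &&
           std::getline(fastq, plus) && std::getline(fastq, qual)) {
        if (header.empty() || header[0] != '@') break;  // not 4-line FASTQ; stop
        if (exclude.count(fastq_id(header)) == 0) {
            out << header << '\n' << seq << '\n' << plus << '\n' << qual << '\n';
            ++written;
        }
    }
    return written;
}
```

The ID list is held in memory as a hash set, so each lookup is constant time on average and the FASTQ file itself is never loaded whole.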

**mslider** · 09-06-2010, 12:36 AM

extract just subsequence

If you just want to extract a subsequence from a big sequence like a chromosome,
the program below is faster and does not create an index file:

Code:

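#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

/* Steps:
1- Download the FASTA file and remove the header line (>asdasfdasfassa).
2- Remove newlines from the FASTA file (using sed or perl).
3- Then run the C++ program like this on Linux:

./ExtractSequence inputfilename start stop

Note: start is a 0-based offset and stop is exclusive.
*/

int GetIntVal(const string& strConvert) {
    return atoi(strConvert.c_str());
}

int main(int argc, char* argv[]) {
    // All three arguments are required.
    if (argc != 4) {
        cerr << "Usage: " << argv[0] << " inputfilename start stop" << endl;
        return 1;
    }

    ifstream myFile(argv[1]);
    if (!myFile) {
        cerr << "Error opening file" << endl;
        return 1;
    }

    // Parse the range once, outside the read loop.
    int range1 = GetIntVal(argv[2]);
    int range2 = GetIntVal(argv[3]) - range1;  // length of the subsequence
    if (range1 < 0 || range2 < 0) {
        cerr << "start and stop must satisfy 0 <= start <= stop" << endl;
        return 1;
    }

    // Test getline itself: looping on !eof() processes a spurious empty last line.
    string line1;
    while (getline(myFile, line1)) {
        // substr throws if the start offset is past the end of the line.
        if (static_cast<size_t>(range1) > line1.size()) continue;
        cout << ">Sample Sequence" << endl;
        cout << line1.substr(range1, range2) << endl;
    }
    return 0;
}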

**Thomas Doktor** · 09-06-2010, 05:14 AM

BEDTools' fastaFromBed utility allows you to extract (sub)sequences from a FASTA file using a BED/GFF/VCF file with intervals as input. It also supports strand specific sequence queries so you can extract strand specific features, such as exons.
BEDTools: http://code.google.com/p/bedtools/

**shaldenby** · 01-28-2013, 05:10 AM

This is a really useful little tool. Thanks very much!

**mattanswers** · 01-29-2013, 04:02 PM

Thank you, ekg, for your fastahack tool. The tool seems to extract a sequence by its position in the FASTA file. I was wondering if it can instead find and extract a provided subsequence from the FASTA file, and if so, what happens if the provided subsequence occurs multiple times in the file?

**ekg** · 01-29-2013, 04:07 PM

@mattanswers This sounds like a job for your favorite aligner. For short sequences, you can use Smith-Waterman (https://github.com/ekg/smithwaterman), but for bigger stuff I'd use something like BLAT, or encode your sequences in FASTA and align them.

As for multiple mappings, you'll have to find a mapper that generates them. MOSAIK does, and I believe mrsFAST does too.

**mattanswers** · 01-30-2013, 11:25 AM

Thank you for your help, and also for writing and sharing FastaHack.


FastaHack - FASTA file manipulation and subsequence extraction utilities
