SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
FASTQ manipulation in Galaxy blankenberg Bioinformatics 6 11-20-2012 11:59 PM
Whitespace in FASTA file mnkyboy Bioinformatics 2 11-13-2011 07:49 PM
MagicViewer - Not taking Fasta file BioTalk Bioinformatics 2 03-04-2010 03:48 AM
BEDTools: A flexible suite of utilities for comparing genomic features nilshomer Literature Watch 5 02-01-2010 09:36 AM
Tophat - fasta file ytmnd85 Bioinformatics 0 01-19-2010 12:38 PM

Reply
 
Thread Tools
Old 06-04-2010, 01:43 PM   #1
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default FastaHack - FASTA file manipulation and subsequence extraction utilities

I recently completed a C++ library and command-line utility which can be used to index and rapidly extract sequences and subsequences from FASTA files.

I wanted to publicize the work here as it has been quite useful to me and a number of my fellow lab members.

The utilities can be obtained using git via the project repository on github:

http://github.com/ekg/fastahack

Please let me know if obtaining the repository via git is a problem, and I will make a tarball release.

What follows is the FastaHack README:



fastahack --- *fast* FASTA file indexing, subsequence and sequence extraction

Author: Erik Garrison <erik.garrison@bc.edu>, Marth Lab, Boston College
Date: May 7, 2010


Overview:

fastahack is a small application for indexing and extracting sequences and
subsequences from FASTA files. The included Fasta.cpp library provides a FASTA
reader and indexer that can be embedded into applications which would benefit
from directly reading subsequences from FASTA files. The library automatically
handles index file generation and use.


Features:

- FASTA index (.fai) generation for FASTA files
- Sequence extraction
- Subsequence extraction
- Sequence statistics (TODO: currently only length is provided)

Sequence and subsequence extraction use fseek64 to provide fastest-possible
extraction without RAM-intensive file loading operations. This makes fastahack
a useful tool for bioinformaticists who need to quickly extract many
subsequences from a reference FASTA sequence.


Notes:

The index files generated by this system should be numerically equivalent to
those generated by samtools (http://samtools.sourceforge.net/). However, while
samtools truncates sequence names in the index file, fastahack provides them
completely.

To simplify use, sequences can be addressed by first whitespace-separated
field; e.g. "8 SN(Homo sapiens) GA(HG18) URI(NC_000008.9)" can be addressed
simply as "8", provided "8" is a unique first-field name in the FASTA file.
Thus, to extract 20bp starting at position 323202 in chromosome 8 from the
human reference:

% fastahack subsequence h.sapiens.fasta 8 323202 20
ACATTGTAATAGATCTCAGA

Usage information is provided by running fastahack with no arguments:

% fastahack
usage: fastahack <command> [options]
actions:
index <fasta reference>
sequence <fasta reference> <sequence name>
subsequence <fasta reference> <sequence name> <0-based start> <length>
stats <fasta reference> <sequence name> (returns sequence length)


Limitations:

fastahack will only generate indexes for FASTA files in which the sequences
have self-consistent line lengths. Trailing whitespace is allowed at the end
of sequences, but not embedded within the sequence. These limitations are
necessitated by the complexity of indexing sequences whose lines change in
length--- the use of indexes is frustrated by such inconsistencies; each change
in line length would require a new entry in the index file.
ekg is offline   Reply With Quote
Old 06-04-2010, 10:35 PM   #2
ShellfishGene
Member
 
Location: Germany

Join Date: Mar 2009
Posts: 14
Default

Personally I would find you program even more useful if there was an option to pipe sequence IDs and get the output on stdout!
I've been using bioperl in a simple script to do that, but a C++ program with more features would be nice, too!

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::Fasta;

my $file = shift;

unless ( $file && -e $file ) { print "Usage: echo 'seq1:5..15' | get_seq.pl sequences.fasta\n      echo 'seq1' | get_seq.pl sequences.fasta\n"; exit; }

my $db = Bio::DB::Fasta->new( $file );

while (<>){
  my $query = $_;
  chomp $query;

  my $sequence;
  if ( $query =~ /:/ ) {
    $query =~ /^(.+):(\d+)\.\.(\d+)/;
    unless ( $1 && $2 && $3 ) {
      die "problem parsing request string.\n";
    }

    $sequence = $db->seq($1, $2 => $3);
  }
  else {
    $sequence = $db->seq($query);
  }

  unless ( $sequence ) { die "Sequence $query not found. \n" }
  print ">$query\n", "$sequence\n";

}
ShellfishGene is offline   Reply With Quote
Old 06-05-2010, 06:12 AM   #3
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

Quote:
Originally Posted by ShellfishGene View Post
Personally I would find you program even more useful if there was an option to pipe sequence IDs and get the output on stdout!
I've been using bioperl in a simple script to do that, but a C++ program with more features would be nice, too!
That's a great idea. Presently you can get the same effect by using xargs, but it would be quite a bit more cumbersome. I've thought of using a BED file as input to similar effect.

I like the sequence and subsequence specification syntax that the utility you posted uses. I will probably adapt that into FastaHack. Does it use 0 or 1-based coordinates?
ekg is offline   Reply With Quote
Old 06-05-2010, 06:26 AM   #4
ShellfishGene
Member
 
Location: Germany

Join Date: Mar 2009
Posts: 14
Default

Quote:
Originally Posted by ekg View Post
I like the sequence and subsequence specification syntax that the utility you posted uses. I will probably adapt that into FastaHack. Does it use 0 or 1-based coordinates?
The syntax is actually quite widespread I think, Ensembl uses it for example. It is 1-based. One variation, used on the UCSC browser pages, is to use - instead of '..'. You might want to support both.
ShellfishGene is offline   Reply With Quote
Old 09-02-2010, 12:06 PM   #5
mslider
Junior Member
 
Location: france

Join Date: Sep 2010
Posts: 24
Default another option

maybe usefull to add a parameter to only count the number of sequence...
when you have million of sequence, grep -c "^>" is very low !
mslider is offline   Reply With Quote
Old 09-03-2010, 05:46 AM   #6
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by mslider View Post
maybe usefull to add a parameter to only count the number of sequence...
when you have million of sequence, grep -c "^>" is very low !
I disagree with part of this statement. There are myriad ways to index a fasta and these usually take a few seconds to a few minutes for millions of sequences. Then counting can take seconds. I just used grep to count 2.2 million 454 sequences and it took 13 seconds and did not create any huge index files. I would argue that grep is probably faster than creating an index then counting (in terms of overall time spent), but others may not agree. I agree that returning sequence stats from an index seems natural if you already have the index and it looks like this is on the author's to do list.
SES is offline   Reply With Quote
Old 09-03-2010, 08:57 AM   #7
Lee Sam
Member
 
Location: Ann Arbor, MI

Join Date: Oct 2008
Posts: 57
Default

Thanks for contributing this! It's literally exactly what I need.
Lee Sam is offline   Reply With Quote
Old 09-05-2010, 08:26 AM   #8
avilella
Member
 
Location: uk

Join Date: Mar 2009
Posts: 34
Default

There are two utilities that I am missing in current methods: I am using cdbfasta/cdbyank to index fastq files, but I would like to be able to compress the fastq file so that it takes up less space, even if it means a slower retrieval time. I would also like to be able to send a large number of ids, and retrieve the complement from them: the list of ids in the fastq file but not in the id list.
avilella is offline   Reply With Quote
Old 09-06-2010, 12:36 AM   #9
mslider
Junior Member
 
Location: france

Join Date: Sep 2010
Posts: 24
Default extract just subsequence

If you just want to extract a subsequence from a big sequence like a chromosome,
the program below is more faster and without creating index file:


Code:
#include<iostream>
#include<string.h>
#include<fstream>
#include <stdlib.h>
using namespace std;

 /* Steps:-
1- Download FASTA file and then remove the header. (>asdasfdasfassa)
2- Remove new lines from FASTA file. (using sed or perl)
3- Then you can use the C++  program like this in linux:

./ExtractSequence inputfilename start stop
*/

  int GetIntVal(string strConvert) {
              int intReturn;
              intReturn = atoi(strConvert.c_str());
              return(intReturn);
  }

int main(int argc ,char* argv[]){

       string line1;
	   ifstream myFile(argv[1]);
	   if(! myFile){
	      cout << "Error opening file" << endl;
		  return -1;
	   }
	   while(! myFile.eof()){
	       getline(myFile, line1);

			 string r1 = argv[2];
			 string r2 = argv[3];
			 int range1 = GetIntVal(r1);
			 int range2 =  GetIntVal(r2)- range1;
			 cout << ">Sample Sequence" << endl;
			 cout << line1.substr(range1,range2) << endl;
	   }
	   myFile.close();
    return 0;
}
mslider is offline   Reply With Quote
Old 09-06-2010, 05:14 AM   #10
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

BEDTools' fastaFromBed utility allows you to extract (sub)sequences from a FASTA file using a BED/GFF/VCF file with intervals as input. It also supports strand specific sequence queries so you can extract strand specific features, such as exons.
BEDTools: http://code.google.com/p/bedtools/
Thomas Doktor is offline   Reply With Quote
Old 01-28-2013, 04:10 AM   #11
shaldenby
Junior Member
 
Location: Cambridge, UK

Join Date: Mar 2012
Posts: 1
Default

This is a really useful little tool. Thanks very much!
shaldenby is offline   Reply With Quote
Old 01-29-2013, 03:02 PM   #12
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

Thank you ekg for your fastahack tool. The tool seems to extract the sequence by its position in the fasta file. I was wondering if it can extract a provided subsequence from the fasta file, and if so, what if the provided subsequence occurs multiple times in the fasta file ?
mattanswers is offline   Reply With Quote
Old 01-29-2013, 03:07 PM   #13
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

@mattanswers This sounds like a job for your favorite aligner. For short sequences, you can use smith-waterman (https://github.com/ekg/smithwaterman) but for bigger stuff I'd use something like blat or encode your sequences in FASTA and align them.

As for multiple mappings, you'll have to find a mapper that generates them. MOSAIK does, and I believe so does MrsFast.
ekg is offline   Reply With Quote
Old 01-30-2013, 10:25 AM   #14
mattanswers
Member
 
Location: Boston

Join Date: Oct 2009
Posts: 65
Default

Thank you for your help, and also for writing and sharing FastaHack.
mattanswers is offline   Reply With Quote
Reply

Tags
fasta, indexing, manipulation, subsequence extraction, utility

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:28 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO