

Old 07-09-2012, 10:43 AM   #1
Location: Arnhem

Join Date: Feb 2012
Posts: 16
Fastest way to 'parse' fasta or fastq?

I am looking for a really fast (high-performance computing / HPC) fasta or fastq parsing program. I just want the simplest statistics imaginable:
- number of reads
- total number of bases.
Other stuff like average length/ATCG composition is nice, but not required.

I searched the software page, tried some packages, and wrote my own parsers, but they are all slow.
I am looking for something written in C, which I hope can be super fast.
I also tried this simple bash code:
" time grep -v '^>' ./test.fa | wc -m -l"

which is 'fast': 30 seconds to scan a 1 GB fasta file (with the file already in memory).
My simple Python script takes over a minute to scan this file. I hope this can be done faster, or all in one script.

If you want to scan gigabytes of files, it would be nice to have a very fast parser.
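For reference, the grep | wc pipeline above amounts to a single pass over the file. A minimal C sketch of that pass follows, assuming plain (uncompressed) fasta where every line starting with '>' is a header. The names fasta_stats, fasta_init, fasta_feed and fasta_count_stream are made up for illustration. Unlike wc -m, it does not count newline characters as bases.

```c
#include <stdio.h>
#include <stddef.h>

/* Running totals plus the parser state carried across blocks.
   All names here are hypothetical, chosen for this sketch. */
struct fasta_stats {
    unsigned long reads;   /* number of '>' header lines */
    unsigned long bases;   /* non-whitespace bytes on sequence lines */
    int at_line_start;     /* 1 if the next byte begins a new line */
    int in_header;         /* 1 while inside a '>' header line */
};

void fasta_init(struct fasta_stats *st)
{
    st->reads = 0;
    st->bases = 0;
    st->at_line_start = 1;
    st->in_header = 0;
}

/* Feed one block of bytes; a block may end in the middle of a line. */
void fasta_feed(struct fasta_stats *st, const char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        char c = buf[i];
        if (st->at_line_start) {
            st->in_header = (c == '>');
            if (st->in_header)
                st->reads++;
            st->at_line_start = 0;
        }
        if (c == '\n')
            st->at_line_start = 1;
        else if (!st->in_header && c != '\r' && c != ' ' && c != '\t')
            st->bases++;
    }
}

/* Read a whole stream in 1 MiB blocks; returns 0 on success, -1 on a read error. */
int fasta_count_stream(FILE *in, struct fasta_stats *st)
{
    static char buf[1 << 20];
    size_t n;

    fasta_init(st);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fasta_feed(st, buf, n);
    return ferror(in) ? -1 : 0;
}
```

A main() would call fasta_count_stream(stdin, &st) and print st.reads and st.bases. Whether this beats grep on a given machine is something to measure rather than assume.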

Is anyone aware of such a program? Or, what is the fastest program you know of?
HenrivdGeest
Old 07-09-2012, 11:24 AM   #2
Location: Wisconsin

Join Date: Jun 2011
Posts: 87


You could try the FastQC package if you haven't already. It can take fastq/bam/sam files and gives most of the important statistics for an NGS run.
aggp11
Old 07-09-2012, 11:32 AM   #3
Location: cinci

Join Date: Apr 2010
Posts: 66

I suggest using native Linux tools such as grep, sed, and awk in a multithreaded environment. A 64-bit build may also be useful in some applications where it is supported. There is also the option of using CUDA with a GPU to do super fast calculations.
husamia
Old 07-09-2012, 12:17 PM   #4
Senior Member
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

JackieBadger
Old 07-09-2012, 12:36 PM   #5
Richard Finney
Senior Member
Location: bethesda

Join Date: Feb 2009
Posts: 700

If you're up for modifying a couple of lines of code for your needs,
this should do the trick (it reads fasta from stdin) ...
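#include <stdio.h>
#include <string.h>
#include <ctype.h>

unsigned long int sum[5];        /* counts of A, C, G, T, N */
unsigned long int basecount = 0;
unsigned long int readcount = 0;
char s[512];

int main(void)
{
    int i, j;
    int ch;
    int line_start = 1;  /* next chunk begins a new line */
    int in_header = 0;   /* current line is a '>' header */
    size_t len;

    /* fgets replaces gets(), which cannot bound the read; lines longer
       than the buffer arrive in several chunks */
    while (fgets(s, sizeof s, stdin)) {
        if (line_start) {
            in_header = (s[0] == '>');
            if (in_header) readcount++;  /* one header per read */
        }
        len = strlen(s);
        line_start = (len > 0 && s[len - 1] == '\n');
        if (in_header) continue; // skip fasta entry header
        for (i = 0; s[i]; i++) {
            ch = toupper((unsigned char)s[i]);
            if (ch == 'A') { sum[0]++; basecount++; }
            else if (ch == 'C') { sum[1]++; basecount++; }
            else if (ch == 'G') { sum[2]++; basecount++; }
            else if (ch == 'T') { sum[3]++; basecount++; }
            else if (ch == 'N') { sum[4]++; basecount++; }
        }
    }
    for (j = 0; j < 5; j++) {
        if (j == 0) printf("A ");
        else if (j == 1) printf("C ");
        else if (j == 2) printf("G ");
        else if (j == 3) printf("T ");
        else if (j == 4) printf("N ");
        printf("%lu ", sum[j]);
    }
    printf("bases = %lu \n", basecount);
    printf("reads = %lu \n", readcount);
    return 0;
}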
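
The code above only understands fasta. A fastq variant could be sketched along the same lines, assuming strict 4-line records (header, sequence, '+', quality) with no line wrapping. Wrapped fastq would need a real parser. The names fastq_stats, fastq_init, fastq_feed and fastq_count_stream are made up for illustration. Counting by line position rather than by a leading '@' avoids miscounting quality lines that happen to start with '@'.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical names; assumes 4-line fastq records. */
struct fastq_stats {
    unsigned long reads;  /* records seen */
    unsigned long bases;  /* bytes on sequence lines */
    unsigned line;        /* line index within the record: 0..3 */
    int line_empty;       /* 1 until a byte is seen on the current line */
};

void fastq_init(struct fastq_stats *st)
{
    st->reads = 0;
    st->bases = 0;
    st->line = 0;
    st->line_empty = 1;
}

/* Feed one block of bytes; a block may end in the middle of a line. */
void fastq_feed(struct fastq_stats *st, const char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        char c = buf[i];
        if (c == '\n') {
            /* blank lines between records are skipped, not counted */
            if (!(st->line == 0 && st->line_empty))
                st->line = (st->line + 1) & 3u;
            st->line_empty = 1;
        } else if (c != '\r') {
            if (st->line_empty) {
                st->line_empty = 0;
                if (st->line == 0)
                    st->reads++;      /* header line starts a record */
            }
            if (st->line == 1)
                st->bases++;          /* sequence line */
        }
    }
}

/* Read a whole stream in 1 MiB blocks; returns 0 on success, -1 on a read error. */
int fastq_count_stream(FILE *in, struct fastq_stats *st)
{
    static char buf[1 << 20];
    size_t n;

    fastq_init(st);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fastq_feed(st, buf, n);
    return ferror(in) ? -1 : 0;
}
```

As with the fasta version, a main() would call fastq_count_stream(stdin, &st) and print the two totals.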
Richard Finney
Old 07-09-2012, 01:20 PM   #6
Peter (Biopython etc)
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

If you don't want error checking, Heng Li has a very fast FASTA/FASTQ parser in C which could easily be used for the basic information you requested (read count and total bases):
maubp
