SEQanswers

Old 07-09-2012, 09:43 AM   #1
HenrivdGeest
Member
 
Location: Arnhem

Join Date: Feb 2012
Posts: 16
fastest way to 'parse' fasta or fastq?

I am looking for a really fast (high-performance computing / HPC) FASTA or FASTQ parsing program. I just want the simplest statistics imaginable:
- number of reads
- total number of bases
Other stuff like average length or A/T/C/G composition is nice, but not required.

I searched the software page, tried some packages, and wrote my own parsers, but they are all slow.
I am looking for something written in C, which I hope can be super fast.
I also tried this simple bash command:
time grep -v '^>' ./test.fa | wc -m -l

which is 'fast' (30 seconds to scan a 1 GB FASTA file, with the file already in memory).
My simple Python script takes over a minute to scan this file. I hope this can be done faster, or all in one script.

If you want to scan gigabytes of files, it would be nice to have a very fast parser.

Is anyone aware of such a program? Or, what is the fastest program you know of?
Old 07-09-2012, 10:24 AM   #2
aggp11
Member
 
Location: Wisconsin

Join Date: Jun 2011
Posts: 87

Hi,

You could try the FastQC package if you haven't already. It takes FASTQ/BAM/SAM files and reports most of the important statistics for an NGS run.
Old 07-09-2012, 10:32 AM   #3
husamia
Member
 
Location: cinci

Join Date: Apr 2010
Posts: 66

I suggest using native Linux tools such as grep, sed, and awk in a multithreaded environment; 64-bit builds may also be useful in applications where they are supported. There is also the option of using CUDA with a GPU to do super fast calculations.
Old 07-09-2012, 11:17 AM   #4
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

PRINSEQ and FastQC
Old 07-09-2012, 11:36 AM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

If you're up for modifying a couple of lines of code for your needs,
this should do the trick ...
Code:
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Reads FASTA from stdin; prints A/C/G/T/N counts, total bases and read count. */
int main(void)
{
    static const char bases[] = "ACGTN";
    unsigned long sum[5] = {0, 0, 0, 0, 0};
    unsigned long basecount = 0;
    unsigned long readcount = 0;
    static char s[65536];
    int line_start = 1; /* does the next chunk begin a new line? */
    int in_header = 0;  /* is the current line a '>' header? */
    size_t i, len;
    int j;

    /* fgets() bounds the read; gets() cannot and was removed in C11 */
    while (fgets(s, sizeof(s), stdin))
    {
        len = strlen(s);
        if (line_start)
        {
            in_header = (s[0] == '>');
            if (in_header) readcount++; /* one read per header, even if its sequence is wrapped */
        }
        if (!in_header) /* skip fasta entry header */
        {
            for (i = 0; i < len; i++)
            {
                switch (toupper((unsigned char)s[i]))
                {
                    case 'A': sum[0]++; basecount++; break;
                    case 'C': sum[1]++; basecount++; break;
                    case 'G': sum[2]++; basecount++; break;
                    case 'T': sum[3]++; basecount++; break;
                    case 'N': sum[4]++; basecount++; break;
                    default: break; /* newlines and other symbols are not counted */
                }
            }
        }
        /* a line longer than the buffer arrives in several chunks */
        line_start = (len > 0 && s[len - 1] == '\n');
    }
    for (j = 0; j < 5; j++)
        printf("%c %lu\n", bases[j], sum[j]);
    printf("bases = %lu\n", basecount);
    printf("reads = %lu\n", readcount);
    return 0;
}
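
The code above only handles FASTA. For FASTQ, a rough sketch along the same lines could look like the following. It assumes strict 4-line records (header, sequence, '+', quality) with no line wrapping, and the names fastq_stats and count_fastq are made up for this example. Records are found by line position rather than by a leading '@', because a quality line may itself start with '@'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type, not from the thread: totals for one FASTQ stream. */
struct fastq_stats {
    unsigned long reads;
    unsigned long bases;
};

/* Assumption: strict 4-line records (@header, sequence, +, quality), unwrapped. */
struct fastq_stats count_fastq(FILE *in)
{
    struct fastq_stats st = {0, 0};
    static char buf[65536];
    unsigned long line = 0; /* zero-based index of the current line */
    int line_start = 1;     /* does this chunk begin a new line? */

    assert(in != NULL);
    while (fgets(buf, sizeof buf, in)) {
        size_t len = strlen(buf);
        int complete = (len > 0 && buf[len - 1] == '\n');

        /* every fourth line, starting with the first, is a record header */
        if (line_start && line % 4 == 0 && buf[0] == '@')
            st.reads++;

        if (line % 4 == 1) { /* sequence line, possibly split into chunks */
            size_t n = len;
            while (n > 0 && (buf[n - 1] == '\n' || buf[n - 1] == '\r'))
                n--; /* do not count line endings as bases */
            st.bases += n;
        }

        line_start = complete;
        if (complete)
            line++;
    }
    return st;
}
```

To use it, call count_fastq(stdin) from your own main() and print the reads and bases fields with %lu. Files with wrapped sequence or quality lines would need a real parser instead.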
Old 07-09-2012, 12:20 PM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

If you don't want error checking, Heng Li has a very fast FASTA/FASTQ parser in C which could easily be used for the basic information you requested (read count and total bases):
http://lh3lh3.users.sourceforge.net/parsefastq.shtml
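
If only the two requested numbers are needed for FASTA (read count and total bases), a minimal sketch like the one below might be enough. To be clear, this is not Heng Li's parser; it is only an illustration that reads the input in large blocks with fread() instead of line by line, and the names fasta_counter, fasta_counter_init, fasta_counter_feed and fasta_count_stream are hypothetical. It does no error checking and counts every non-newline character outside a header as a base. Whether it is actually faster on a given machine would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical state carried across blocks; names are illustrative only. */
struct fasta_counter {
    unsigned long reads;
    unsigned long bases;
    int at_line_start; /* the next byte begins a new line */
    int in_header;     /* currently inside a '>' header line */
};

void fasta_counter_init(struct fasta_counter *c)
{
    assert(c != NULL);
    c->reads = 0;
    c->bases = 0;
    c->at_line_start = 1;
    c->in_header = 0;
}

/* Feed one block of bytes; call repeatedly with consecutive blocks. */
void fasta_counter_feed(struct fasta_counter *c, const char *buf, size_t n)
{
    size_t i;

    assert(c != NULL && (buf != NULL || n == 0));
    for (i = 0; i < n; i++) {
        char ch = buf[i];
        if (ch == '\n') {
            c->at_line_start = 1;
            c->in_header = 0;
        } else {
            if (c->at_line_start && ch == '>') {
                c->in_header = 1; /* header line starts a new read */
                c->reads++;
            } else if (!c->in_header && ch != '\r') {
                c->bases++;       /* any other sequence character is a base */
            }
            c->at_line_start = 0;
        }
    }
}

/* Count a whole stream, reading it in 1 MiB blocks. */
void fasta_count_stream(FILE *in, struct fasta_counter *c)
{
    static char buf[1 << 20];
    size_t n;

    assert(in != NULL);
    fasta_counter_init(c);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fasta_counter_feed(c, buf, n);
}
```

Calling fasta_count_stream(stdin, &c) from a main() and printing c.reads and c.bases with %lu gives the two totals. Because the state is kept between calls, a header or sequence line that straddles two blocks is still handled.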
All times are GMT -8.