#1 |
Member
Location: Arnhem Join Date: Feb 2012
Posts: 16
|
I am looking for a really fast (high-performance computing / HPC) FASTA or FASTQ parsing program. I just want the simplest statistics imaginable:
- number of reads
- total number of bases

Other stuff like average length or A/T/C/G composition is nice, but not required. I searched the software page, tried some packages, and wrote my own parsers, but they are all slow. I am looking for something written in C, which I hope can be super fast. I also tried this simple bash command: " time grep -v '^>' ./test.fa | wc -m -l", which is 'fast': 30 seconds to scan a 1 GB FASTA file (already in memory). My simple Python script takes over a minute to scan the same file. I hope this can be done faster, or all in one script. If you want to scan gigabytes of files, it would be nice to have a very fast parser.

Is anyone aware of such a program? Or what is the fastest program you know of?
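For reference, the two numbers requested here (reads and total bases) can be gathered in a single pass over large blocks of the file. The following is only a minimal sketch under stated assumptions, not a tuned HPC tool: the names `fasta_stats`, `fasta_stats_init`, `fasta_stats_feed` and `fasta_stats_stream` are made up for illustration, a read is taken to be one `>` header line, and every other non-newline character counts as a base. Whether it beats the grep pipeline will depend on the machine and on I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Running totals plus the parser state carried across chunks. */
typedef struct {
    unsigned long long reads;   /* number of '>' header lines */
    unsigned long long bases;   /* sequence characters, line breaks excluded */
    int at_line_start;          /* next byte begins a new line */
    int in_header;              /* currently inside a header line */
} fasta_stats;

void fasta_stats_init(fasta_stats *st)
{
    assert(st != NULL);
    st->reads = 0;
    st->bases = 0;
    st->at_line_start = 1;
    st->in_header = 0;
}

/* Scan one chunk; a chunk may end anywhere, even in the middle of a line. */
void fasta_stats_feed(fasta_stats *st, const char *buf, size_t len)
{
    size_t i;

    assert(st != NULL && buf != NULL);
    for (i = 0; i < len; i++) {
        char c = buf[i];
        if (c == '\n') {
            st->at_line_start = 1;
            st->in_header = 0;
            continue;
        }
        if (st->at_line_start) {
            st->at_line_start = 0;
            if (c == '>') {          /* a header opens a new read */
                st->in_header = 1;
                st->reads++;
            }
        }
        if (!st->in_header && c != '\r')
            st->bases++;             /* no validation: any character counts */
    }
}

/* Read a whole stream (for example stdin) in large blocks. */
void fasta_stats_stream(fasta_stats *st, FILE *fp)
{
    static char buf[1 << 20];        /* 1 MiB block; the size is an arbitrary choice */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        fasta_stats_feed(st, buf, n);
}
```

To turn this into a command-line tool, a small `main` would call `fasta_stats_init`, then `fasta_stats_stream` on `stdin`, and print the `reads` and `bases` fields. Keeping the state in a struct means a line that straddles two blocks is still handled correctly.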
#2 |
Member
Location: Wisconsin Join Date: Jun 2011
Posts: 87
|
Hi,
You could try the FASTQC package if you haven't already. It takes FASTQ/BAM/SAM files and reports most of the important statistics for an NGS run.
#3 |
Member
Location: cinci Join Date: Apr 2010
Posts: 66
|
I suggest using native Linux tools such as grep, sed, and awk in a multithreaded environment; a 64-bit build may also help in applications where it is supported. There is also the option of using CUDA on a GPU for very fast calculations.
#4 |
Senior Member
Location: Halifax, Nova Scotia Join Date: Mar 2009
Posts: 381
|
PRINSEQ and FASTQC
#5 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
|
If you're up for modifying a couple of lines of code for your needs, this should do the trick ...
Code:

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

unsigned long sum[5];            /* counts of A, C, G, T, N */
unsigned long basecount = 0;
unsigned long readcount = 0;
char s[512];

int main(void)
{
    const char *bases = "ACGTN";
    const char *p;
    int i, j;
    int in_header = 0;           /* current line is a fasta entry header */
    int line_start = 1;          /* next chunk begins a new line */

    /* fgets instead of gets: lines longer than the buffer arrive in chunks */
    while (fgets(s, sizeof(s), stdin)) {
        if (line_start) {
            in_header = (s[0] == '>');
            if (in_header) readcount++;   /* one header per read, so wrapped sequences count once */
        }
        line_start = (strchr(s, '\n') != NULL);
        if (in_header) continue;          /* skip fasta entry header */
        for (i = 0; s[i] != '\0'; i++) {
            p = strchr(bases, toupper((unsigned char)s[i]));
            if (p != NULL) {              /* anything other than A, C, G, T, N is ignored */
                sum[p - bases]++;
                basecount++;
            }
        }
    }
    for (j = 0; j < 5; j++)
        printf("%c %lu\n", bases[j], sum[j]);
    printf("bases = %lu\n", basecount);
    printf("reads = %lu\n", readcount);
    return 0;
}
```
#6 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
If you don't want error checking, Heng Li has a very fast FASTA/FASTQ parser in C which could easily be used for the basic information you requested (read count and total bases):
http://lh3lh3.users.sourceforge.net/parsefastq.shtml
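The linked parser is not reproduced here. As a rough illustration of how little work is left once error checking is dropped, the following is a minimal sketch for FASTQ that assumes strict four-line records (name, sequence, `+`, quality) with no wrapped sequences. The names `fastq_stats`, `fastq_stats_init` and `fastq_stats_feed` are hypothetical and are not part of Heng Li's code.

```c
#include <assert.h>
#include <stddef.h>

/* Totals plus the position inside the current four-line record. */
typedef struct {
    unsigned long long reads;
    unsigned long long bases;
    unsigned line_in_record;    /* 0 = @name, 1 = sequence, 2 = '+', 3 = quality */
    int at_line_start;          /* next byte begins a new line */
} fastq_stats;

void fastq_stats_init(fastq_stats *st)
{
    assert(st != NULL);
    st->reads = 0;
    st->bases = 0;
    st->line_in_record = 0;
    st->at_line_start = 1;
}

/* Scan one chunk of FASTQ text; chunks may be split at any byte. */
void fastq_stats_feed(fastq_stats *st, const char *buf, size_t len)
{
    size_t i;

    assert(st != NULL && buf != NULL);
    for (i = 0; i < len; i++) {
        char c = buf[i];
        if (c == '\n') {
            st->line_in_record = (st->line_in_record + 1) & 3u;
            st->at_line_start = 1;
            continue;
        }
        if (st->at_line_start) {
            st->at_line_start = 0;
            if (st->line_in_record == 0)
                st->reads++;         /* first byte of a name line */
        }
        if (st->line_in_record == 1 && c != '\r')
            st->bases++;             /* sequence line only, no validation */
    }
}
```

Counting by line position rather than by a leading `@` sidesteps the fact that a quality line may itself start with `@`. The price is that a stray blank line or a wrapped sequence silently shifts every following record, which is exactly the kind of problem a parser with error checking would catch.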
Tags |
fasta statistics contents |