Hi all,
In various RNA-seq analyses I find myself wanting to iterate over all reads in a large FASTQ file and gather statistics for each read. Typically I would do this in Python by building a hash table (dict), like so:
import HTSeq

fq = HTSeq.FastqReader(fqfile)
allreads = {}
for read in fq:
    allreads[read.name] = ([], [])  # something useful here...
However, this takes ages for a large FASTQ file. Are hash tables not the way to go for data sets this large? Assuming I want to keep track of all reads in the file, what would be a more efficient option?
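For context, here is a self-contained sketch of the pattern I mean, using a minimal plain-Python FASTQ parser instead of HTSeq so it runs without dependencies (the record names, the toy in-memory FASTQ, and the per-read stats are just placeholders for whatever one actually collects):

```python
import io

def fastq_records(handle):
    """Yield (name, seq, qual) tuples from a FASTQ file handle.

    Assumes well-formed 4-line FASTQ records (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

# Toy two-record FASTQ; a real run would use open(fqfile).
fq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGTA\n+\nI#II\n")

# One dict entry per read: here (read length, GC count) as example stats.
allreads = {}
for name, seq, qual in fastq_records(fq):
    allreads[name] = (len(seq), seq.count("G") + seq.count("C"))

# allreads -> {"r1": (4, 2), "r2": (4, 2)}
```

This stores an entry for every read, which is exactly what blows up memory on a large file, so I am wondering whether the per-read dict itself is the problem.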
cheers, henning