SEQanswers

Old 07-05-2011, 01:03 PM   #1
henning
Junior Member
 
Location: Gent

Join Date: May 2011
Posts: 7
Basic programming question: Very large hash tables

Hi all,

In various applications of RNA-seq analysis I find myself wanting to iterate over all reads in a large fastq file and gather statistics for each read. Typically, I would do this in Python by creating a hash table like

import HTSeq

fq = HTSeq.FastqReader(fqfile)
allreads = {}
for i, read in enumerate(fq):
    allreads[read.name] = ([], [])  # something useful here..

However, this takes ages for a large fastq file. Are hash tables not the way to go for such large data sets? Assuming I want to keep track of all reads in the file in an efficient manner, what would be a better option?

cheers, henning
Old 07-05-2011, 01:13 PM   #2
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

For very large datasets you really need to keep the hash on disk -- in other words, use a database. In Perl, the mechanism is "tied hashes", together with various modules supplying database support.

I'm not schooled enough in Python to know the equivalent, but it must be out there. One avenue to explore would be putting your data into a simple relational schema using SQLite or MySQL; Python must support these but again I'm not knowledgeable about the specific mechanism.
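From a quick look, the shelve module in Python's standard library seems to be the analogue of a tied hash: a dict-like object backed by a file on disk. A rough, untested sketch, with placeholder file names:

import shelve
import HTSeq

fqfile = "reads.fastq"               # placeholder file name
fq = HTSeq.FastqReader(fqfile)

# a dict-like object stored on disk instead of in memory
allreads = shelve.open("read_stats.db")
for read in fq:
    allreads[read.name] = ([], [])   # whatever per-read info you need
allreads.close()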
Old 07-05-2011, 05:42 PM   #3
Pseudonym
Research Engineer
 
Location: NICTA VRL, Melbourne, Australia

Join Date: Jun 2011
Posts: 12

This is just a suggestion, but a hash table may be the wrong answer.

Hash tables are key-value dictionaries. In your code snippet, the key is the read name, which means it's unique for every read.

Do you need to randomly access the data by read name later? If not, you don't need to hash on read name.
Old 07-06-2011, 02:27 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

FASTQ files can be big, and there might be no way to fit them into memory, no matter what data structure you use.

Hence, one of the main design goals of HTSeq was to make it unnecessary to ever keep all the short reads in a fastq or SAM file in memory. If you want to get statistics on your reads, you collect the statistics in the same loop in which you read the file, so that you only need to store your summary statistics and not the whole reads. See the "Tour" page in the HTSeq documentation for examples. And don't be afraid to read through the file multiple times; with modern OSs' cached disk I/O this is no performance bottleneck.
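For example, to get the average base quality per position without keeping any read in memory, something along these lines (a rough sketch in the spirit of the Tour; it assumes all reads have the same length and uses the per-base Phred scores in read.qual):

import numpy
import HTSeq

fq = HTSeq.FastqReader("reads.fastq")    # placeholder file name

nreads = 0
qualsum = None
for read in fq:
    if qualsum is None:
        # one accumulator per position; assumes a fixed read length
        qualsum = numpy.zeros(len(read.seq), numpy.int64)
    qualsum += read.qual                 # array of per-base Phred scores
    nreads += 1

# only this summary statistic stays in memory, not the reads
print(qualsum / float(nreads))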

You assume you need random access by name, i.e., that you want to be able to get a specific read when you know its name. First: I wonder what kind of programming task might require this. Maybe explain what you want to do. Second: If you really need it, you should still leave the data on disk and fetch a read on demand, unless you have infinite memory to keep up with a HiSeq's 100 million reads. You need some lightweight database, such as SQLite, as krobison suggested. Bindings to SQLite are in Python's standard library (the 'sqlite3' module). But I very much doubt you need random access via read name.
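If you really do need random access by name, a minimal sqlite3 sketch could look like this (table layout, column names and file names are just placeholders):

import sqlite3
import HTSeq

conn = sqlite3.connect("reads.db")       # database file on disk, not in memory
conn.execute("CREATE TABLE IF NOT EXISTS reads"
             " (name TEXT PRIMARY KEY, aligned_a INTEGER, aligned_b INTEGER)")

# fill the table while streaming through the fastq file
conn.executemany(
    "INSERT OR IGNORE INTO reads VALUES (?, 0, 0)",
    ((read.name,) for read in HTSeq.FastqReader("reads.fastq")))
conn.commit()

# later: fetch a single read's record by name, on demand
some_read_name = "..."                   # placeholder: any read identifier
row = conn.execute("SELECT * FROM reads WHERE name = ?",
                   (some_read_name,)).fetchone()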

(By the way: for random access to aligned data in a SAM file via position, HTSeq exposes samtools' indexing functionality.)

Last edited by Simon Anders; 07-06-2011 at 02:29 AM.
Old 07-06-2011, 05:35 AM   #5
henning
Junior Member
 
Location: Gent

Join Date: May 2011
Posts: 7

Thanks for the very helpful answers.

To provide some more background: I am in the process of trying out different ways to analyse RNA-seq data, including different tools for mapping the reads to my reference genome. In this process, I would like to get statistics showing which read is successfully aligned by which tool, so that I can understand their relative merits.

My code snippet was part of this attempt: I create a hash table of the read identifiers and then assign different statistics to these identifiers, i.e., I do not store the reads themselves, just information about them. E.g.:

1. Build table of all available reads
2. Go through a SAM file from a given read mapper, collect statistics and note them in my table.

Hence, I think I probably do need random access(?). The SQLite option appears to be a good suggestion and I will try it out, although I am afraid that lookups in my huge table may be too slow..
Old 07-06-2011, 08:40 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

An alternative way would be this: sort all your SAM files by read name, then open them all simultaneously and always take one line from each file. Then you have (if you use HTSeq) one SAM_Alignment object from each file, all of them describing the same read. (This needs some extra coding in case a read is missing from one of the SAM files, of course, but that should be easy.)

Now do your statistics. Let's say you want to know whether there are more reads that aligner A can align but aligner B cannot, or vice versa. Then you have four counters (number of reads that cannot be aligned by either, that only A can align, that only B can align, that both can align), and you increment the appropriate one. Once you have a hypothesis, e.g. that aligner A has problems with reads with bad quality at the end, you may subdivide each counter into several ones, for low, middle and good quality at the beginning and at the end, or something like that.
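As a rough sketch of the four-counter idea (assuming both SAM files are sorted by read name, report unmapped reads, and contain exactly one record per read; aligner and file names are placeholders):

import HTSeq

# both files sorted by read name, one record per read
sam_a = HTSeq.SAM_Reader("aligner_a.byname.sam")
sam_b = HTSeq.SAM_Reader("aligner_b.byname.sam")

counts = {"neither": 0, "only_A": 0, "only_B": 0, "both": 0}
it_b = iter(sam_b)
for aln_a in sam_a:
    aln_b = next(it_b)
    assert aln_a.read.name == aln_b.read.name    # same read in both files
    if aln_a.aligned and aln_b.aligned:
        counts["both"] += 1
    elif aln_a.aligned:
        counts["only_A"] += 1
    elif aln_b.aligned:
        counts["only_B"] += 1
    else:
        counts["neither"] += 1

print(counts)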

What I mean is this: you want to build a big summary table and then do exploratory statistics on it. But your summary table will be nearly as large as your initial data, and hence it is more efficient to do the exploration by running through all the SAM files again for each hypothesis you want to explore, rather than making an intermediate summary table as one normally would.
Old 07-06-2011, 04:46 PM   #7
Pseudonym
Research Engineer
 
Location: NICTA VRL, Melbourne, Australia

Join Date: Jun 2011
Posts: 12

Yes, I like Simon's solution. Sort and merge are underused algorithms in bioinformatics.
Old 07-08-2011, 07:17 AM   #8
henning
Junior Member
 
Location: Gent

Join Date: May 2011
Posts: 7

Thanks for all the tips. Although it is less flexible, initial experiments tell me that passing through the files multiple times is the way to go.

cheers
Tags
htseq, python, rnaseq
