Seqanswers Leaderboard Ad

**krobison** · 07-05-2011, 12:13 PM

For very large datasets you really need to cache the hash on disk -- in other words a database. In Perl, the mechanism is "tied hashes" and various modules supplying database support.

I'm not schooled enough in Python to know the equivalent, but it must be out there. One avenue to explore would be putting your data into a simple relational schema using SQLite or MySQL; Python must support these but again I'm not knowledgeable about the specific mechanism.

**Pseudonym** · 07-05-2011, 04:42 PM

This is just a suggestion, but a hash table may be the wrong answer.

Hash tables are key-value dictionaries. In your code snippet, the key is the read name, which means it's unique for every read.

Do you need to randomly access the data by read name later? If not, you don't need to hash on read name.

**Simon Anders** · 07-06-2011, 01:27 AM

FASTQ files can be big, and there might be no way to fit them into memory, no matter what data structure you use.

Hence, one of the main design goals of HTSeq was to make it unnecessary to ever keep all the short reads in a fastq or sam file in memory. If you want to get statistics on your reads, you collect the statistics in the same loop in which your read the file, so that you only need to store your summery statistics and not the whole read. See the "Tour" page in the HTSeq documentation for examples. And don't be afraid to read through the file multiple times; with modern OSs' cached disk I/O this is no performance bottleneck.

You assume you need random-access by name, i.e., that you want to be able to get a specific read when you know its name. First: I wonder what kind of programming task might require this. Maybe explain what you want to do. Second: If you really need it, you still should leave the data on disk and fetch a readon demand, unless you have infinite memory to keep up with a HiSeq's 100 million reads. You need some lightweight data base, such as SQLite, as krobinson suggested. Bindings to sqlite are in Python's standard library ('sqlite3' module). But I very much doubt you need random access via read name.

(By the way: For random access to aligned data in a sam file via position, HTSeq exposes the samtools' indexing function.)

**henning** · 07-06-2011, 04:35 AM

Thanks for the very helpful answers.

To provide some more background, I am in the process of trying out different ways to analyse RNAseq including trying out different tools for mapping the reads to my reference genome. In this process, I would like to get statistics showing which read is successfully aligned by which tool to enable understanding of their merits.

My code snippet was part of this attempt where I create a hash table of the identifiers and then assign different statistics to these identifiers, i.e., I do not store the read itself, just info about them. E.g.

1. Build table of all available reads
2. Go through a SAM file from a given read mapper, collect statistics and note them in my table.

Hence, I think I probably do need random access(?). The sqlite option appear a good suggestion and I will try this out, although I am afraid the lookup in my huge table may be too slow..

**Simon Anders** · 07-06-2011, 07:40 AM

An alternative way would be this: Sort all your SAM files by read name, the open them all simultaneously and always take one line from each file. Then, you have (if you use HTSeq) from each file a SAM_Alignment object, all these objects describing the same read. (This needs some extra coding in case a read is missing in one of the SAM files, of course, but that should be easy.)

Now, do your statistic. Let's say you want to know whether there are more reads that aligner A can align but aligner B cannot or vice versa, then you have four counters (number of reads than cannot be aligned by either, that only A can align, that only B can align, that both can align) and you increment the appropriate one. Once you have a hypothesis, like that aligner A has problems with reads with bad quality at the end, you may subdivide each counter into several one, for low middle and good quality, at the beginninh, and at the end, or something like that.

What I mean is this: You want to make a big summary table and the do exploratory statistics in it. But your summary table will be nearly as large as your initial data, and hence, it is more efficient to to the exploration by rerunning through all the SAM file again for each hypothesis you want to explore rather than making an intermediate summary table as one normally would.

**Pseudonym** · 07-06-2011, 03:46 PM

Yes, I like Simon's solution. Sort and merge are underused algorithms in bioinformatics.

**henning** · 07-08-2011, 06:17 AM

Thanks for all the tips, although less flexible, initial experiments tell me that passing many times through the files is the way to go.

cheers

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 56 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Basic programming question: Very large hash tables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News