Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Basic programming question: Very large hash tables

    Hi all,

    in various applications of RNA-seq analysis I find myself wanting to traverse over all reads in a large fastq file and gather statistics for each read. Typically, I would do this in python by creating a hash table like

    import HTSeq

    fq = HTSeq.FastqReader(fqfile)
    allreads = {}
    for i, read in enumerate(fq) :
    allreads[read.name] = ([],[]) # something useful here..

    however, this takes ages for a large fq file. Are hash tables not the way to go for so large data sets? Assuming I want to keep track of all reads in the file in an efficient manner, what would be a better option?

    cheers, henning

  • #2
    For very large datasets you really need to cache the hash on disk -- in other words a database. In Perl, the mechanism is "tied hashes" and various modules supplying database support.

    I'm not schooled enough in Python to know the equivalent, but it must be out there. One avenue to explore would be putting your data into a simple relational schema using SQLite or MySQL; Python must support these but again I'm not knowledgeable about the specific mechanism.

    Comment


    • #3
      This is just a suggestion, but a hash table may be the wrong answer.

      Hash tables are key-value dictionaries. In your code snippet, the key is the read name, which means it's unique for every read.

      Do you need to randomly access the data by read name later? If not, you don't need to hash on read name.
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

      Comment


      • #4
        FASTQ files can be big, and there might be no way to fit them into memory, no matter what data structure you use.

        Hence, one of the main design goals of HTSeq was to make it unnecessary to ever keep all the short reads in a fastq or sam file in memory. If you want to get statistics on your reads, you collect the statistics in the same loop in which your read the file, so that you only need to store your summery statistics and not the whole read. See the "Tour" page in the HTSeq documentation for examples. And don't be afraid to read through the file multiple times; with modern OSs' cached disk I/O this is no performance bottleneck.

        You assume you need random-access by name, i.e., that you want to be able to get a specific read when you know its name. First: I wonder what kind of programming task might require this. Maybe explain what you want to do. Second: If you really need it, you still should leave the data on disk and fetch a readon demand, unless you have infinite memory to keep up with a HiSeq's 100 million reads. You need some lightweight data base, such as SQLite, as krobinson suggested. Bindings to sqlite are in Python's standard library ('sqlite3' module). But I very much doubt you need random access via read name.

        (By the way: For random access to aligned data in a sam file via position, HTSeq exposes the samtools' indexing function.)
        Last edited by Simon Anders; 07-06-2011, 01:29 AM.

        Comment


        • #5
          Thanks for the very helpful answers.

          To provide some more background, I am in the process of trying out different ways to analyse RNAseq including trying out different tools for mapping the reads to my reference genome. In this process, I would like to get statistics showing which read is successfully aligned by which tool to enable understanding of their merits.

          My code snippet was part of this attempt where I create a hash table of the identifiers and then assign different statistics to these identifiers, i.e., I do not store the read itself, just info about them. E.g.

          1. Build table of all available reads
          2. Go through a SAM file from a given read mapper, collect statistics and note them in my table.

          Hence, I think I probably do need random access(?). The sqlite option appear a good suggestion and I will try this out, although I am afraid the lookup in my huge table may be too slow..

          Comment


          • #6
            An alternative way would be this: Sort all your SAM files by read name, the open them all simultaneously and always take one line from each file. Then, you have (if you use HTSeq) from each file a SAM_Alignment object, all these objects describing the same read. (This needs some extra coding in case a read is missing in one of the SAM files, of course, but that should be easy.)

            Now, do your statistic. Let's say you want to know whether there are more reads that aligner A can align but aligner B cannot or vice versa, then you have four counters (number of reads than cannot be aligned by either, that only A can align, that only B can align, that both can align) and you increment the appropriate one. Once you have a hypothesis, like that aligner A has problems with reads with bad quality at the end, you may subdivide each counter into several one, for low middle and good quality, at the beginninh, and at the end, or something like that.

            What I mean is this: You want to make a big summary table and the do exploratory statistics in it. But your summary table will be nearly as large as your initial data, and hence, it is more efficient to to the exploration by rerunning through all the SAM file again for each hypothesis you want to explore rather than making an intermediate summary table as one normally would.

            Comment


            • #7
              Yes, I like Simon's solution. Sort and merge are underused algorithms in bioinformatics.
              sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

              Comment


              • #8
                Thanks for all the tips, although less flexible, initial experiments tell me that passing many times through the files is the way to go.

                cheers

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                59 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                57 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                52 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                56 views
                0 likes
                Last Post seqadmin  
                Working...
                X