  • What are today's computational bottlenecks?

    I have some questions about practical bioinformatics bottlenecks... which common or important bioinformatic tasks are the slowest or most limiting?
    My background is in text compression, databases, and parallel processing, but I also know the basic bioinformatic algorithms, since they're so closely related.


    But what I don't know is what, in practice, the bottleneck is for most bioinformatics users. I'm hoping to focus some new research (using cheap PC graphics cards) on bioinformatic algorithms, and I want to work on speeding up the tasks that are truly a problem. Many applications of this kind can be sped up 10-100 times, so it's worthwhile to try to get some of them working in practice. But which ones should be sped up first? Probably the tasks that are both common AND slow.

    I'd love it if you'd share your opinions, experiences, even HOPES for what kind of tools, speedups, or new abilities you'd like. Again, I have a good technical background, but no real sense of which tasks still make actual biologists grind their teeth in frustration.

    Some examples:
    • Is BLAST alignment speed an issue? Would you yell with joy if a new tool gave identical results and was twice as fast?
    • Are you happy with BLAST but just want to do much bigger alignments? Something like "here's my 1M-nucleotide sequence, give me the top 1000 local alignments" - and get an answer in 10 seconds?
    • Or maybe database searching? You want to say "here's 10,000 nucleotides, please search every genome in GenBank and give me the best hits from everything - in 2 seconds, like a Google search!"
    • Or de novo assembly? Is it a huge problem to take shotgun fragments and burn a zillion CPU hours assembling a genome?
    • Or maybe you always run a high-quality Smith-Waterman alignment as a double check, and it takes a week to cook, and that really becomes a big issue?


    Those are just examples off the top of my head, and I don't know whether those abilities are actually desperately desired. And of course there are likely tasks I haven't even heard of that are a limitation... please teach me.

    Again, what I'm really trying to understand is which computational tasks are common but TOO SLOW. Or too size-limited (maybe BLAST is fast for you, but only because you stick to short sequences, since the longer ones you'd rather use are too slow).

    I appreciate any suggestions, stories, or pleas... and links to other forums that might help me learn. Any feedback is welcome.
    I'll also be happy to discuss which algorithms modern hardware can help with. You may be surprised.

    Thanks!

  • #2

    Originally posted by GerryB View Post
    I'd love it if you'd share your opinions, experiences, even HOPES for what kind of tools, speedups, or new abilities you'd like.
    Sequence similarity searching is always something we wish were faster. Query sets and databases keep getting larger. The topic that is often neglected is PROTEIN-PROTEIN similarity search - its simpler DNA-DNA counterpart gets most of the attention.

    DNA-DNA is simpler, has more scope for indexing thanks to its smaller alphabet, and is more redundant/compressible. PROTEIN, on the other hand, has a larger alphabet and is close to incompressible, so many of the speed-up tricks don't work. The whole BLOSUM/PAM similarity-matrix business also means PROTEIN searches carry a larger constant factor in their time complexity.
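    To make that constant-factor point concrete, here is a toy Smith-Waterman sketch (illustrative only, not an optimized tool): DNA scoring can be a cheap match/mismatch test, while protein scoring pays for a substitution-matrix lookup in every cell. The matrix values below are made up, NOT real BLOSUM entries.

[CODE]
def smith_waterman(a, b, score, gap=-4):
    """O(len(a) * len(b)) local alignment; `score` is a per-residue-pair callable."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # match/substitute
                H[i - 1][j] + gap,                            # gap in b
                H[i][j - 1] + gap,                            # gap in a
            )
            best = max(best, H[i][j])
    return best

# DNA: a cheap equality test is enough.
dna_score = lambda x, y: 2 if x == y else -1

# Protein: every cell pays for a matrix lookup (toy values, NOT real BLOSUM).
toy_matrix = {("A", "A"): 4, ("A", "R"): -1, ("R", "A"): -1, ("R", "R"): 5}
prot_score = lambda x, y: toy_matrix.get((x, y), -2)

print(smith_waterman("ACGT", "ACGGT", dna_score))   # DNA-style scoring
print(smith_waterman("ARRA", "ARA", prot_score))    # protein-style scoring
[/CODE]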

    I'm guessing you are looking at GPU implementations. You should look at HMMER. The current alpha version is poised to support SMP (multiple threads), MPI (clusters), SIMD (vector registers), GPU (CUDA?), and CellSPU (Sony PS3 etc.). HMMER is a form of PROT-PROT searching using profile HMMs, and it can never be fast enough :-)

    Torst



    • #3
      Creating an auto-tuning Bowtie (or any other good genomics algorithm) may be a good idea, since many biologists only have cheap PCs to crunch their next-generation sequencing data. GPGPU is also a good idea - see MUMmerGPU and the "SOAP on GPU" project at UCSD (http://iacs5.ucsd.edu/~tpremchi/index.html). With advances in next-generation sequencing, or even the generation after that, we will soon be overwhelmed by the data deluge.



      • #4
        Originally posted by Torst View Post
        Sequence similarity searching is always something we wish were faster. Query sets and databases keep getting larger. The topic that is often neglected is PROTEIN-PROTEIN similarity search - its simpler DNA-DNA counterpart gets most of the attention.

        DNA-DNA is simpler, has more scope for indexing thanks to its smaller alphabet, and is more redundant/compressible. PROTEIN, on the other hand, has a larger alphabet and is close to incompressible, so many of the speed-up tricks don't work. The whole BLOSUM/PAM similarity-matrix business also means PROTEIN searches carry a larger constant factor in their time complexity.

        This is exactly the kind of information I'm hoping to learn.
        When you say that protein-protein searches are neglected and need to be sped up, do you mean BLASTP-like searches? Or HMM homology search? Your mention of HMMER is a great hint as well (especially since I've been reading their book these past two weeks!).

        What's the specific tool or search you'd like to see? Again, you can go a little wild and blue-sky dream... My background is huge terabyte databases with sub-second fuzzy searches, and HMM-style alignment is a very complex form of "fuzzy", but it's still related. Are you saying you'd just love an hmmsearch that ran 50 times faster, and you'd scream with joy and email me your firstborn child? Or do you mean that P-P alignment and motif finding need more research, better databases, and "more love", like the DNA boys get?

        What's interesting to me about protein (again, biased by my background) is that the larger alphabet and larger scoring matrices don't seem like they'd require a big algorithmic change. Even the DNA tools I've been experimenting with are generic and handle alphabets of up to 255 symbols.
        Now, I realize I'm still very ignorant of some of the practical aspects, but that's also why I'm really enjoying these answers - they're guiding my own experiments (and giving me even more papers to study).



        • #5
          Originally posted by xuying View Post
          Creating an auto-tuning Bowtie (or any other good genomics algorithm) may be a good idea, since many biologists only have cheap PCs to crunch their next-generation sequencing data. GPGPU is also a good idea - see MUMmerGPU and the "SOAP on GPU" project at UCSD (http://iacs5.ucsd.edu/~tpremchi/index.html). With advances in next-generation sequencing, or even the generation after that, we will soon be overwhelmed by the data deluge.
          Now this is especially interesting, since Bowtie's alignment mapping is extremely similar to some of the database-throughput problems I've dealt with before.

          Bowtie's throughput can exceed 30 million reads an hour with only 1 GB of RAM. Is this a big limitation? (Again, this is a serious question.)
          If you have a 454 machine sitting next to you spitting out sequences, how fast does it deliver them? It seems it can produce about 100,000 reads an hour.
          So my dumb question (again, laugh at my practical ignorance) is: why would you need Bowtie-like short-read alignment to be much faster?
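          Here is the back-of-envelope arithmetic behind that question, using only the figures quoted in this thread (rough, era-specific numbers, not benchmarks):

[CODE]
# Figures quoted in this thread; rough numbers, not benchmarks.
bowtie_reads_per_hour = 30_000_000   # claimed Bowtie throughput
sequencer_reads_per_hour = 100_000   # rough 454 output rate

headroom = bowtie_reads_per_hour / sequencer_reads_per_hour
print(f"the aligner can consume ~{headroom:.0f}x the sequencer's output rate")
[/CODE]

          On those numbers the aligner outruns the machine by a factor of ~300, which is exactly why I'd expect the sequencing itself to be the bottleneck.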

          Maybe you like running an alignment a few hundred different ways with different fitting criteria, to see whether your matches are sensitive to the settings?

          So my question is how much you use Bowtie/SOAP-like alignment programs, and how their speed affects your workflow. If your lab has 20 PCs running these alignments continuously, I'm really curious why! It seems to me that the sequencing itself would be the bottleneck, and a desktop PC could keep up with the data pretty easily.
          I'm sure I'm wrong about that (or you wouldn't have brought it up), so I'm hoping to learn where my misconceptions are.

          Thanks!



          • #6
            Yes, sometimes we want to shuffle reads many times and map them back to the genome to calculate the false-positive rate of mapping. I am also curious how fast different kinds of algorithms could be if you implemented them on different platforms (GPU or Cell). Mapping is just the first stage of the analysis pipeline. As reads get longer, we will need faster aligners that allow more mismatches. Maybe in the near future we will switch back to BLAST-like tools. But given the sheer number of reads, a new, faster, Bowtie-like, memory-efficient program might be more suitable for us.
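            A minimal sketch of that shuffle-and-remap control: permuting each read's bases keeps its length and composition but destroys the real signal, so any shuffled read that still maps is almost surely a spurious hit. The map_read function is a placeholder for a call to whatever aligner is in use.

[CODE]
import random

def shuffle_read(read, rng=random.Random(0)):
    """Permute bases: keeps length and composition, destroys real signal."""
    bases = list(read)
    rng.shuffle(bases)
    return "".join(bases)

def estimate_false_positive_rate(reads, map_read):
    """`map_read(seq) -> bool` stands in for the real aligner (e.g. a
    subprocess wrapper around bowtie). Shuffled reads that still map are
    (almost surely) spurious, so their fraction estimates the FP rate."""
    hits = sum(map_read(shuffle_read(r)) for r in reads)
    return hits / len(reads)
[/CODE]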



            • #7
              Originally posted by xuying View Post
              Yes, sometimes we want to shuffle reads many times and map them back to the genome to calculate the false-positive rate of mapping. I am also curious how fast different kinds of algorithms could be if you implemented them on different platforms (GPU or Cell). Mapping is just the first stage of the analysis pipeline. As reads get longer, we will need faster aligners that allow more mismatches. Maybe in the near future we will switch back to BLAST-like tools. But given the sheer number of reads, a new, faster, Bowtie-like, memory-efficient program might be more suitable for us.
              This is again great information for me - thanks!
              I'm really impressed by Bowtie... they have a smooth implementation and exactly the right clever algorithm for exact and near-exact matches.

              Again, I'm surprised that Bowtie wouldn't be fast enough for anyone! But maybe that goes back to your desire to detect false positives. Perhaps what you'd rather have is Bowtie-like matching with better statistics: confidence levels for which reads are properly matched, and/or how accurate your final assembled genome is.
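              For instance (just a sketch of the kind of statistic I mean, not how Bowtie actually works): compare the best alignment score against the runner-up and turn the gap into a Phred-style mapping quality. The probability model below is a toy assumption, not any real tool's formula.

[CODE]
import math

def mapping_quality(best_score, second_best_score, scale=1.0):
    """Phred-style confidence from the gap between the two best hits.
    The exponential model here is a toy assumption for illustration,
    not any specific tool's formula."""
    gap = best_score - second_best_score
    p_wrong = math.exp(-scale * gap) / (1.0 + math.exp(-scale * gap))
    return -10.0 * math.log10(max(p_wrong, 1e-30))

print(mapping_quality(60, 58))  # ambiguous placement -> low confidence
print(mapping_quality(60, 20))  # clearly unique hit  -> high confidence
[/CODE]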

              Just brainstorming here, but I could imagine doing sequence assembly "in real time" as the sequencer runs... you might even decide to run the sequencer for a shorter or longer time as the assembly reports the "realtime" quality of the final assembled sequence.

              With respect to mismatch support: are most errors in new-generation sequencers just single-letter errors? Do indels ever occur? Or are the ends of the reads always the problem, which is why they're often trimmed? (Maybe that's why you like running matches several times - to see whether your trimming affected the final assembly?)

              Bowtie already handles a few mismatches, but would it be useful to allow a much more generous error rate, like 10 or 20 errors per read? Or are you ideally hoping for a BLAST-like alignment with gaps and multiple errors - or is that overkill for these short reads? (I had the impression you never get gaps with the short-read techniques... am I wrong?)



              • #8
                You are right - statistical support in aligning short reads would be good, but we don't have many choices at the moment. Different sequencing technologies have different error models: some introduce indels, while others produce more single-nucleotide mismatches, depending on how they detect signals. People trim reads because of read quality, and also for particular applications (for smRNAs, the 3' adaptor should be trimmed). We deal with 36 bp reads without considering gaps (for Solexa data that's fine, I think). But even for longer reads, we won't go crazy with dozens of mismatches. Contiguous gaps may be more important - and that involves much more computation, right?
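                The 3' adaptor trimming mentioned above can be as simple as the sketch below: look for the adaptor, or a prefix of it hanging off the read's end. Exact matching only - real trimmers also tolerate sequencing errors in the adaptor - and the adaptor sequence shown is a placeholder, not a real kit's.

[CODE]
def trim_3prime_adaptor(read, adaptor, min_overlap=5):
    """Remove a 3' adaptor from a small-RNA read. Handles a full adaptor
    hit anywhere in the read, or a prefix of the adaptor at the 3' end
    (for reads that run only partway into it). Exact matching only."""
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    for k in range(len(adaptor) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read

# Placeholder adaptor sequence, not a real kit's.
print(trim_3prime_adaptor("ACGTACGTTCGTATGC", "TCGTATGC"))  # -> ACGTACGT
[/CODE]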



                • #9
                  Yes, gaps take more searching, but they're not too much harder than single errors.
                  What gets trickier is a higher error rate (like 5%)... there are great algorithms for handling that, but I don't think such error rates are common in sequencing, are they?
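                  To show what I mean by "more searching but not much harder": mismatch-only comparison is a single linear pass, while allowing gaps needs a dynamic-programming table - O(n*m) in general, though banding it to width ~2k+1 gets back near-linear time when at most k errors are allowed. A minimal sketch:

[CODE]
def hamming(a, b):
    """Mismatch-only: one linear pass; requires equal lengths (no gaps)."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Mismatches AND gaps: full DP over a rolling row, O(len(a) * len(b)).
    Banding to width ~2k+1 recovers near-linear time when at most k
    errors are allowed."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                       # gap in b
                d[j - 1] + 1,                   # gap in a
                prev + (a[i - 1] != b[j - 1]),  # match / mismatch
            )
    return d[-1]

print(hamming("ACGTACGT", "ACGTTCGT"))       # 1 mismatch
print(edit_distance("ACGTACGT", "ACGACGT"))  # 1 gap
[/CODE]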

                  About gaps: are short-sequence gaps always near each other? Do you see 20 bp of match, a gap of 13, then the next 40 bp - that kind of behavior? Or are they totally unrelated, because different DNA loops rejoined, so you might have a 20 bp match, a gap of 376271 all the way over to a different chromosome, then 40 more bp of match?

                  In the "wouldn't it be nice" design category, what you'd like to see is a high-throughput short sequence aligner. One that probably matches against a base framework of a known genome (but probably you'd also like the ability for raw assembly too, with no starting framework?). You'd like to allow errors.. what rates are reasonable? 1%? 5%? You'd like to have gap support. You'd like to use untrimmed sequences, but you want the aligner to "know" the ends may be bad and not weight them if they're not consistent. You'd like some statistical feedback over how good the final assembly is.. maybe also about how good any particular sequence really matched is.

                  What else? This is all so interesting to me...

                  If you did have such a tool, how would it change your daily science? Would it just be "nice to have", or would it be "wow, we can do so much more now - we need to buy more sequencers because it's so useful!"?

                  Stupid side question: how much do modern sequencers even cost, say if I wanted my own 454 machine? They spit out 100,000 reads an hour - does that mean I could sample my own genome at 10X coverage in about a week? (Hmm, obviously not, or the genomics X PRIZE would have been won long ago.)
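                  Working out my own arithmetic (the read length below is an assumption of mine, not a quoted spec, so treat the result as order-of-magnitude only):

[CODE]
# Reads-per-hour is the figure quoted above; read length is an assumed
# ballpark, so treat the result as order-of-magnitude only.
reads_per_hour = 100_000
read_length_bp = 250            # assumption, not a quoted spec
genome_size_bp = 3_000_000_000  # human genome, roughly
coverage = 10

hours = coverage * genome_size_bp / (reads_per_hour * read_length_bp)
print(f"~{hours:.0f} hours (~{hours / 24:.0f} days) of continuous sequencing")
# -> ~1200 hours: weeks-to-months, not one week, even before library prep,
#    run setup, and assembly.
[/CODE]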
