
  • #16
    Originally posted by colindaven
    This might be useful in metagenomics (as kopi-o mentioned) if it provided accurate output in the standard SAM alignment format.

    Certainly I have problems fitting enough microbial genomes into the 4 GB reference sequence limit of Bowtie and BWA, given that more than 6-7 GB are now publicly available.

    The most recent bwa was supposed to remove the 4GB limit.



    • #17
      Lots of good advice!

      Thanks to all for the thoughtful comments. Replies follow. I should make one significant correction, though. The times I quoted for finding a multi-kilobyte target in C. elegans and the human genome were egregiously in error: I was off by two decimal places. The actual times were 1/5 second for C. elegans and 6 seconds for the human genome, not 600 seconds. This suggests that, for this particular kind of search, it is a good order of magnitude faster than BWA-SW rather than a few times slower. (Whether that matters is another question. You still have to apply a real alignment algorithm to the resulting match.)

      Colindaven and kopi-o: metagenomics sounds very much worth investigating. I'll certainly look into this.

      nilshomer's and rskr's comments on memory and caching require a detailed response. Algorithms that use large amounts of memory interact with the OS and hardware in non-obvious ways. As nilshomer pointed out, hierarchical and random-access data structures tend to have low locality of reference, defeating much of the magic of the TLB and the various hardware caches. Although with this algorithm the entire data structure for, say, the human genome would easily fit in the hardware caches, that turns out to be largely irrelevant. With sequential access, hardware working off to the side generally populates the caches before the data is needed. Possibly equally important, the TLB also experiences fewer faults because the input data is in order, allowing the circuitry that anticipates your next access to guess correctly more often. Moreover, sequentially accessed data almost always has a very low page fault rate, as every byte on each page is fully used. So with fully sequential access, the average time cost of reading a byte of memory can be orders of magnitude lower than in an algorithm using hashing or hierarchical data structures.
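A rough way to see the access-pattern effect described above is to read the same bytes once in order and once in shuffled order. The sketch below is only an illustration, not the algorithm discussed in this thread; the function names (`sum_in_order`, `time_access`, `compare`) and the default size are made up for the example. In CPython the interpreter overhead hides much of the gap, so treat the timings as indicative at best; the same experiment in C would likely show a larger difference.

```python
import random
import time
from array import array


def sum_in_order(data, order):
    """Sum data[i] for each index i in order; the order sets the access pattern."""
    total = 0
    for i in order:
        total += data[i]
    return total


def time_access(data, order):
    """Return (sum, elapsed seconds) for one pass over data in the given order."""
    start = time.perf_counter()
    total = sum_in_order(data, order)
    return total, time.perf_counter() - start


def compare(n=1_000_000, seed=0):
    """Time a sequential pass against a shuffled pass over the same n bytes."""
    rng = random.Random(seed)
    data = array("B", (rng.randrange(256) for _ in range(n)))

    # Same set of indices, two visiting orders.
    sequential = array("l", range(n))
    order = list(range(n))
    rng.shuffle(order)
    shuffled = array("l", order)

    seq_sum, seq_seconds = time_access(data, sequential)
    rnd_sum, rnd_seconds = time_access(data, shuffled)
    return {
        "sequential_sum": seq_sum,
        "shuffled_sum": rnd_sum,
        "sequential_seconds": seq_seconds,
        "shuffled_seconds": rnd_seconds,
    }
```

Calling `compare()` and printing the two `*_seconds` values gives a quick comparison on your own machine. Both passes touch exactly the same bytes, so the two sums must agree; only the order of access differs.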

      nilshomer's comments about storing only diffs of highly similar genomes can't be gainsaid. I was thinking of the case where you were looking, say, for matches among distant taxa, which may be a much less common problem.

      Anyway, thanks again for the considerable time and thought you folks have put into answering these questions. Maybe it would make sense to take this input and try to come up with an interesting demo: download a couple of dozen genomes and demonstrate finding some obscure gene in a few seconds.

      Meanwhile, if anybody comes across any useful applications, either for searching entire genomes, or for rapidly estimating the Levenshtein distance of fragments in the kilobyte to megabyte range, I'd love to hear about it. It's gotta be good for something! I can also be reached as coatespt at g mail.
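For readers who want a baseline to compare an estimate against, here is the textbook dynamic-programming computation of the exact Levenshtein distance. It is not the fast estimation method mentioned in this post, whose details are not given in the thread; it is the standard O(len(a) * len(b)) algorithm, kept to two rows of memory. The function name `levenshtein` is just a label for this sketch. At the megabyte scale this exact version would be far too slow in pure Python, which is presumably why a rapid estimate is interesting.

```python
def levenshtein(a, b):
    """Exact edit distance between sequences a and b (insert, delete, substitute)."""
    # Keep the shorter sequence in the inner loop so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.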



      • #18
        Originally posted by rskr
        The most recent bwa was supposed to remove the 4GB limit.
        It does handle 4GB or greater!



        • #19
          Originally posted by nilshomer
          One 5Kb sequence to the human genome in 10 minutes? Try "bwa-sw" or "bowtie2" as a benchmark.
          For reference, the blasr method aligns 10 kb sequences of 87% accuracy to the human genome in under 0.5 s per sequence.

