  • #16
    Originally posted by colindaven View Post
    This might be useful in metagenomics (as kopi-o mentioned) if it provided accurate output in the standard SAM alignment format.

    Certainly I have problems fitting enough microbial genomes into the 4GB reference sequence limitation of Bowtie and BWA, given there are now > 6-7 GB which are publicly available.

    The most recent bwa was supposed to remove the 4GB limit.

    Comment


    • #17
      Lots of good advice!

      Thanks to all for the thoughtful comments. Replies follow. I should make one significant correction, though. The times I quoted for finding a multi-kilobyte target in C. elegans and the human genome were egregiously in error: I was off by two decimal places. The actual times were 1/5 second for C. elegans and 6 seconds for the human genome, not 600 seconds. This suggests that, for this particular kind of search, it is a good order of magnitude faster than BWA-SW rather than a few times slower. (Whether that matters is another question. You still have to apply a real alignment algorithm to the resulting match.)

      Colindaven and kopi-o: metagenomics sounds very much worth investigating. I'll certainly look into this.

      nilshomer's and rskr's comments on memory and caching require a detailed response. Algorithms that use large amounts of memory interact with the OS and hardware in non-obvious ways. As nilshomer pointed out, hierarchical and random-access data structures tend to have low locality of reference, defeating much of the magic of the TLB and the various hardware caches. Although, with this algorithm, the entire data structure for, say, the human genome would easily fit in the hardware caches, that turns out to be largely irrelevant. With sequential access, hardware executing off to the side generally takes care of populating the caches before the data is needed. Possibly equally important, the TLB also experiences fewer faults because the input data is in order, allowing the circuitry that anticipates your next access to guess correctly more often. Moreover, sequentially accessed data almost always has a very low page fault rate, as every byte on each page is fully used. So with fully sequential access, the average time cost of reading a byte of memory can be orders of magnitude lower than in an algorithm using hashing or hierarchical data structures.
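To make the locality argument concrete, here is a rough sketch (not the algorithm discussed in this thread) that sums the same array twice, once in index order and once in a shuffled order. The function names `sum_in_order` and `compare_access_patterns` and the array size are made up for illustration. In Python the interpreter overhead hides much of the gap, so treat the timings as indicative only; a C version would show the effect far more starkly.

```python
import random
import time
from array import array


def sum_in_order(data, order):
    """Sum data[i] for each i in order; only the visiting order varies."""
    total = 0
    for i in order:
        total += data[i]
    return total


def compare_access_patterns(n=1_000_000, seed=0):
    """Return {pattern: (sum, seconds)} for sequential and shuffled traversal."""
    data = array("q", range(n))          # contiguous 64-bit integers
    sequential = list(range(n))          # visit indices 0, 1, 2, ...
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # same indices, scattered order

    results = {}
    for name, order in (("sequential", sequential), ("random", shuffled)):
        start = time.perf_counter()
        total = sum_in_order(data, order)
        results[name] = (total, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for name, (total, seconds) in compare_access_patterns().items():
        print(f"{name:>10}: sum={total} time={seconds:.3f}s")
```

Both traversals do identical arithmetic and return the same sum, so any difference in time comes from the order in which memory is touched.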

      nilshomer's comments about storing only diffs of highly similar genomes can't be gainsaid. I was thinking of the case where you are looking, say, for matches among distant taxa, which may be a much less common problem.

      Anyway, thanks again for the considerable time and thought you folks have put into answering these questions. Maybe it would make sense to take this input and try to come up with an interesting demo: download a couple of dozen genomes and demonstrate finding some obscure gene in a few seconds.

      Meanwhile, if anybody comes across any useful applications, either for searching entire genomes, or for rapidly estimating the Levenshtein distance of fragments in the kilobyte to megabyte range, I'd love to hear about it. It's gotta be good for something! I can also be reached as coatespt at g mail.
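For anyone who wants a baseline to compare an estimate against, the exact Levenshtein distance can be computed with the textbook two-row dynamic program. This is only a reference sketch of that standard method, not the estimation approach described above, and the function name `levenshtein` is just a label chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored rows as short as possible
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The running time is proportional to len(a) * len(b), so two megabyte-sized fragments need on the order of 10^12 cell updates. That cost is what makes a fast estimate attractive at those sizes.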

      Comment


      • #18
        Originally posted by rskr View Post
        The most recent bwa was supposed to remove the 4GB limit.
        It does handle 4GB or greater!

        Comment


        • #19
          Originally posted by nilshomer View Post
          One 5Kb sequence to the human genome in 10 minutes? Try "bwa-sw" or "bowtie2" as a benchmark.
          For reference, the blasr method aligns 10kb sequences with 87% accuracy to the human genome in under 0.5s per sequence.

          Comment
