  • minimal read length accepted by Bowtie

    Hi all,

    I wonder if anyone knows the minimal read length accepted by Bowtie. Basically, I have a set of short motif sequences (7mers) and want to see where they map to the mouse reference genome. I tried Bowtie, but it seems to not work because of the short read length (7 bp).

    Any suggestions will be very much appreciated!

    Thanks,
    Yue

  • #2
    Same question for BWA.


    • #3
      Because a short sequence like 7 bases would map all over the place, it's very unlikely that any read aligner will handle it properly. The algorithms they use are mostly designed to handle sequences no shorter than the shortest reads that come from Illumina sequencers (32bp I think).

      The good news is that since you're looking for a relatively small number of specific 7-base sequences without gaps or mismatches, a simple string search should be able to do it for you. A Python or Perl script could just loop over every line in the reference genome and print out any location where it finds one of the matching strings. If you have no idea how to code one, let me know and I'll write you one when I have a few spare minutes.
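A minimal sketch of the string search described above, assuming the reference is a plain FASTA file and that all motifs have the same length. The function name `find_exact_motifs` is hypothetical. One detail goes beyond a plain line-by-line loop: FASTA sequences are wrapped, so the last k-1 bases of each line are carried over to catch matches that span a line break. Only the forward strand is searched.

```python
import io

def find_exact_motifs(fasta_lines, motifs):
    """Yield (sequence name, 0-based start, motif) for each exact forward-strand match."""
    wanted = {m.upper() for m in motifs}
    lengths = {len(m) for m in wanted}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes all motifs have the same length")
    k = lengths.pop()
    name, tail, offset = None, "", 0  # offset = sequence position of tail[0]
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New sequence record: reset the carried-over bases.
            name = (line[1:].split() or ["unnamed"])[0]
            tail, offset = "", 0
            continue
        window = tail + line.upper()
        for i in range(len(window) - k + 1):
            kmer = window[i:i + k]
            if kmer in wanted:
                yield name, offset + i, kmer
        # Keep the last k-1 bases so matches spanning a line break are found.
        keep = min(k - 1, len(window))
        offset += len(window) - keep
        tail = window[len(window) - keep:]

# Toy example; a real run would pass an open file handle instead.
toy = io.StringIO(">chr1 toy\nACGTAC\nGTTTGA\n")
hits = list(find_exact_motifs(toy, ["ACGTTTG", "ACGTACG"]))
print(hits)  # [('chr1', 0, 'ACGTACG'), ('chr1', 4, 'ACGTTTG')]
```

To search the reverse strand as well, the reverse complement of each motif could be added to the motif list.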


      • #4
        Originally posted by Rocketknight
        Because a short sequence like 7 bases would map all over the place, it's very unlikely that any read aligner will handle it properly. The algorithms they use are mostly designed to handle sequences no shorter than the shortest reads that come from Illumina sequencers (32bp I think).

        The good news is that since you're looking for a relatively small number of specific 7-base sequences without gaps or mismatches, a simple string search should be able to do it for you. A Python or Perl script could just loop over every line in the reference genome and print out any location where it finds one of the matching strings. If you have no idea how to code one, let me know and I'll write you one when I have a few spare minutes.
        Hi Rocketknight,

        Thanks a lot for your reply. I actually managed to get Bowtie working on the short 7-mers with a few additional options. The tricky part of writing a script to do it is that the alignment does not need to be exact (i.e. up to 2 mismatches anywhere in the 7-mer are allowed).
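For short motifs, allowing mismatches in a script need not be hard. One possible approach, sketched here rather than taken from the thread, is to enumerate every sequence within the allowed Hamming distance of the motif once, then do an exact set lookup at each position. The names `hamming_neighbors` and `find_approx` are hypothetical, substitutions are limited to A, C, G and T, and only the forward strand is searched.

```python
from itertools import combinations, product

def hamming_neighbors(motif, max_mismatches, alphabet="ACGT"):
    """Return the set of sequences within max_mismatches substitutions of motif."""
    motif = motif.upper()
    neighbors = {motif}
    for d in range(1, max_mismatches + 1):
        for positions in combinations(range(len(motif)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(motif)
                for p, c in zip(positions, letters):
                    chars[p] = c
                # Using a set removes duplicates from "substitutions"
                # that put the original base back.
                neighbors.add("".join(chars))
    return neighbors

def find_approx(sequence, motif, max_mismatches=2):
    """Return (0-based start, matched k-mer) for every approximate hit."""
    neighbors = hamming_neighbors(motif, max_mismatches)
    k = len(motif)
    seq = sequence.upper()
    return [(i, seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if seq[i:i + k] in neighbors]

print(len(hamming_neighbors("ACGTACG", 2)))    # 211
print(find_approx("TTACGTACGTT", "ACGTACG", 0))  # [(2, 'ACGTACG')]
```

For a 7-mer with two mismatches the lookup set has only 211 entries, so the per-position cost stays the same as for an exact search. The set grows quickly with motif length and mismatch count, so this is only a sketch for short motifs.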


        • #5
          You're going to get a huge number of matches if you search a large genome with those parameters (by my back-of-the-envelope calculations, a 7 bp string with two allowed mismatches will hit by chance more than 0.1% of the time in a statistically average genome). In other words, for a 1 Gb genome, you should be seeing over one million matches for each 7-mer on average. Does Bowtie really report all of those matches?

          Edit: If it doesn't, all is not lost. It's definitely possible to write a string searcher that allows mismatches in Python (though I give no guarantees about running time). I'm willing to help if you're stuck; it sounds like an interesting problem.

          Extra edit: Whoops, I made a mistake in my calculations. You should expect a random hit rate as high as about 0.45%. For the mouse genome (~3 Gb) you should expect to see around 13-14 million hits per 7-mer by chance.
          Last edited by Rocketknight; 04-05-2012, 03:28 AM.
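The post does not show its working, so the following is only a sketch of one way to make such an estimate. It assumes a uniform random model: each base is independent and equally likely to be A, C, G or T, and only one strand is counted. The function names are hypothetical. Under these assumptions the rate comes out at roughly 1.3% per position, higher than the figures quoted above, which may rest on different assumptions.

```python
from math import comb

def neighborhood_size(k, d, alphabet_size=4):
    """Number of distinct k-mers within d substitutions of a given k-mer."""
    return sum(comb(k, i) * (alphabet_size - 1) ** i for i in range(d + 1))

def chance_hit_rate(k, d, alphabet_size=4):
    """Probability that a uniformly random k-mer matches with at most d mismatches."""
    return neighborhood_size(k, d, alphabet_size) / alphabet_size ** k

rate = chance_hit_rate(7, 2)
print(neighborhood_size(7, 2))  # 211
print(f"{rate:.4%}")

# Illustrative genome size of 3e9 positions (an assumption, one strand only).
print(f"{rate * 3e9:,.0f} expected chance hits")
```

Real genomes are not uniform random sequences, so actual counts per motif can differ considerably from this estimate.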


          • #6
            Originally posted by Rocketknight
            ... it's definitely possible to write a string-searcher with mismatching in Python (though I give no guarantees about running time). I'm willing to help if you're stuck, it sounds like an interesting problem.
            It's possible to use fqgrep for approximate sequence searching.
