  • Fast tool for finding subsequences?

    Hello!
    I'm new to bioinformatics and I have run into the following problem.
    I'm looking for a fast, preferably multithreaded (or at least parallelizable) tool that can find all occurrences of a given short subsequence, allowing several mismatches, in every read of an NGS output. Most of the mismatches are substitutions, not indels.
    For example, I have a lot (thousands) of sequences like this: "fdsjfsjdkdfjSPARjdskfjdskSPAMfddjskdsfjkSPAMdkdsfjk", and I need to get a matrix with the positions of all occurrences of "SPAM" in every sequence.
    I tried matchPattern from R/Bioconductor and the Python package fuzzysearch, but neither is really fast. "Motif" from Biopython does not seem quite right for my goals and is not very fast either.

    Do you know any suitable tool for me?

    I believe one exists and I do not need to reinvent the wheel.

    Thank you in advance.
    Last edited by Rammaria; 09-09-2014, 04:29 AM.

  • #2
    If you know the pattern you want to look for (e.g. SPAM in the example above), then fuzznuc from EMBOSS may be an option.

    If you are looking for patterns de novo, then a k-mer search (of the length you want) may work.

    Having to allow for mismatches would make this job a lot more difficult.

    Then there may be the possibility of using "grep" in some creative ways (if you know the pattern you want to search for).
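To make the "known pattern with mismatches" case concrete, here is a minimal pure-Python sketch of a substitution-only (Hamming distance) scan. It is a slow reference version, not how fuzznuc or any other tool mentioned in this thread works internally. The function name `hamming_hits` and the `max_mismatches` parameter are made up for illustration.

```python
def hamming_hits(seq, pattern, max_mismatches=1):
    """Return 0-based start positions where `pattern` matches `seq`
    with at most `max_mismatches` substitutions (indels are not handled)."""
    k = len(pattern)
    hits = []
    for start in range(len(seq) - k + 1):
        mismatches = 0
        for a, b in zip(seq[start:start + k], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions, give up on this window
        else:
            # inner loop finished without break: window is within the limit
            hits.append(start)
    return hits


# One list of positions per read; together these rows form the position "matrix".
reads = ["fdsjfsjdkdfjSPARjdskfjdskSPAMfddjskdsfjkSPAMdkdsfjk"]
positions = [hamming_hits(read, "SPAM", max_mismatches=1) for read in reads]
exact_only = [hamming_hits(read, "SPAM", max_mismatches=0) for read in reads]
```

Each read is processed independently, so the per-read calls could be spread over worker processes (for example with `multiprocessing.Pool.map`) if this turns out to be too slow on a single core.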



    • #3
      BBDuk can find all instances of a string (up to 31 bp) allowing a set number of mismatches. It will not return a matrix of positions, but it can replace all instances with some symbol that is then easy to find with a different tool that is not capable of handling mismatches. For example:

      Code:
      bbduk.sh -Xmx1g in=reads.fa out=masked.fa literal=ACGT k=4 ktrim=x hdist=1 rcomp=f

      For the input file:
      Code:
      >1
      GGGGGACGTGGGGGGGACGT
      the output would be
      Code:
      >1
      GGGGGxxxxGGGGGGGxxxx
      You can use the "hdist" flag to set the maximum Hamming distance, and the "rcomp" flag to choose whether only forward matches or both forward and reverse-complement matches are replaced. It's fast and multithreaded.
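Since the masked output does not give positions directly, a short post-processing step can recover them. Below is a rough sketch that assumes a plain FASTA file masked with the symbol "x" as in the example above. The helper name `masked_spans` is hypothetical and is not part of BBTools.

```python
import re


def masked_spans(fasta_text, mask="x"):
    """Map each record name to a list of (start, end) spans of the mask
    symbol, using 0-based, end-exclusive coordinates."""
    run = re.compile(re.escape(mask) + "+")
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # sequence lines may be wrapped, so concatenate them
            records[name] += line
    return {n: [(m.start(), m.end()) for m in run.finditer(s)]
            for n, s in records.items()}


masked = ">1\nGGGGGxxxxGGGGGGGxxxx\n"
spans = masked_spans(masked)
```

One caveat: adjacent or overlapping matches merge into a single masked run, so the number of runs can be smaller than the number of individual hits.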

      Edit - the functionality I suggested seems to be broken; I will investigate.
      Edit2 - Fixed now as of v33.40b.
      Last edited by Brian Bushnell; 09-04-2014, 09:27 AM.



      • #4
        Another k-mer-based approach would be mirabait from the MIRA v4 assembler. This assumes your pattern SPAM is short enough that you can use it directly as a k-mer to search for.
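The general idea of using the pattern itself as a k-mer can be sketched as follows: enumerate every k-mer within a given Hamming distance of the pattern once, then test each window of a read with a set lookup. This only illustrates the technique and is not mirabait's actual implementation. The function names and the DNA alphabet are assumptions chosen for the example.

```python
from itertools import combinations, product


def hamming_neighbors(pattern, alphabet="ACGT", max_dist=1):
    """All strings within `max_dist` substitutions of `pattern`,
    including the pattern itself."""
    neighbors = {pattern}
    for d in range(1, max_dist + 1):
        for idxs in combinations(range(len(pattern)), d):
            for subs in product(alphabet, repeat=d):
                chars = list(pattern)
                for i, c in zip(idxs, subs):
                    chars[i] = c
                neighbors.add("".join(chars))
    return neighbors


def kmer_hits(seq, kmer_set, k):
    """0-based start positions of windows of length k found in `kmer_set`."""
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in kmer_set]


neighbors = hamming_neighbors("ACGT", max_dist=1)
hits = kmer_hits("GGGGGACGTGGGGGGGACGT", neighbors, k=4)
```

The neighbor set grows quickly with pattern length and allowed distance, so this trade-off only pays off for short patterns and small distances.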



        • #5
          Many thanks to you all for your great solutions!

          I know the pattern exactly (it must be strictly 'SPAM' in my example), so fuzznuc and bbduk seem to be right for me.
          Last edited by Rammaria; 09-09-2014, 03:33 AM.



          • #6
            Brian, how can I cite bbduk and other bbtools if I use them?
            Last edited by Rammaria; 09-09-2014, 04:28 AM.



            • #7
              They're not yet published, so you can just cite my name and the SourceForge website (https://sourceforge.net/projects/bbmap/).



              • #8
                Ok, thank you!
