  • comparing seq data to itself for frequency

    What we're trying to do:
    Illumina single-end reads
    ~30 million reads containing approximately 400,000 unique, unknown 20bp sequences, each flanked by known sequence (a "tag" if you will)
    We need to count the frequency of each barcode sequence (allowing up to 2bp of mismatch) across the entire data set.
    The problem is that we don't have a reference to match the data against, since the sequences are unknown. Has anyone done anything like this, or does anyone know of software that might be able to do it?

    We are currently using MATLAB and a brute-force technique: we compare each new sequence to all of the ones before it, increment that sequence's count if it matches, or add it to the list if it is unique. This is going to be exceedingly slow, so we're hoping there is a better way!

    Thanks in advance!

  • #2
    Do you need to count the barcodes or the unique sequences between the barcodes?
    You could count the number of unique (full) sequences first (probably 2-3 mins) to reduce the number of sequences to process and then use those sequences to check for the barcodes using regex or some algo for approximate string matching.
    Do you know the original barcodes? Do your mismatches include indels?
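
A minimal Python sketch of the "count the unique full sequences first" idea, assuming one read per line of plain text. The function name `count_unique_reads` and the sample reads are made up for illustration and are not from the original poster's data.

```python
from collections import Counter

def count_unique_reads(lines):
    """Collapse identical reads and keep how often each one occurred."""
    counts = Counter()
    for line in lines:
        read = line.strip().upper()
        if read:  # skip blank lines
            counts[read] += 1
    return counts

# Illustrative reads only (hypothetical, not real data).
sample = [
    "GGCGCGCCACGTACGTACGTACGTACGTGGCCAT",
    "GGCGCGCCACGTACGTACGTACGTACGTGGCCAT",
    "GGCGCGCCTTTTTTTTTTTTTTTTTTTTGGCCAT",
]
unique_reads = count_unique_reads(sample)
```

Because the function only iterates over its input, an open file handle can be passed in directly, so the 30 million lines never need to sit in memory at once; only the distinct reads and their counts do.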


    • #3
      Oops, sorry, we refer to the unknown 20bp sequence as a "barcode", but I realize that the term means something else to the rest of the world.
      We do not know the original 20bp sequences, as they were created from randomized oligos. The mismatches will not include indels.
      A line of our data looks like this:
      GGCGCGCCNNNNNNNNNNNNNNNNNNNNGGCCAT
      The Ns in the middle are our unknown sequence, flanked on both ends by "known" sequence.
      Basically we want to compare bases 9-28 of each line of data (the 20 bases after the 8-base leading flank), and count how many times each one is found among the ~30 million lines of data.
      I hope that is clearer, thanks so much for your help!
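
One way to pull out the unknown 20-mer and count exact occurrences, sketched in Python. The flank sequences are taken from the example line above; the names `extract_tag` and `count_tags` are hypothetical. It assumes the flanks themselves are read without errors, so a read with a sequencing error in either flank is simply skipped.

```python
import re
from collections import Counter

LEFT_FLANK = "GGCGCGCC"   # known sequence before the 20-mer (from the example line)
RIGHT_FLANK = "GGCCAT"    # known sequence after the 20-mer
TAG_LENGTH = 20

# Capture exactly TAG_LENGTH bases sitting between the two known flanks.
TAG_PATTERN = re.compile(
    re.escape(LEFT_FLANK) + "([ACGTN]{%d})" % TAG_LENGTH + re.escape(RIGHT_FLANK)
)

def extract_tag(read):
    """Return the 20-mer between the flanks, or None if the flanks are not found."""
    match = TAG_PATTERN.search(read.strip().upper())
    return match.group(1) if match else None

def count_tags(reads):
    """Count exact occurrences of each extracted 20-mer."""
    counts = Counter()
    for read in reads:
        tag = extract_tag(read)
        if tag is not None:
            counts[tag] += 1
    return counts
```

Matching on the flanks rather than on fixed column positions is a design choice: it tolerates reads where the construct is shifted by a base or two, at the cost of discarding reads whose flanks contain errors.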


      • #4
        One more thing to think about. Since you want to group the sequences with 2 allowed mismatches, you run into the problem of clustering. You basically have to calculate the distance between all the sequences and then group them. There are different approaches on how to cluster or classify and each of them might give you a different number.

        I would suggest the following:
        1) extract the "unknown" sequence
        2) remove duplicates, but keep the counts
        3) calculate the distance between all sequences (I would suggest Hamming distance, since there are no indels)
        4) use a clustering or classification method to get the number of "groups" with max 2 mismatches

        This is similar to finding OTUs, e.g. for 16S sequences. There are a bunch of programs already designed to do that work for you. You could do steps (1) and (2) yourself and then feed the data into one of those programs.
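
Steps (3) and (4) could be sketched roughly as below, in Python. This is only one of the many possible clustering approaches mentioned above, not the method of any particular OTU program: a greedy, abundance-ordered pass in which each tag joins an existing cluster centre within 2 mismatches or starts a new cluster. The names `hamming`, `segments` and `greedy_cluster` are hypothetical. To avoid comparing every pair, it relies on the pigeonhole principle: if two 20-mers differ at no more than 2 positions and each is cut into 3 pieces, at least one piece must be identical, so only centres sharing a piece need to be checked.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def segments(seq, parts):
    """Split seq into `parts` contiguous pieces, each tagged with its index."""
    n = len(seq)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [(i, seq[bounds[i]:bounds[i + 1]]) for i in range(parts)]

def greedy_cluster(tag_counts, max_mismatches=2):
    """Assign each tag to an existing centre within max_mismatches, else make it a centre."""
    parts = max_mismatches + 1
    index = {}            # (piece index, piece) -> list of centres containing that piece
    clusters = Counter()  # centre -> summed read count of its members
    # Most abundant tags first; ties broken alphabetically so results are repeatable.
    ordered = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for tag, n in ordered:
        keys = segments(tag, parts)
        centre = None
        for key in keys:
            for cand in index.get(key, ()):
                if hamming(tag, cand) <= max_mismatches:
                    centre = cand
                    break
            if centre is not None:
                break
        if centre is None:
            centre = tag  # nothing close enough: start a new cluster
            for key in keys:
                index.setdefault(key, []).append(centre)
        clusters[centre] += n
    return clusters
```

The input is the deduplicated tag-to-count mapping from step (2). As noted above, different clustering choices can give different numbers: here a tag is only compared against cluster centres, so two low-count tags that are close to each other but not to any centre end up in separate clusters.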
