  • clustering algorithm for single reads from transposon integrations

    We have Ion Torrent reads from retrovirus (transposon) integration sites in an unsequenced genome, and we need to cluster them by sequence identity. The first fifty bases of each read are always the transposon end, and the rest is an essentially random piece of genomic DNA that flanks the insertion. We need to collapse or cluster the reads from each unique integration site together. Currently we use de novo assembly algorithms, but those perform poorly. We need to relax the stringency of alignment because of the sequencing errors, and then de novo assembly artificially joins clusters together. Our clusters should have a length of only one read.

    Does anybody know of a suitable algorithm to create these single-read clusters?

  • #2
    Originally posted by Retro
    We have Ion Torrent reads from retrovirus (transposon) integration sites in an unsequenced genome, and we need to cluster them by sequence identity. The first fifty bases of each read are always the transposon end, and the rest is an essentially random piece of genomic DNA that flanks the insertion. We need to collapse or cluster the reads from each unique integration site together. Currently we use de novo assembly algorithms, but those perform poorly. We need to relax the stringency of alignment because of the sequencing errors, and then de novo assembly artificially joins clusters together. Our clusters should have a length of only one read.

    Does anybody know of a suitable algorithm to create these single-read clusters?
    As I was preparing a response, it became less clear exactly what you are trying to achieve. When you say that you want to relax the stringency of alignment associated with assembly and use a clustering approach, that makes sense. When you say that clusters should contain one read, that seems completely in conflict with the previous statement. Could you clarify your post?


    • #3
      Thanks for your response. The clusters should have a length of one read. They can contain, for example, 50 reads, but all reads start at position 1 (the "left side" of the aligned cluster). The reads in a cluster might differ in length, depending on the initial fragmentation.

      To make it more difficult, our reads come from a pool of animals, so in addition to sequencing errors we also see SNPs. That is why we cannot use assembly based on, let's say, 99% homology. The de novo algorithm then starts adding reads to our clusters that extend the cluster in length, mostly based on random inverted repeats in the genomic tags.


      • #4
        OK, I finally found a great program, USEARCH (http://www.drive5.com/usearch/usearch_docs.html), that does exactly that.
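
For readers who want to see the idea behind this kind of clustering, here is a minimal Python sketch of greedy centroid clustering in which every read is compared from its first base with no offset, so a cluster can never grow longer than its longest read. It is only an illustration of the requirement described in this thread, not USEARCH's actual implementation. The names `anchored_identity` and `cluster_reads`, the 0.9 identity threshold, and the substitution-only comparison are all assumptions made for the example. Ion Torrent errors are often indels, so real data would likely need an edit-distance comparison instead.

```python
from typing import List

# From the thread: the first 50 bases of each read are the transposon end.
TRANSPOSON_LEN = 50


def anchored_identity(a: str, b: str) -> float:
    """Fraction of matching bases over the shared prefix of two sequences.

    Both sequences are compared from their first base with no offset.
    Only substitutions are modelled (an assumption of this sketch).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def cluster_reads(reads: List[str], min_identity: float = 0.9,
                  trim: int = TRANSPOSON_LEN) -> List[List[str]]:
    """Greedy centroid clustering of reads anchored at their first base.

    The first `trim` bases (the transposon end) are ignored. Reads with
    no flanking sequence left after trimming each form their own cluster.
    """
    # Process longest reads first so each centroid is the longest flank
    # seen for its cluster.
    ordered = sorted(reads, key=len, reverse=True)
    centroids: List[str] = []
    clusters: List[List[str]] = []
    for read in ordered:
        flank = read[trim:]
        for i, centroid in enumerate(centroids):
            if anchored_identity(flank, centroid) >= min_identity:
                clusters[i].append(read)
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            centroids.append(flank)
            clusters.append([read])
    return clusters
```

Calling `cluster_reads(reads)` on a list of read strings returns one list of reads per putative integration site. Lowering `min_identity` tolerates more sequencing errors and SNPs, at the cost of possibly merging distinct sites.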
