
  • Identifying (and removing) 454 homopolymers frameshifts

    Hi!

    I have resequenced two strains of the same bacterium using 454. The run seems to have been rather poor, and I got a lot of homopolymer errors (on average, one frameshift per gene). I need to get rid of those errors without doing any further sequencing. My purpose is to compare substitution rates in these strains, so I don't really need extra-accurate sequences.

    For the moment, my strategy is the following:
    - Use the reference strain genes as queries with the BLAST-Extend-Repraze (BER, http://ber.sourceforge.net/) tool.
    - Parse its output (which does not seem straightforward).
    - Remove extra nucleotides in my sequence, or add Ns.
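
    A minimal sketch of the last step, assuming the parsing step has already produced a gapped pairwise alignment of each reference gene against my sequence. It does not read BER output. The function name and the two-string input are hypothetical, and it treats every indel relative to the reference as a sequencing error, so a genuine indel would be erased as well:

```python
def correct_frameshifts(ref_aln: str, query_aln: str) -> str:
    """Repair indels in a resequenced gene using an aligned reference.

    ref_aln and query_aln are the two rows of a pairwise alignment,
    with '-' marking gaps. Assumption: every indel is a 454 error.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")

    corrected = []
    for ref_base, query_base in zip(ref_aln, query_aln):
        if ref_base == "-":
            # Gap in the reference: the query has an extra base
            # (or both rows are gaps), so drop this column.
            continue
        if query_base == "-":
            # Gap in the query: a base is missing, so pad with N
            # to restore the reading frame.
            corrected.append("N")
        else:
            # Aligned column: keep the query base, so substitutions survive.
            corrected.append(query_base)
    return "".join(corrected)


# Toy example: one extra A in a homopolymer, one missing T in another.
ref = "ATGAAA-CCGTTTTGA"
qry = "ATGAAAACCGTTT-GA"
print(correct_frameshifts(ref, qry))
```

    The corrected sequence always has the same length as the ungapped reference gene, so it stays in frame. Substitutions are left untouched, which is what matters for comparing substitution rates.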

    In short, BER takes the output of a regular blastp search with your favorite protein, retrieves the hits that match it well, and goes back to the DNA sequence corresponding to your protein (i.e., the CDS plus some flanking sequence).

    Is there any other tool that I'm not aware of? Do you know of any BER output parser? Does my strategy make sense to you?

    Thanks for your advice and comments!

  • #2
    I have the same problem with errors in homopolymers. Did you ever find a tool to correct them by comparison with a reference genome?

    Best regards


    • #3
      AmpliconNoise (http://code.google.com/p/ampliconnoise/downloads/list) is the best tool I've found for removing homopolymer errors.


      • #4
        running AmpliconNoise

        SeaJane, have you been able to get AmpliconNoise up and running?


        • #5
          I can install AmpliconNoise, but the steps for running it seem daunting, especially given the number of datasets I have.

          Does anyone have scripts available for automating it? After all, this is a fairly common task for 454 users.


          • #6
            HMM-FRAME: 'state of the art' in pyrosequencing frameshift correction and protein domain classification (includes a 454 error model in the Viterbi HMMER algorithm). Unlike AmpliconNoise, it only corrects coding reads.

            Background: Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors.

            Results: We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity in a data set with annotated errors. We applied HMM-FRAME in Targeted Metagenomics and a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families.

            Conclusions: HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/ .
