
  • correcting homopolymer run errors

    Hi all,

    We have been running de novo assembly of a eukaryotic genome, using 454 Titanium together with gsAssembler. When we compare our assembly with cloned cDNA fragments (sequenced with Sanger), we find some homopolymer errors. So we were wondering:

    - Are there any reports on how common these errors are (especially in coding regions)?

    - How have people dealt with these problems? We were thinking about running Illumina or SOLiD (which would give us 50-100x coverage) and using these data to correct the homopolymer run errors. Do you know of any programs or papers that might help?

    thanks
    /Jakub
    Last edited by 454andSolid; 04-21-2010, 02:36 AM.

  • #2
    At the time of writing, I have been looking for a way to use SOLiD data to correct 454 homopolymer errors and have come up short. I know some people are working on this, but with NGS workflows focused on resequencing and SNP detection, finishing de novo 454 assemblies with additional data, especially from SOLiD runs, seems to be a sadly neglected area.

    I'd be delighted to hear otherwise from someone.


    • #3
      There are a couple of other threads on this forum about this. Several papers are out there too; searching PubMed should turn up some good information.

      As far as I know, the only implemented script is the one mentioned here by Torst.


      • #4
        Originally posted by 454andSolid
        We have been running de novo assembly of a eukaryotic genome, using 454 Titanium together with gsAssembler. When we compare our assembly with cloned cDNA fragments (sequenced with Sanger), we find some homopolymer errors. So we were wondering:
        - Are there any reports on how common these errors are (especially in coding regions)?
        - How have people dealt with these problems? We were thinking about running Illumina or SOLiD (which would give us 50-100x coverage) and using these data to correct the homopolymer run errors. Do you know of any programs or papers that might help?

        Homopolymer errors can occur wherever the true sequence has a run of about three or more of the same base. If such runs are more common in coding regions, then coding regions will be affected more; it's genome dependent. In bacteria, which are coding-dense, this means most homopolymer errors fall inside genes and cause frame-shifts :-(
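
To get a feel for how exposed a given sequence is, one can list its homopolymer runs. The following is a minimal Python sketch, not taken from any tool mentioned in this thread; the function name `homopolymer_runs` and the default threshold of three bases (the rule of thumb above) are illustrative assumptions.

```python
import re


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of >= min_len identical bases.

    Positions are 0-based. Only A, C, G and T are considered; the input is
    upper-cased first so lower-case (soft-masked) sequence is handled too.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    seq = seq.upper()
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    for start, base, length in homopolymer_runs("ACGTTTTAGGGCA"):
        print(start, base, length)
```

Running this over a set of predicted coding sequences and over the whole assembly would give a rough comparison of how many candidate error sites each contains. It only finds sites where errors are possible, not the errors themselves.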

        We use Illumina and SOLiD short reads to correct 454 scaffolds produced by gsAssembler/Newbler. We don't correct the reads themselves, rather the contigs or scaffolds that are assembled by gsAssembler.

        As colindaven said, I explain in this thread (http://seqanswers.com/forums/showthread.php?t=3635) how our software Nesoni could be used for this purpose. The key is to use a read mapper that is good at detecting indels; detecting SNPs is not much use for fixing homopolymer errors.
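
The correction step itself can be sketched in a few lines of Python. This is not how Nesoni works internally (the thread does not say). It only assumes that some indel-aware mapper and caller has already produced, for one contig, a list of `(position, reference allele, corrected allele)` tuples in a hypothetical 0-based, VCF-like form. The function name `apply_indel_corrections` is likewise made up for illustration.

```python
def apply_indel_corrections(contig, corrections):
    """Apply indel corrections to a contig and return the corrected sequence.

    corrections: iterable of (pos, ref, alt) tuples, where pos is the 0-based
    start of ref in the contig (an assumed, VCF-like convention).
    Overlapping corrections are not handled by this sketch.
    """
    seq = contig
    # Work from right to left so that earlier coordinates stay valid
    # after an insertion or deletion changes the sequence length.
    for pos, ref, alt in sorted(corrections, key=lambda c: c[0], reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(
                "reference allele %r does not match contig at position %d" % (ref, pos)
            )
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


if __name__ == "__main__":
    contig = "ACGTTTAGCCCCCA"
    # One undercalled T run (insert a T) and one overcalled C run (drop a C).
    fixes = [(3, "T", "TT"), (8, "CC", "C")]
    print(apply_indel_corrections(contig, fixes))
```

Checking the reference allele before each edit guards against applying calls to the wrong contig or to a sequence that has already been modified.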


        • #5
          I will try using Nesoni with our transcriptome data.

          Thanks for the advice!
