  • merging paired reads - any software out there?

    Hello:
    Is there any software that can merge paired reads from short fragments into complete fragment sequences?
    We have sequenced short fragments (140 to 190 bp; the target was actually 180 to 220 bp, but some size-selection error occurred) using 100 bp Illumina paired-end reads. For most fragments, the two paired reads overlap enough to merge them perfectly into a single fragment sequence. I have tested this manually on a few hundred reads: take a read, find the corresponding paired read, reverse complement one of the sequences, and align. I found only two sequences with a mismatch in the overlap region.
    But doing this manually on several million sequenced fragments is a bit too much.
    So is there any software that could handle this?
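For illustration, the manual procedure described above (reverse complement one read, then align the overlap) could be sketched in Python roughly as follows. This is only a minimal sketch, not how any published tool works: the function names, the `min_overlap` and `max_mismatches` parameters, and the longest-overlap-first search are assumptions made for the example. It ignores quality scores and does not handle fragments shorter than the read length.

```python
# Minimal sketch of overlap-based merging of one read pair.
# All names and defaults here are illustrative assumptions.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatches=0):
    """Merge a forward read with its mate, or return None.

    The mate is reverse complemented, then the end of the forward read
    is compared with the start of the reverse-complemented mate,
    trying the longest possible overlap first.
    """
    rev_rc = reverse_complement(rev)
    longest = min(len(fwd), len(rev_rc))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = fwd[len(fwd) - overlap:]   # end of the forward read
        head = rev_rc[:overlap]           # start of the flipped mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            # Keep the forward read and append the non-overlapping part.
            return fwd + rev_rc[overlap:]
    return None  # no acceptable overlap: leave the pair unmerged
```

Calling `merge_pair(read1, read2)` on each pair of a FASTQ file would give either the reconstructed fragment or `None`. On millions of reads a pure-Python loop like this would be slow, so it is meant to show the idea rather than to be used as is.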

  • #2
    Try this:



    • #3
      Hi,
      Another tool has been published recently, FLASH, and compared to SHERA:


      I haven't tried either of them yet. It would be great if you could tell us whether they work!

      Emilie



      • #4
        Merging sequences

        First of all, thanks Chipper and Emilie.
        I have installed both SHERA and FLASH and compared them. The differences between the two programs are huge. SHERA does a very good job of splicing the sequences, even when the combined sequences turn out to be derived from a fragment that is shorter than the read length (100 bp in my case). It does that by recognizing the adapter sequences (listed in a separate small file) and clipping them off the spliced reads accordingly. However, there are two major drawbacks to SHERA. One is that it forces sequences to join even if there should be a gap between the forward and reverse reads (i.e. the sequenced fragment is longer than the two reads combined). As a result, it creates spliced reads that are wrong. Since the sequence reads I obtained were mostly of very high quality, I reduced the minimum overlap length from the default 10 bp to 3 bp, which reduced the number of wrongly spliced reads, but many were still left, in addition to about 4% that should never have been spliced at all.
        One of the strengths of FLASH is exactly that it can decline to splice reads that do not meet certain criteria (minimum overlap, maximum mismatch). However, FLASH cannot identify reads derived from fragments shorter than the read length, and it dumps all of those in the "notCombined" files as well.
        When it comes to speed, the conclusion is very clear: FLASH does it... in a flash. FLASH took only about 30 minutes to go through 2 x 8 million reads, while SHERA took about 60 hours.
        So my preferred program would have the speed of FLASH and its ability not to force splicing of reads, but would check for fragments shorter than the read length as SHERA does, and combine the forward and reverse reads while clipping off the linker sequences, the way SHERA does. I would also like an option for reads that cannot be properly merged because their overlap is shorter than the minimum: these would still be joined, with some NNN and correspondingly low quality scores in between them, to keep all the information nicely in a single file instead of three.
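To make the two wished-for features concrete, here is a rough Python sketch. It is not code from SHERA or FLASH, and every name in it (`merge_short_fragment`, `join_with_gap`, `gap`, `low_qual_char`) is hypothetical. The first function assumes that when a fragment is shorter than the read length, both reads start with the fragment and then run into adapter, so the start of the forward read should match the end of the reverse-complemented mate. The second function assumes Phred+33 quality strings, where `!` is the lowest score; data in the older Illumina Phred+64 encoding would need a different character.

```python
# Sketch only: names, defaults and the Phred+33 assumption are illustrative.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_short_fragment(fwd, rev, min_overlap=10, max_mismatches=0):
    """Recover a fragment that is no longer than the read length.

    Compares the start of the forward read with the end of the
    reverse-complemented mate; anything outside the matching region
    is treated as adapter read-through and dropped.
    """
    rev_rc = reverse_complement(rev)
    for k in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        head = fwd[:k]
        tail = rev_rc[len(rev_rc) - k:]
        if sum(a != b for a, b in zip(head, tail)) <= max_mismatches:
            return head  # the fragment itself, adapters clipped off
    return None


def join_with_gap(fwd_seq, fwd_qual, rev_seq, rev_qual,
                  gap=3, low_qual_char="!"):
    """Join a non-overlapping pair into one record with an N spacer.

    The mate is reverse complemented and its quality string reversed;
    the spacer gets the lowest quality character.
    """
    if len(fwd_seq) != len(fwd_qual) or len(rev_seq) != len(rev_qual):
        raise ValueError("sequence and quality lengths differ")
    seq = fwd_seq + "N" * gap + reverse_complement(rev_seq)
    qual = fwd_qual + low_qual_char * gap + rev_qual[::-1]
    return seq, qual
```

A wrapper could try a normal overlap merge first, then `merge_short_fragment`, and fall back to `join_with_gap`, so that every pair ends up in a single output file. A low `min_overlap` in `merge_short_fragment` can produce chance matches on ordinary pairs, so the order of these checks and the thresholds would need testing on real data.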



        • #5
          Hi Tectona,
          Thanks for your thoughtful review! Tanja Magoč and Steven L. Salzberg compare FLASH and SHE-RA in their Bioinformatics paper, but I haven't compared the two programs' results myself. FLASH is obviously much, much faster, but given that library prep and sequencing pipelines take weeks, don't let this be the sole determining factor in your choice of program (SHE-RA is also fully parallelizable, so it needn't take 60 hours). The other drawback of SHE-RA that you mention is that it "forces sequences to join." This was actually a conscious choice on my part: I thought that reporting all joins (with an associated confidence metric for each) would allow the user to choose how stringently to filter reads (with the provided script, filterReads.pl), a choice that I thought might depend heavily on a user's sequencing quality and downstream applications. This is detailed in our PLoS ONE supplement (doi:10.1371/journal.pone.0011840). I have been trying to add commonly requested features to SHE-RA, so thanks for the feedback on which output formats would be useful for you.
          sonia
