Old 11-23-2011, 08:22 AM   #1
Location: Thailand

Join Date: Feb 2011
Posts: 11
merging paired reads - any software out there?

Is there any software that can merge paired reads from short fragments into complete fragment sequences?
We sequenced short fragments (140 to 190 bp; the target was 180 to 220 bp, but a size-selection error occurred) using 100 bp Illumina paired-end reads. For most fragments, the two reads of a pair overlap enough to be merged into a single fragment sequence. I tested this manually on a few hundred reads: take a read, find its mate, reverse complement one of the two sequences, and align. I found only two pairs with a mismatch in the overlap region.
Doing this manually for several million sequenced fragments is not feasible, though.
So is there any software that could handle this?
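
For anyone who wants to see the manual procedure described above spelled out, here is a minimal Python sketch of it: reverse complement the mate, then look for the longest suffix of the forward read that matches a prefix of the reverse-complemented mate. The function names and the `min_overlap` / `max_mismatches` parameters are made up for illustration; this ignores quality scores and is not how any particular published tool works.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(fwd, rev, min_overlap=10, max_mismatches=0):
    """Merge a forward read with its mate (hypothetical helper).

    Returns the merged fragment sequence, or None if no overlap of at
    least min_overlap bases with at most max_mismatches is found.
    """
    rev_rc = reverse_complement(rev)
    longest = min(len(fwd), len(rev_rc))
    # Try the longest candidate overlap first, then shorter ones.
    for overlap in range(longest, min_overlap - 1, -1):
        tail = fwd[-overlap:]      # end of the forward read
        head = rev_rc[:overlap]    # start of the reverse-complemented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            # In the overlap the forward read's bases are kept.
            return fwd + rev_rc[overlap:]
    return None
```

With a 30 bp toy fragment and 20 bp reads, `merge_pair(fragment[:20], reverse_complement(fragment[10:]), min_overlap=5)` should give back the fragment. A real tool would also use the base qualities to resolve mismatches in the overlap.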
Tectona
Old 11-23-2011, 09:15 AM   #2
Senior Member
Location: Sweden

Join Date: Mar 2008
Posts: 324

Try this:
Chipper
Old 11-24-2011, 07:01 AM   #3
Location: Toronto

Join Date: Nov 2010
Posts: 21

Another tool, FLASH, was published recently and compared to SHERA:
I haven't tried either of them yet. It would be great if you could tell us whether they work!

Emilie
Old 12-18-2011, 10:25 PM   #4
Location: Thailand

Join Date: Feb 2011
Posts: 11
Merging sequences

First of all, thanks Chipper and Emilie.
I have installed both SHERA and FLASH and compared them. The differences between the two programs are huge. SHERA does a very good job of splicing the sequences, even when the combined sequences turn out to be derived from a fragment that is shorter than the read length (100 bp in my case). It does that by recognizing the adapter sequences (listed in a separate small file) and clipping them off the spliced reads accordingly. However, SHERA has two major drawbacks. One is that it forces sequences to join, even if there should be a gap between the forward and reverse reads (i.e. the sequenced fragment is longer than the two reads combined). As a result, it creates spliced reads that are wrong. Since the reads I obtained were mostly of very high quality, I reduced the minimum overlap length to 3 bp instead of the default 10 bp, which reduced the number of wrongly spliced reads, but many still remained, in addition to about 4% that should never have been spliced at all.
One of the strengths of FLASH is precisely that it can decline to splice reads that fail its criteria (minimum overlap, maximum mismatch). However, FLASH cannot handle reads derived from fragments shorter than the read length, and puts all of those in the "nonCombined" files as well.
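
To illustrate why the short-fragment case needs separate handling: when the fragment is shorter than the read, both reads run past the end of the fragment into adapter, so the fragment sits at the start of the forward read and at the end of the reverse-complemented mate. The sketch below detects that layout directly from the two reads, without an adapter file. It is my own illustration, not SHERA's adapter-based method and not part of FLASH; `find_short_fragment` and its parameters are hypothetical names.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_short_fragment(fwd, rev, min_len=10, max_mismatches=0):
    """Recover a fragment shorter than the read length (hypothetical helper).

    Looks for a length L, shorter than both reads, such that the first L
    bases of the forward read match the last L bases of the
    reverse-complemented mate. Everything outside that window is assumed
    to be adapter read-through and is dropped.
    Returns the fragment sequence, or None if no such window is found.
    """
    rev_rc = reverse_complement(rev)
    longest = min(len(fwd), len(rev_rc)) - 1
    for frag_len in range(longest, min_len - 1, -1):
        start_of_fwd = fwd[:frag_len]
        end_of_rev_rc = rev_rc[-frag_len:]
        mismatches = sum(a != b for a, b in zip(start_of_fwd, end_of_rev_rc))
        if mismatches <= max_mismatches:
            return start_of_fwd
    return None
```

A merger could try the ordinary overlap first and fall back to this check only for pairs that fail, rather than sending them straight to a "not combined" output.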
When it comes to speed, the conclusion is very clear: FLASH does it ... in a flash. FLASH took only about 30 minutes to go through 2 x 8 million reads, while SHERA took about 60 hours.
So, my preferred program would have the speed of FLASH and its ability not to force splicing of reads, but would also check for fragments shorter than the read length, as SHERA does, and combine the forward and reverse reads while clipping off the linker sequences the way SHERA does. I would also like an option where reads that cannot be properly merged, because their overlap is shorter than the minimum, are still joined together with some NNN and corresponding low quality scores in between them, to keep all information in a single file instead of three.
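
The NNN-joining option wished for above could look roughly like this in Python. This is only a sketch of the idea: `join_with_gap` is a made-up name, the gap length of 3 is arbitrary (the true gap size is unknown), and the `!` filler assumes Phred+33 quality encoding, where `!` is the lowest score; data in the older Phred+64 encoding would need a different character.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def join_with_gap(fwd_seq, fwd_qual, rev_seq, rev_qual, gap=3, low_qual="!"):
    """Join an unmergeable pair into one record (hypothetical helper).

    The mate is reverse complemented and its quality string reversed so
    both halves read in the same orientation, with `gap` N bases of the
    lowest quality in between. Returns (sequence, quality).
    """
    seq = fwd_seq + "N" * gap + reverse_complement(rev_seq)
    qual = fwd_qual + low_qual * gap + rev_qual[::-1]
    return seq, qual
```

For example, `join_with_gap("ACGT", "IIII", "AACC", "HHGG")` returns `("ACGTNNNGGTT", "IIII!!!GGHH")`. Downstream tools would have to treat the N run as a gap of unknown length rather than as three real bases.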
Tectona
Old 12-19-2011, 04:29 AM   #5
Junior Member
Location: Cambridge

Join Date: Oct 2008
Posts: 2

Hi Tectona,
Thanks for your thoughtful review! Tanja Magoč and Steven L. Salzberg compare FLASH and SHE-RA in their Bioinformatics paper, but I haven't compared the two programs' results myself. FLASH is clearly much faster, but given that library prep and sequencing pipelines take weeks, don't let speed be the sole determining factor in your choice of program (SHE-RA is also fully parallelizable, so it needn't take 60 hours). The other drawback of SHE-RA that you mention is that it "forces sequences to join." This was actually a conscious choice on my part: I thought that reporting all joins, with an associated confidence metric for each, would allow the user to choose how stringently to filter reads (with the provided script), a choice that I thought might depend heavily on the user's sequencing quality and downstream applications. This is detailed in our PLoS ONE supplement (doi:10.1371/journal.pone.0011840). I have been trying to add commonly requested features to SHE-RA, so thanks for the feedback on which output formats would be useful to you.
soniat
