SEQanswers

Old 09-13-2010, 08:23 AM   #1
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Speeding up alignment? Do Bowtie first, then BFAST?

Hi everyone,

Just wanted to pick anyone's brain and get their opinion on this.

I'm drowning in more and more sequence data while, at the same time, being limited in computing resources. I do not want to go to the cloud (not yet, at least)!

I am running BFAST, which does a great job for my tumor mutation analysis (it allows more mismatches, and allows indels, compared to Bowtie). But it is incredibly resource- and time-hungry, because all those local alignments take time.

My question is: if I were to do a first pass of the reads with Bowtie, say without mismatches in the seeds, and then take the leftover unmatched reads and align them with BFAST, would that be reasonable? Or would I risk losing better alignments that BFAST might have found?

I am thinking that by getting the "near perfect" matching reads out of the way, I can then feed the rest to BFAST to handle the more complicated reads holding indels and multiple mismatches...

What do you think? Is this a bad idea?
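Concretely, a two-pass run might look something like the sketch below. This is only a dry run (the `run` helper echoes each command instead of executing it), and the reference, index, and file names are placeholders, not settings from any real pipeline:

```shell
#!/bin/sh
# Dry-run sketch of the proposed two-pass pipeline.
# 'run' only echoes each command; change its body to "$@" to really execute.
run() { echo "$@"; }

REF=ref.fa            # reference FASTA (placeholder name)
READS=sample.fastq    # input reads (placeholder name)

# Pass 1: Bowtie with no mismatches in the seed (-n 0);
# --un collects the reads that fail to align.
run bowtie -n 0 -S --un unaligned.fastq bowtie_index "$READS" pass1.sam

# Pass 2: hand only the leftovers to BFAST's three stages.
run bfast match -f "$REF" -r unaligned.fastq > pass2.bmf
run bfast localalign -f "$REF" -m pass2.bmf > pass2.baf
run bfast postprocess -f "$REF" -i pass2.baf > pass2.sam
```

How strict pass 1 should be is exactly the trade-off in question: the stricter the first pass, the fewer dubious alignments it locks in before BFAST ever sees the read.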
Old 09-13-2010, 08:34 AM   #2
mrawlins
Member
 
Location: Retirement - Not working with bioinformatics anymore.

Join Date: Apr 2010
Posts: 63

A similar approach is used by TopHat for splice junction analysis. In theory there is the possibility that you would lose some better alignments, but I don't think you would lose many in practice. The easiest matches will be made by pretty much all aligners; most of the differences come from difficult-to-match reads. It's also worth noting that if BFAST and bowtie get different results for a high-quality low-mismatch read, you will likely have problems with false positives for that read.

I would not expect this sort of two-stage process to produce much data loss, if any.
Old 09-13-2010, 08:59 AM   #3
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323

Quote:
Originally Posted by NGSfan View Post
Hi everyone,
My question is, if I were to do a first pass of the reads with Bowtie, say without mismatches in the seeds, and then take the left-over unmatched reads and align them with BFAST, would that be reasonable? or would I risk losing better alignments that might have been done with BFAST?

What do you think? Is this a bad idea?
I think it is a good idea. One suggestion: also try bwa for the first iteration. It is both accurate and fast.

Have you done any testing in the cloud already? I'd love to hear more about it.
__________________
-drd
Old 09-13-2010, 01:35 PM   #4
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Thanks! I will try BWA, since it seems a good compromise between speed and accuracy and might be better than Bowtie for a first pass. BFAST still does a better job, though, particularly with larger indels (>10bp).

Btw - do you know if BWA will output the unaligned reads like Bowtie does?

I also saw some papers showing that Novoalign is pretty accurate, but I don't know how fast it is in comparison.
Old 09-13-2010, 04:55 PM   #5
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by NGSfan View Post
Thanks! I will try BWA, since it seems a good compromise between speed and accuracy and might be better than Bowtie for a first pass. BFAST still does a better job, though, particularly with larger indels (>10bp).

Btw - do you know if BWA will output the unaligned reads like Bowtie does?

I also saw some papers showing that Novoalign is pretty accurate, but I don't know how fast it is in comparison.
BWA outputs unaligned reads. Novoalign is more accurate and sensitive than BWA/BFAST/others, but is generally slower. How much slower depends on the type of data and the compute infrastructure.
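For what it's worth, bwa reports unmapped reads in its SAM output with flag bit 0x4 set, so they can be pulled back out as FASTQ for a second pass. A rough awk sketch of the flag logic, using a tiny fabricated two-read SAM (a real pipeline would more likely use samtools or a dedicated converter):

```shell
#!/bin/sh
# Two fabricated SAM records: read1 mapped (flag 0), read2 unmapped (flag 4).
# Real SAM is tab-separated; spaces are used here and awk splits on either.
cat > example.sam <<'EOF'
@SQ SN:chr1 LN:1000
read1 0 chr1 100 37 4M * 0 0 ACGT IIII
read2 4 * 0 0 * * 0 0 TTTT IIII
EOF

# Keep records whose flag has bit 0x4 (read unmapped) set,
# and write them back out as FASTQ records.
awk '!/^@/ && int($2/4)%2==1 {print "@"$1"\n"$10"\n+\n"$11}' example.sam > unaligned.fastq

cat unaligned.fastq
```

Only read2 comes out as FASTQ; the mapped read and the header line are dropped.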
Old 09-13-2010, 07:43 PM   #6
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323

I started working with novoalign and I am pretty impressed. I am still doing some more testing, but I am already seeing what Nils points out. He actually advised me to use novoalign to align the unaligned reads from bfast, but at this point I am considering starting with novoalign. We'll see how it scales.
__________________
-drd
Old 09-13-2010, 08:49 PM   #7
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

Novoalign is nearly as fast aligning all the reads as it is aligning just the unaligned reads from Bowtie or Bfast: the unaligned reads are usually the most difficult to align and take the longest, while Novoalign's iterative alignment process flies through the easy-to-align reads.
The other thing to watch out for is false positive alignments produced by your first aligner. These won't be in the unaligned file, and they will add noise to your SNV analysis, so we recommend using Novoalign from the start.
As Nils mentioned, Novoalign's performance can be affected by the data, especially if you have a bad run with lots of low-quality bases. The latest version of Novoalign has a quality filter (the -p option) that can be used to filter low-quality reads. You can also use the -l option to filter reads that have a lot of very low-quality bases; set -l to about 2/3rds of the read length.
By default Novoalign allows a very high level of mismatches and long indels, especially with long paired-end reads. This slows down alignment, especially for the reads that just won't align.
If you want faster alignment, try decreasing the alignment threshold. We have users aligning a lane of 45bp paired-end reads in 20 minutes on a 32-core server at -t 180. The default threshold is around 5*(l-15), where l is the read length (the sum of both reads for pairs); try setting -t to 3*(l-15) or even lower. A threshold of 250 would allow a 10bp indel and a couple of SNPs.
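Worked out for that 45bp paired-end example (treat the constants as approximate):

```shell
#!/bin/sh
L=90                       # 45bp paired-end: l = sum of both read lengths
DEFAULT=$((5 * (L - 15))) # default threshold, roughly 5*(l-15)
FAST=$((3 * (L - 15)))    # a faster setting, roughly 3*(l-15)
echo "default ~ $DEFAULT, faster: -t $FAST"   # default ~ 375, faster: -t 225
```

So for 45bp pairs the default sits around 375, and dropping -t to roughly 225 (or the 180 mentioned above) trades some sensitivity for speed.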
Old 09-14-2010, 06:48 AM   #8
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

The experience from G1K was that, given decent reads under the default options, novoalign was about 2-3X slower than bwa for 100bp reads, but >10X slower for 40bp reads. G1K opted out of novoalign in the end, partly because of this (G1K has produced a lot of 36bp reads) and also because its free version (at least at that time) did not support multi-threading while taking more than 6.5GB of memory.

As to accuracy, it depends on the application you have. If you do ChIP-seq/RNA-seq, even bowtie is fine. If you want to find SNPs, the accuracy of bwa is acceptable. For indels, novoalign will be better, but not by much, I guess (no proof). For SVs, I think one should consider taking the intersection of two distinct aligners, as hydra-sv recommends.
Old 09-14-2010, 08:26 AM   #9
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by drio View Post
Have you done any testing in the cloud already? I'd love to hear more about it.
Sorry, I did not answer your question. No, I haven't tried it, but it was a suggestion. What worries me is the idea of having to transfer all that data over some limited-bandwidth connection.

I have seen some rather fast transfers (e.g. 500GB transferred overnight) between cities on a company's internal network, so it's possible. I just don't know if transferring to the Amazon cloud would be as fast.

I imagine FastQs and a reference genome + index might not be too bad... I have to look into it some more. There are lots of conflicting opinions on the whole thing (cloud vs. in-house)!
Old 09-14-2010, 07:11 PM   #10
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119

For our structural variation analyses with Hydra, we use a similar, tiered approach using BWA followed by Novoalign. Using even default settings, BWA is very fast and reasonably sensitive. We use Novoalign as a second pass on the discordant/aberrant pairs that BWA claims are not concordant with the reference genome. As discordant pairs are a primary signal for SV, one wants to be as sensitive as possible when deciding whether or not a given pair is discordant (else a burdensome load of false positives). We find that Novoalign does the best job we've seen at detecting "cryptic" concordant pairs that are otherwise missed by other aligners.

In addition, as lh3 mentions, Novoalign's speed improves substantially as read lengths and accuracy increase. As I understand it, it has also undergone some algorithmic improvements that further expedite alignment. We've recently found that Novoalign is acceptably fast as both a first (less sensitive settings) and second (crank up the sensitivity: -r E 1000) tier with recent 100bp paired-end human data having overall error rates less than 2%.

In short, substantial work has gone into improving alignment speed and sensitivity. The fact remains that alignment is everything when analyzing NGS data. In my experience, shortcuts during alignments lead to painful and artefactual analyses.

I hope this helps.
Aaron
Old 09-27-2010, 05:07 AM   #11
warrenemmett
Member
 
Location: South Africa

Join Date: Nov 2008
Posts: 23

Specifically concerning RNA-seq analysis: the new tools to align spliced reads (such as MapSplice) are shown to be more sensitive than TopHat (mentioned earlier), and I was wondering whether, in this case, using these packages can in some ways provide better results (or at least drastically reduce the number of unaligned reads) than using a sensitive read aligner?

I am also curious whether anyone has any figures on how many extra reads can be aligned using a second, more sensitive aligner (I realize this is highly situational, but it is still interesting to see).
Old 09-27-2010, 08:12 AM   #12
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

The two-aligner process is basically flawed: the first aligner, the less sensitive and less specific of the two, will also align some reads in the wrong location (false positives), and no amount of aligning the unmapped reads will get rid of these incorrect alignments from the first aligner.
Old 09-28-2010, 06:27 AM   #13
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

I think the common practice is to remap paired-end reads that are not mapped "properly" by the first aligner. When a read pair is mapped properly, the chance of seeing a wrong alignment is pretty low. Sometimes, one may also want to remap reads with too many mismatches, which is also a sign of misalignment.
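As a sketch of that selection: in SAM, flag bit 0x2 marks a read mapped in a proper pair, and the NM tag carries the edit distance, so one can flag reads for remapping when 0x2 is unset or NM is above some cutoff. The records and the cutoff below are made up for illustration (in practice samtools would do the flag filtering):

```shell
#!/bin/sh
# Three fabricated records: a clean proper-pair member (good1), a discordant
# mate (disc1, bit 0x2 unset), and a proper-pair member with many mismatches
# (mess1). Real SAM is tab-separated; awk splits on any whitespace.
cat > pairs.sam <<'EOF'
good1 99 chr1 100 60 50M = 300 250 ACGTACGT IIIIIIII NM:i:1
disc1 97 chr1 100 60 50M chr2 500 0 ACGTACGT IIIIIIII NM:i:0
mess1 99 chr1 200 60 50M = 400 250 ACGTACGT IIIIIIII NM:i:7
EOF

# Select read names to remap: proper-pair bit (0x2) unset, or NM above MAXNM.
awk -v MAXNM=4 '!/^@/ {
    proper = int($2/2) % 2            # test flag bit 0x2
    nm = 0
    for (i = 12; i <= NF; i++)        # optional tags start at field 12
        if ($i ~ /^NM:i:/) { split($i, a, ":"); nm = a[3] + 0 }
    if (!proper || nm > MAXNM) print $1
}' pairs.sam > remap.list

cat remap.list   # disc1 and mess1; good1 stays put
```

The discordant mate and the high-mismatch read get queued for the second aligner; the clean proper pair is left alone.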
Old 09-28-2010, 08:56 AM   #14
warrenemmett
Member
 
Location: South Africa

Join Date: Nov 2008
Posts: 23

Quote:
Originally Posted by sparks View Post
The two-aligner process is basically flawed: the first aligner, the less sensitive and less specific of the two, will also align some reads in the wrong location (false positives), and no amount of aligning the unmapped reads will get rid of these incorrect alignments from the first aligner.
Would you mind elaborating on this? I cannot see the difference between having novoalign run a fast first-pass alignment to map the majority of reads (if reads are mapped incorrectly here, wouldn't they remain incorrectly mapped?) and then aligning the rest using more sensitive parameters, compared to running a fast aligner with stringent match criteria followed by a more sensitive one.
Old 09-28-2010, 10:01 AM   #15
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

Quote:
Originally Posted by lh3 View Post
I think the common practice is to remap paired-end reads that are not mapped "properly" by the first aligner. When a read pair is mapped properly, the chance of seeing a wrong alignment is pretty low. Sometimes, one may also want to remap reads with too many mismatches, which is also a sign of misalignment.
If you are really worried about this, you could set the cutoffs so that only reads which couldn't possibly map better with a different algorithm would be kept from the first pass. The obvious case is perfect matches, but certainly some mismatches will be accepted by any algorithm.
Old 09-28-2010, 06:31 PM   #16
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

@warren
I think you misunderstood, and that we are agreeing. Maybe I wasn't clear, but I was trying to say that if a read was mapped in the first pass to the wrong location, or incorrectly aligned (i.e. with mismatches when it should have had an indel), then it will stay that way, as it doesn't get into the unmapped file.
Use Bowtie (ungapped) then Novoalign and you'll have this problem.
Using Novoalign/Novoalign with a low threshold on the first pass is OK, and will produce the same results as just one run at a high threshold.
BWA as the first aligner should be fine, as it has a low false positive rate.
Old 09-28-2010, 06:45 PM   #17
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126

I think Heng Li and krobison have it right. You need to take more than the unmapped reads: you also need to take pairs that are not properly aligned, and alignments with a number of mismatches (or a score) above some limit. The limit will depend on the first aligner used.