Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| BFAST Alignment Troubleshoot | jfellows88 | Bioinformatics | 3 | 01-24-2012 01:23 AM |
| Bfast alignment | madhu | Bioinformatics | 3 | 08-24-2011 10:03 AM |
| c.elegans genome alignment recommended settings for BFAST | hylei | Bioinformatics | 0 | 03-18-2011 08:36 AM |
| bfast gapped alignment | Protaeus | Bioinformatics | 1 | 08-30-2010 10:33 PM |
| BFAST - Alignment for ABI or Illumina sequencing - with qualities | nilshomer | Bioinformatics | 4 | 11-20-2009 11:35 AM |
#1
Senior Member | Location: Austria | Join Date: Apr 2009 | Posts: 181

Hi everyone,

Just wanted to pick everyone's brains and get opinions on this. I'm drowning in more and more sequence data while, at the same time, being limited in computing resources. I do not want to go to the cloud (not yet, at least)! I am running BFAST, which does a great job for my tumor mutation analysis (it allows more mismatches and indels than Bowtie), but it is incredibly resource- and time-hungry, because it takes time to do all those local alignments. My question is: if I were to do a first pass of the reads with Bowtie, say without mismatches in the seeds, and then take the left-over unmatched reads and align them with BFAST, would that be reasonable? Or would I risk losing better alignments that BFAST might have found? I am thinking that by getting the "near perfect" matching reads out of the way, I can then feed the rest to BFAST to handle the more complicated reads holding indels and multiple mismatches... What do you think? Is this a bad idea?
#2
Member | Location: Retirement - Not working with bioinformatics anymore. | Join Date: Apr 2010 | Posts: 63

A similar approach is used by TopHat for splice junction analysis. In theory there is the possibility that you would lose some better alignments, but I don't think you would lose many in practice. The easiest matches will be made by pretty much all aligners; most of the differences come from difficult-to-match reads. It's also worth noting that if BFAST and Bowtie get different results for a high-quality, low-mismatch read, you will likely have problems with false positives for that read.

I would not expect this sort of two-stage process to produce much data loss, if any.
#3
Senior Member | Location: 41°17'49"N / 2°4'42"E | Join Date: Oct 2008 | Posts: 323

Quote:
Have you done any testing in the cloud already? I'd love to hear more about it.
__________________
-drd |
#4
Senior Member | Location: Austria | Join Date: Apr 2009 | Posts: 181

Thanks! I will try BWA, since it seems a good compromise between speed and accuracy and might be better than Bowtie for a first pass, although BFAST still does a better job, particularly with larger indels (>10bp).

BTW, do you know if BWA will output the unaligned reads like Bowtie does? I also saw a paper showing Novoalign is pretty accurate, but I don't know how fast it is in comparison.
#5
Nils Homer | Location: Boston, MA, USA | Join Date: Nov 2008 | Posts: 1,285

Quote:
#6
Senior Member | Location: 41°17'49"N / 2°4'42"E | Join Date: Oct 2008 | Posts: 323

I started working with Novoalign and I am pretty impressed. I am still doing some more testing, but I am already seeing what Nils points out. He actually advised me to align the unaligned reads from BFAST with Novoalign, but at this point I am considering starting with Novoalign. We'll see how it scales.
__________________
-drd |
#7
Senior Member | Location: Kuala Lumpur, Malaysia | Join Date: Mar 2008 | Posts: 126

Novoalign is nearly as fast aligning all reads as it is aligning just the unaligned reads from Bowtie or BFAST: the unaligned reads are usually the most difficult and take the longest, while Novoalign's iterative alignment process flies through the easy-to-align reads.

The other thing to watch out for is false-positive alignments produced by your first aligner; these won't be in the unaligned file, and they will add noise to your SNV analysis. We recommend using Novoalign from the start.

As mentioned by Nils, Novoalign's performance can be affected by the data, especially if you have a bad run with lots of low-quality bases. The latest version of Novoalign has a quality filter (-p option) that can be used to filter low-quality reads. You can also use the -l option to filter reads that have a lot of very low-quality bases; set -l to about 2/3 of the read length.

By default Novoalign allows a very high level of mismatches and long indels, especially with long paired-end reads. This slows down alignment, especially for the reads that just won't align. If you want faster alignment, try decreasing the alignment threshold. We have users aligning a lane of 45bp paired-end reads in 20 minutes on a 32-core server at -t 180. The default threshold is around 5*(l-15), where l is the read length (sum of both reads for pairs); try setting -t to 3*(l-15) or even lower. A threshold of 250 would allow a 10bp indel and a couple of SNPs.
#8
Senior Member | Location: Boston | Join Date: Feb 2008 | Posts: 693

The experience from G1K was that, given decent reads under the default options, Novoalign was about 2-3X slower than BWA for 100bp reads, but >10X slower for 40bp reads. G1K opted out of Novoalign in the end, partly because of this (G1K has produced a lot of 36bp reads) and also because its free version (at least at that time) did not support multi-threading while taking more than 6.5GB of memory.

As to accuracy, it depends on the application. If you do ChIP-seq/RNA-seq, even Bowtie is fine. If you want to find SNPs, the accuracy of BWA is acceptable. For indels, Novoalign will be better, but not by much, I guess (no proof). For SVs, I think one should consider taking the intersection of two distinct aligners, as hydra-sv recommends.
#9
Senior Member | Location: Austria | Join Date: Apr 2009 | Posts: 181

Quote:
I have seen some rather fast transfers (e.g. 500 GB transferred overnight) between cities on a company's internal network, so it's possible. I just don't know if transferring to the Amazon cloud would be as fast. I imagine FASTQs and a reference genome + index might not be too bad... I have to look into it some more. Lots of conflicting opinions on the whole thing (cloud vs. in-house)!
#10
Senior Member | Location: Charlottesville | Join Date: Sep 2008 | Posts: 119

For our structural variation analyses with Hydra, we use a similar, tiered approach: BWA followed by Novoalign. Even using default settings, BWA is very fast and reasonably sensitive. We use Novoalign as a second pass on the discordant/aberrant pairs that BWA claims are not concordant with the reference genome. As discordant pairs are a primary signal for SV, one wants to be as sensitive as possible when deciding whether or not a given pair is discordant (else a burdensome load of false positives). We find that Novoalign does the best job we've seen at detecting "cryptic" concordant pairs that are otherwise missed by other aligners.

In addition, as lh3 mentions, Novoalign's speed improves substantially as read lengths and accuracy increase. As I understand it, it has also undergone some algorithmic improvements that further expedite alignment. We've recently found that Novoalign is acceptably fast as both a first (less sensitive settings) and a second (crank up the sensitivity! -r E 1000) tier with recent 100bp paired-end human data having overall error rates less than 2%.

In short, substantial work has gone into improving alignment speed and sensitivity. The fact remains that alignment is everything when analyzing NGS data. In my experience, shortcuts during alignment lead to painful and artefactual analyses.

I hope this helps.
Aaron
#11
Member | Location: South Africa | Join Date: Nov 2008 | Posts: 23

A question specifically concerning RNA-seq analysis: the newer tools for aligning spliced reads (such as MapSplice) have been shown to be more sensitive than TopHat (mentioned earlier), and I was wondering whether, in this case, these packages can provide better results (or at least drastically reduce the number of unaligned reads) than a sensitive general-purpose read aligner?

I am also curious whether anyone has figures on how many extra reads can be aligned using a second, more sensitive aligner (I realize this is highly situational, but it would still be interesting to see).
#12
Senior Member | Location: Kuala Lumpur, Malaysia | Join Date: Mar 2008 | Posts: 126

The two-aligner process is basically flawed: the first aligner, being less sensitive and less specific, will also align some reads in the wrong location (false positives), and no amount of aligning the unmapped reads will get rid of these incorrect alignments from the first aligner.
#13
Senior Member | Location: Boston | Join Date: Feb 2008 | Posts: 693

I think the common practice is to remap paired-end reads that are not mapped "properly" by the first aligner. When a read pair is mapped properly, the chance of seeing a wrong alignment is pretty low. Sometimes, one may also want to remap reads with too many mismatches, which is also a sign of misalignment.
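That remap rule can be sketched directly from SAM fields: send to the second, more sensitive aligner anything unmapped, improperly paired, or carrying too many mismatches in its NM tag. The mismatch cutoff here is an illustrative placeholder, not a recommended value:

```python
def needs_remap(sam_fields, max_mismatches=3):
    """Decide whether an alignment should go to the second, more sensitive
    aligner: unmapped, pair not 'properly' mapped, or too many mismatches
    (NM tag). `sam_fields` is one SAM line split on tabs."""
    flag = int(sam_fields[1])
    if flag & 0x4:                          # read unmapped
        return True
    if (flag & 0x1) and not (flag & 0x2):   # paired but not properly paired
        return True
    for tag in sam_fields[11:]:             # optional tags, e.g. NM:i:5
        if tag.startswith("NM:i:"):
            return int(tag[5:]) > max_mismatches
    return False
```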
#14
Member | Location: South Africa | Join Date: Nov 2008 | Posts: 23

Quote:
#15
Senior Member | Location: Boston area | Join Date: Nov 2007 | Posts: 747

Quote:
#16
Senior Member | Location: Kuala Lumpur, Malaysia | Join Date: Mar 2008 | Posts: 126

@warren
I think you misunderstood; we are agreeing. Maybe I wasn't clear, but I was trying to say that if a read was mapped in the first pass to the wrong location, or incorrectly aligned (i.e. with mismatches when it should have had an indel), then it will stay that way, as it never reaches the unmapped file. Use Bowtie (ungapped) then Novoalign and you'll have this problem. Using Novoalign then Novoalign, with a low threshold on the first pass, is OK and will produce the same results as a single run at a high threshold. BWA as the first aligner should be fine, as it has a low false-positive rate.
#17
Senior Member | Location: Kuala Lumpur, Malaysia | Join Date: Mar 2008 | Posts: 126

I think Heng Li and krobison have it right. You need to take more than the unmapped reads: also take pairs that are not properly aligned, plus alignments with a number of mismatches, or a score, above some limit. The limit will depend on the first aligner used.