**myrna** · 06-16-2008, 06:54 AM

Will it be open source?

Hi Colin.
This sounds cool.
Can you just confirm for us whether or not you plan to make this aligner open source?

**sparks** · 06-16-2008, 07:10 AM

Hi myrna,

At the moment it's not open source but it will be free for open projects and non-profit organisations.
I might make it open source if I had some funding.

Colin

**tree** · 07-07-2008, 06:48 PM

Novoalign test

I have done some testing of sparks’ program Novoalign.

This program seems to be incredibly fast. It requires only about 6 GB of physical RAM for aligning to the human genome. Using simulated reads with no mismatches, the program gives the same results as SOAP; however, Novoalign is more than 100x faster (half a million reads took just over half a minute on a 3.6 GHz CPU).

I have tested some real SOLID reads translated to base space as well. Novoalign was very fast again relative to SOAP, 50 000 reads in 3 min. I used the trimming feature to help with alignment of reads that were mistranslated due to read errors. The results of uniquely mapped SOLID reads from Novoalign and SOAP were 99.96 % identical.

I would like to know whether ELAND, which is supposed to be the fastest aligner, would beat Novoalign.

**zee** · 07-08-2008, 01:56 AM

I've been using novoalign as well and my bet is that ELAND should be faster than novoalign at default because novoalign will spend a little more time looking for those extra mismatches and gaps. At a threshold of 60 novoalign should be as fast as ELAND or perhaps a bit faster. ELAND achieves better performance because it indexes reads and does a fast scan of the genome.
Perhaps somebody would be willing to try it out. Take a few million paired-end/single-end reads and see how novoalign at threshold 60 would do in comparison to ELAND on the same server specification.

**lh3** · 07-15-2008, 02:02 AM

I have just tried novo*. A wonderful software. As previously, I only tried it on human chrX. It is as fast as eland. I kind of believe novo* should be faster on the whole human genome as indexing will be more efficient than on chrX.

(Sorry, I was wrong previously and so I have removed the paragraph. Quite amazing to me. And as I was wrong, novo* looks even more superior.)

I think it is very important for novo* to support multithreading; otherwise parallelization would be a big problem.

Novopair does work for me and it improves overall alignment accuracy. However, novopair is overoptimistic about the alignment accuracy. The error rate of Q150 alignments is 0.05%. This error rate is good enough, but it would be better to improve this more or less. This may be of more theoretical concern.

In all, novo* is really a good set of programs. It is fast and integrates the advantages of most existing programs. I just hope the author could get funding and make it an open source project.

PS: So far as I know, only SOLiD's own software and shrimp fully support color alignment. Maq does so partially. Neither novo* nor soap supports color alignments. Note that it is not right to do SOLiD alignment in the nucleotide space.

**zee** · 07-15-2008, 02:21 AM

see next...

**zee** · 07-15-2008, 02:23 AM

Thanks for the comments, lh3. We're working on improving accuracy. Something to be aware of with novo is the alignment threshold, the "-t" parameter. Setting this very high, e.g. -140, for single-end alignment will report more false positives (FP). It's always tricky working out the right default threshold. Setting it too high will escalate FP, and if it's too low, e.g. > -60, then you don't pick up enough.
I think the author will be aware of these technicalities and this sort of feedback will help to improve the software. The foreseeable plans are to keep it open for just about everybody in the research community.

**sparks** · 07-15-2008, 04:31 AM

Hi Li Heng,
Thanks for your kind comments.
Performance slows on larger genomes as more possible alignment locations are evaluated for each read. Additional memory helps here as it makes the index more specific and while it can be run on an 8GB RAM server (Full Human) a 16G or 32G server is going to be 4 or 5 times faster.
With regard to multithreading, the index is memory mapped and it's quite possible to run multiple copies of novoalign (same target genome) without any increase in memory. That said, multithreading wouldn't be too difficult as the search classes are all designed to handle it. I need to see if there is real demand.
The quality calculation is similar in principle to maq's: it is the Bayesian posterior probability that the alignment is wrong. Some factors are estimated, and one possible problem is that I rate the reference genome at 2 bits of entropy per base; this may be the cause of the high qualities.
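
To make the principle concrete, here is a minimal sketch of a Phred-scaled posterior of this kind. It is not novoalign's (or maq's) actual calculation: the function name, the flat prior over candidate locations, the use of summed mismatch qualities as the penalty, and the cap of 150 (picked only because Q150 is the highest value mentioned in this thread) are all assumptions for illustration.

```python
import math

def alignment_quality(phred_penalties, cap=150):
    """Phred-scaled posterior probability that the best candidate is wrong.

    phred_penalties: one score per candidate location (hypothetical input),
    each the sum of base qualities at mismatched positions, so a candidate's
    likelihood is proportional to 10 ** (-penalty / 10). A flat prior over
    candidate locations is assumed.
    """
    likelihoods = [10 ** (-p / 10) for p in phred_penalties]
    best = max(likelihoods)
    rest = sum(likelihoods) - best      # probability mass of all other hits
    p_wrong = rest / (best + rest)
    if p_wrong <= 0.0:
        return cap                      # no competing location at all
    return min(cap, round(-10 * math.log10(p_wrong)))
```

With two equally good locations the quality collapses to about 3, while a single candidate with no competitor hits the cap. Any term that underestimates the competing mass (for example, an optimistic model of how random the reference is) pushes the reported quality up, which would be consistent with the overoptimism noted earlier in the thread.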

I deliberately haven't done SOLID as I'd like to do it properly or not at all. That said, if someone wants to try, I suggest converting the reference genome to colour space rather than the reads to nucleotide space.
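
For anyone who wants to try that suggestion, a minimal sketch of the conversion follows. It uses the standard SOLiD di-base encoding, where each colour is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function name and the choice of '.' for transitions involving an unknown base are assumptions for illustration, not part of any aligner.

```python
# SOLiD di-base colour code: colour = XOR of the 2-bit codes of
# two adjacent bases (A=0, C=1, G=2, T=3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colour_space(seq):
    """Return the colour string for a nucleotide sequence.

    A sequence of n bases yields n - 1 colours. Transitions that involve
    an unknown base (e.g. N) are written as '.' (an illustrative choice).
    """
    seq = seq.upper()
    colours = []
    for left, right in zip(seq, seq[1:]):
        if left in BASE_CODE and right in BASE_CODE:
            colours.append(str(BASE_CODE[left] ^ BASE_CODE[right]))
        else:
            colours.append(".")
    return "".join(colours)

print(to_colour_space("ATGGC"))  # prints 3103
```

The reason to convert the reference rather than the reads: a single colour-call error in a read corrupts every base after it once the read is translated to nucleotides, whereas in colour space it stays a single mismatch.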

**sparks** · 07-15-2008, 04:42 AM

Just one more point, even though novoalign uses a k-mer index of the genome it is not a seeded alignment ala Blast/Blat/Shrimp. It's an iterative alignment that will match the read against k-mers in the index using a combinatorial approach (with gaps).

**lh3** · 07-15-2008, 05:08 AM

see below...

**lh3** · 07-15-2008, 05:09 AM

Originally posted by sparks

Just one more point, even though novoalign uses a k-mer index of the genome it is not a seeded alignment ala Blast/Blat/Shrimp. It's an iterative alignment that will match the read against k-mers in the index using a combinatorial approach (with gaps).

Lately I could vaguely see how this could be done. But I am still keen to see the details if you publish the algorithm some day. Nice work!

**sparks** · 07-15-2008, 05:17 AM

Think blastp-type seeding with qualities replacing the BLOSUM matrix, and add gaps.
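
One way to read that hint, as a rough sketch only: blastp expands each query word into a "neighbourhood" of similar words that score above a threshold, then looks those up in the index. Replacing the substitution matrix with base qualities could look like the code below, where a substitution at a position costs that base's Phred quality. This is a guess at the idea, not the novoalign algorithm; the function name and the threshold are hypothetical, and gaps are left out entirely.

```python
def neighbourhood(kmer, quals, threshold):
    """Yield (word, penalty) for every k-mer whose penalty is <= threshold.

    The penalty of a word is the sum of the read's base qualities at the
    positions where the word differs from the read k-mer, so low-quality
    bases are cheap to substitute and high-quality bases are expensive.
    """
    def extend(i, prefix, penalty):
        if i == len(kmer):
            yield prefix, penalty
            return
        for base in "ACGT":
            cost = 0 if base == kmer[i] else quals[i]
            # Prune branches that already exceed the threshold.
            if penalty + cost <= threshold:
                yield from extend(i + 1, prefix + base, penalty + cost)
    yield from extend(0, "", 0)

# Middle base has low quality (5), so only it may vary under threshold 10.
for word, penalty in neighbourhood("ACG", [30, 5, 30], 10):
    print(word, penalty)
```

Each word produced this way would then be looked up in the genome's k-mer index, so the search stays exact-match lookups even though mismatches are tolerated.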

**sparks** · 07-15-2008, 05:30 AM

I've been back and looked at our error rate on simulated reads and it's typically around 0.005% without selecting for quality. We've used maq simulate, modified to insert longer indels, and paf_utils (great tools), but we also had to modify the latter to allow a few extra bases of uncertainty in alignment location, as the novo aligners are much more likely to add a few gaps into an alignment than perhaps maq does.

**sparks** · 07-30-2008, 10:49 PM

Hi all,
I've just put up an update to novoalign & novopaired. This update improves quality scores for novopaired and also fixes an illegal instruction fault reported by one user.
You can download at www.novocraft.com
I've also changed the license term so it's free for any non-profit even if you don't publish in open journals.
Colin


New Short Read Aligner
