SEQanswers

Old 06-16-2008, 06:13 AM   #1
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default New Short Read Aligner

Hi,
I've been working on a short read aligner and would like to find some beta testers. The suite includes single end and paired end read aligners.
Some features are:
  • Gaps up to 7bp, affine gap penalties
  • Can handle ambiguous codes in ref sequence.
  • Quality based scoring
  • Adapter stripping for miRNA reads
  • No heuristics - reports the best alignment
  • Options for handling multiple alignments include none, random, or all alignments.
  • Alignment Quality scores
  • Can use fasta, fastq, solexa fastq, prb input formats
  • Paired end with full Needleman-Wunsch on both ends.
  • Paired end accepts a structural variation penalty and the best alignment may be two independent ends if score with SV penalty is better than the best pair that fits the fragment length distribution.
  • Supports variable read lengths
  • Includes optional soft masking of repeats.
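
As a rough illustration of the structural variation (SV) penalty rule in the list above, here is a minimal sketch. It assumes alignment scores are penalties where lower is better; the function name and the numbers are made up for illustration and are not Novoalign's actual interface.

```python
def choose_paired_result(best_proper_pair, best_end1, best_end2, sv_penalty):
    """Pick between the best proper pair and two independent end alignments.

    Scores are treated as penalties (lower is better); that is an assumption
    of this sketch. best_proper_pair is the combined score of the best pair
    that fits the fragment length distribution, or None if no such pair exists.
    """
    # Two ends placed independently pay the SV penalty on top of their scores.
    independent = best_end1 + best_end2 + sv_penalty
    if best_proper_pair is None or independent < best_proper_pair:
        return ("independent_ends", independent)
    return ("proper_pair", best_proper_pair)


# Proper pair scores 90, independent ends score 10 + 15 + 70 = 95.
print(choose_paired_result(90, 10, 15, 70))
# Proper pair scores 180, so the independent ends (95) are preferred.
print(choose_paired_result(180, 10, 15, 70))
```

A larger SV penalty makes the aligner more reluctant to report a pair as a structural variant.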

If anyone is interested in getting a copy for testing, you can contact me at novoalign <at> gmail ....
The beta version is for x86-64 (64-bit) Linux.

Cheers, Colin

Last edited by sparks; 06-17-2008 at 12:51 AM.
Old 06-16-2008, 07:54 AM   #2
myrna
Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 44
Default Will it be open source?

Hi Colin.
This sounds cool.
Can you just confirm for us whether or not you plan to make this aligner open source?
Old 06-16-2008, 08:10 AM   #3
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi myrna,

At the moment it's not open source but it will be free for open projects and non-profit organisations.
I might make it open source if I had some funding.

Colin

Last edited by sparks; 06-16-2008 at 08:13 AM.
Old 07-07-2008, 07:48 PM   #4
tree
Junior Member
 
Location: AZ

Join Date: May 2008
Posts: 1
Smile Novoalign test

I have done some testing of sparks’ program Novoalign.

This program seems to be incredibly fast. It requires only about 6 GB of physical RAM to align to the human genome. Using simulated reads with no mismatches, the program gives the same results as SOAP; however, Novoalign is more than 100x faster (half a million reads took just over half a minute on a 3.6 GHz CPU).

I have tested some real SOLiD reads translated to base space as well. Novoalign was again very fast relative to SOAP: 50,000 reads in 3 min. I used the trimming feature to help with alignment of reads that were mistranslated due to read errors. The uniquely mapped SOLiD reads from Novoalign and SOAP were 99.96% identical.

I would like to know whether ELAND, which is supposed to be the fastest aligner, would beat Novoalign.

Last edited by tree; 07-07-2008 at 09:40 PM.
Old 07-08-2008, 02:56 AM   #5
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

I've been using novoalign as well, and my bet is that ELAND should be faster than novoalign at its default settings because novoalign will spend a little more time looking for those extra mismatches and gaps. At a threshold of 60, novoalign should be as fast as ELAND or perhaps a bit faster. ELAND achieves better performance because it indexes reads and does a fast scan of the genome.
Perhaps somebody would be willing to try it out. Take a few million paired-end/single-end reads and see how novoalign at threshold 60 would do in comparison to ELAND on the same server specification.
Old 07-15-2008, 03:02 AM   #6
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I have just tried novo*. Wonderful software. As before, I only tried it on human chrX. It is as fast as eland. I rather believe novo* should be faster on the whole human genome, as indexing will be more efficient than on chrX.

(Sorry, I was wrong earlier, so I have removed that paragraph. Quite amazing to me. And as I was wrong, novo* looks even better.)

I think it is very important for novo* to support multithreading; otherwise parallelization would be a big problem.

Novopair does work for me and it improves overall alignment accuracy. However, novopair is overoptimistic about the alignment accuracy. The error rate of Q150 alignments is 0.05%. This error rate is good enough, but it would be better if the quality estimate were improved somewhat. This may be more of a theoretical concern.
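
To put those two numbers side by side, here is a quick conversion, assuming the qualities are Phred-scaled (Q = -10 log10 of the probability that the alignment is wrong). That scaling is the usual convention, but it is an assumption here rather than something stated about novopair.

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred-scaled quality."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred-scaled quality that corresponds to an observed error rate."""
    return -10 * math.log10(p)


# Error probability that a quality of 150 claims.
print(f"{phred_to_error(150):.1e}")
# Quality that an observed error rate of 0.05% actually corresponds to.
print(f"{error_to_phred(0.0005):.1f}")
```

Under that assumption, an observed 0.05% error rate corresponds to a quality of roughly 33, far below the reported 150.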

In all, novo* is really a good set of programs. It is fast and integrates the advantages of most existing programs. I just hope the author could get funding and make it an open source project.

PS: So far as I know, only SOLiD's own software and shrimp fully support color alignment. Maq does so partially. Neither novo* nor soap supports color alignments. Note that it is not right to do SOLiD alignment in the nucleotide space.

Last edited by lh3; 07-15-2008 at 03:41 AM.
Old 07-15-2008, 03:21 AM   #7
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

see next...

Last edited by zee; 07-15-2008 at 03:24 AM.
Old 07-15-2008, 03:23 AM   #8
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Thanks for the comments, lh3. We're working on improving accuracy. Something to be aware of with novo is the alignment threshold, the "-t" parameter. Setting this very high, e.g. -140, for a single-end alignment will report more false positives (FP). It's always tricky working out the right default threshold. Setting it too high will escalate FP, and if it's too low, e.g. > -60, then you don't pick up enough.
I think the author will be aware of these technicalities and this sort of feedback will help to improve the software. The foreseeable plans are to keep it open for just about everybody in the research community.
Old 07-15-2008, 05:31 AM   #9
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi Li Heng,
Thanks for your kind comments.
Performance slows on larger genomes as more possible alignment locations are evaluated for each read. Additional memory helps here as it makes the index more specific, and while it can run on an 8 GB RAM server (full human genome), a 16 GB or 32 GB server is going to be 4 or 5 times faster.
With regard to multithreading, the index is memory mapped and it's quite possible to run multiple copies of novoalign (same target genome) without any increase in memory. That said, multithreading wouldn't be too difficult as the search classes are all designed to handle it. I need to see if there is real demand.
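
A minimal sketch of the memory-mapping idea, assuming a read-only index file. It uses Python's standard mmap module, not Novoalign's code; the file content and the function name are stand-ins. Read-only mappings of the same file are backed by the same pages in the operating system's cache, which is why several processes can share one index.

```python
import mmap
import os
import tempfile


def open_index(path):
    """Map an index file read-only and return (file object, mapping)."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Stand-in "index" file for the demo (hypothetical content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"KMERINDEX" * 1000)
    index_path = tmp.name

# Two mappings stand in for two separate aligner processes.
f1, m1 = open_index(index_path)
f2, m2 = open_index(index_path)
head1, head2, size = m1[:9], m2[:9], len(m1)
print(head1, head2, size)

# Release the mappings before deleting the demo file.
for m, f in ((m1, f1), (m2, f2)):
    m.close()
    f.close()
os.remove(index_path)
```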
The quality calculation is similar in principle to maq's: it is the Bayesian posterior probability that the alignment is wrong. Some factors are estimated, and one possible problem is that I rate the reference genome at 2 bits of entropy per base; this may be the cause of the high qualities.
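
A simplified sketch of a posterior-based mapping quality, in the spirit of the description above. It assumes Phred-scaled alignment penalties (lower is better) and a uniform prior over the candidate locations that were found; it leaves out the estimated factors mentioned above, including the reference entropy term. The cap of 150 only mirrors the Q150 figure quoted earlier in the thread and is otherwise arbitrary.

```python
import math


def mapping_quality(penalties, cap=150):
    """Phred-scaled posterior probability that the best alignment is wrong.

    penalties: one Phred-scaled alignment penalty per candidate location.
    """
    best = min(penalties)
    # Likelihood of each candidate relative to the best one (best -> 1.0).
    rel = [10 ** (-(p - best) / 10) for p in penalties]
    total = sum(rel)
    # Posterior mass that sits on everything except one best alignment.
    p_wrong = (total - 1.0) / total
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


print(mapping_quality([20]))      # single candidate: capped
print(mapping_quality([20, 20]))  # two equally good candidates: very low
print(mapping_quality([20, 60]))  # distant runner-up: high
```

Because only observed candidates enter the sum, a read with a single hit always gets the cap, which is one way such a calculation can end up overconfident.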

I deliberately haven't done SOLiD as I'd like to do it properly or not at all. That said, if someone wants to try, I suggest converting the reference genome to colour space rather than the reads to nucleotide space.
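
For anyone who wants to try that, a minimal sketch of the reference conversion using the standard di-base colour code, where each colour is the XOR of the 2-bit codes of two adjacent bases. The function name and the use of "." for pairs that contain an ambiguous base are choices made for this example.

```python
# 2-bit base codes; the colour of a pair is the XOR of its two codes.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colour_space(seq):
    """Encode a nucleotide sequence as colour calls, one per adjacent pair."""
    seq = seq.upper()
    colours = []
    for a, b in zip(seq, seq[1:]):
        if a in BASE_BITS and b in BASE_BITS:
            colours.append(str(BASE_BITS[a] ^ BASE_BITS[b]))
        else:
            colours.append(".")  # ambiguous base, colour unknown
    return "".join(colours)


print(to_colour_space("ATGGCA"))  # prints 31031
```

A sequence of n bases yields n - 1 colours, so reference coordinates shift by one position at most.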
Old 07-15-2008, 05:42 AM   #10
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Just one more point: even though novoalign uses a k-mer index of the genome, it is not a seeded alignment à la Blast/Blat/Shrimp. It's an iterative alignment that will match the read against k-mers in the index using a combinatorial approach (with gaps).
Old 07-15-2008, 06:08 AM   #11
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

see below...

Last edited by lh3; 07-15-2008 at 06:11 AM.
Old 07-15-2008, 06:09 AM   #12
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Quote:
Originally Posted by sparks View Post
Just one more point: even though novoalign uses a k-mer index of the genome, it is not a seeded alignment à la Blast/Blat/Shrimp. It's an iterative alignment that will match the read against k-mers in the index using a combinatorial approach (with gaps).
I can now vaguely see how this might be done, but I am still keen to see the details if you publish the algorithm some day. Nice work!
Old 07-15-2008, 06:17 AM   #13
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Think blastp-type seeding with qualities replacing the BLOSUM matrix, and add gaps.
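
One possible reading of that hint, as a toy sketch. It is an interpretation only and not Novoalign's algorithm: it enumerates the neighbourhood of a read k-mer, charging each mismatch the base quality at that position, and looks the neighbours up in a k-mer index of the reference. Gaps are left out, and the brute-force enumeration is only practical for small k. All names and values are illustrative.

```python
from collections import defaultdict
from itertools import product


def build_index(ref, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def neighbourhood(kmer, quals, max_penalty):
    """Yield (neighbour, penalty) for all k-mers within max_penalty.

    A mismatch at position i costs quals[i], so low-quality bases are cheap
    to change. This plays the role a substitution matrix plays in blastp.
    """
    for cand in product("ACGT", repeat=len(kmer)):
        penalty = sum(q for c, r, q in zip(cand, kmer, quals) if c != r)
        if penalty <= max_penalty:
            yield "".join(cand), penalty


def seed_hits(kmer, quals, index, max_penalty):
    """Reference positions of all neighbours, with their mismatch penalties."""
    hits = []
    for neighbour, penalty in neighbourhood(kmer, quals, max_penalty):
        for pos in index.get(neighbour, []):
            hits.append((pos, penalty))
    return sorted(hits)


ref = "ACGTACGTTAGC"
index = build_index(ref, 4)
# The last base has quality 5, so changing it is affordable under a budget of 10.
print(seed_hits("ACGA", [30, 30, 30, 5], index, 10))  # prints [(0, 5), (4, 5)]
```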
Old 07-15-2008, 06:30 AM   #14
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

I've been back and looked at our error rate on simulated reads and it's typically around 0.005% without selecting for quality. We've used maq simulate, modified to insert longer indels, and paf_utils (great tools), but we also had to modify this to allow a few extra bases of uncertainty in alignment location, as the novo aligners are much more likely to add a few gaps into an alignment than perhaps maq is.
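
The position tolerance described above can be sketched like this. The data layout, the function name, and the default tolerance of 5 bases are assumptions for illustration; this is not how paf_utils is implemented.

```python
def error_rate(alignments, truth, tolerance=5):
    """Fraction of aligned reads placed more than `tolerance` bases from truth.

    alignments and truth both map read name -> (chromosome, position).
    """
    wrong = 0
    for name, (chrom, pos) in alignments.items():
        true_chrom, true_pos = truth[name]
        # A small offset is tolerated because gaps near a read end can
        # shift the reported start without the alignment being wrong.
        if chrom != true_chrom or abs(pos - true_pos) > tolerance:
            wrong += 1
    return wrong / len(alignments) if alignments else 0.0


truth = {"r1": ("chrX", 1000), "r2": ("chrX", 5000), "r3": ("chrX", 9000)}
aligned = {"r1": ("chrX", 1003), "r2": ("chrX", 5000), "r3": ("chrX", 2000)}
print(error_rate(aligned, truth))
```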
Old 07-30-2008, 11:49 PM   #15
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi all,
I've just put up an update to novoalign & novopaired. This update improves quality scores for novopaired and also fixes an illegal instruction fault reported by one user.
You can download it at www.novocraft.com
I've also changed the license terms so it's free for any non-profit even if you don't publish in open journals.
Colin
Old 08-11-2008, 01:19 PM   #16
myrna
Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 44
Default Novoalign update?

Hi Colin.
I have been working with Novoalign a bit and am finding it useful in picking up indels and SNPs missed by other aligners. I am wondering if it can also pick up structural aberrations that I have missed using other approaches. Is there an update on the timelines for the following features, mentioned in the documentation:

"novostruct Uses paired end alignments to identify locations where the individual being sequenced is structurally different to the
reference sequences. This could be inter sequence variations such as large insertions, deletions and inversions or inter sequence variations.

Jul'08

novoasm Using results from novoalign and novopair calls SNPs and short indels.
ACE format output is provided for viewing of alignments.

Aug '08

novodensity Read density analysis for copy number, expression level and, peak detection.

Aug '08"

?

Thanks,

Ryan

Last edited by myrna; 08-11-2008 at 01:39 PM.
Old 08-11-2008, 07:16 PM   #17
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Hey Myrna,

If you're interested in knowing more about what we're doing with SNP/Assembly, see http://www.novocraft.com/wiki/tiki-v...desc&forumId=1
Old 08-12-2008, 07:16 AM   #18
myrna
Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 44
Thumbs up Novocraft and Maq

Thanks for the link, this was just what I needed. I will give the Novoalign->Eland->Maq conversion a try. What do you see as the largest problem/concern caused by the loss of mapping scores in doing this conversion? Do you think there would be some way to scale the Novoalign scores to Maq's mapping quality scale such that you could include them?
Old 08-12-2008, 07:28 AM   #19
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

This is an area we're trying to perfect at the moment. Basically, you should know that novoalign mapping quality scores are meant to be as close to maq mapping qualities as we can get. Therefore scaling may not be necessary if we can show that low-quality novoalign mapping qualities are the same as those for maq, and vice versa.
The .map file is the key here because it contains this information, and we're neglecting it by using the eland format. Therefore it's crucial for us to go from the text format in novoalign to the maq format whilst keeping all that useful information.
The good news is that because we're mapping more with novoalign you have more SNPs being called. We hope to have this format conversion with quality scores ready by next week.
Perhaps you can send me a private message and I can provide you with some charts showing how these mapping qualities compare between novoalign and maq?
Old 08-12-2008, 04:23 PM   #20
myrna
Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 44
Default novoalign2maq

I would think that using the export file format as an intermediate (instead of the eland format) would allow you to get around the base (and mapping) quality issue. Heng Li, have you (or anyone else) attempted to convert novo* outputs into native Maq alignment files?

All times are GMT -8.