**NicoBxl** · 08-26-2010, 05:36 AM

They're faster than blast for short sequences (optimized for CPU and memory use).

**ffinkernagel** · 08-26-2010, 06:07 AM

Also, they're more sensitive.
Blast typically needs a number of 'high scoring segment pairs' to even start considering an alignment.

**Zigster** · 08-26-2010, 06:49 AM

Blast is just too slow - 100 million reads against a big genome would take days even on a large cluster.

Blat is fine for 454 reads.

**malachig** · 08-26-2010, 10:36 AM

blastn can be sensitive for DNA alignments if the right parameters are chosen (a small word size in particular). It can find an alignment of a 42-mer with multiple mismatches AND gaps. For example, using blastn with a word size of 11 to align 42-mers to a database of all human transcripts finds alignments with up to 6 mismatches and 2 gaps. Some next-gen aligners have arbitrary limits on the number of mismatches in a single read. Furthermore, some next-gen aligners will fail to find an alignment if a mismatch or gap (or more than one of these) occurs within the beginning of the read, as this portion is used as a seed.

Another advantage of blast is that all alignments are returned. If a read has 1000 alignments, 1000 alignments are reported.

Another advantage is the ability to perform sub-string alignments. If the first or last base positions of the reads in an Illumina run have very high error rates (e.g. the first three bases of many reads in a run are garbage), you may need to trim the reads to get successful alignments with some next-gen aligners. These aligners tend to be focused on aligning the entire read length. blast will find an alignment and report the positions within the read where the alignment starts and ends.

Another advantage of BLAST is a more sensible treatment of N's. Some of the next-gen aligners store bases in 2-bit format, meaning they can only internally represent A, T, C and G. Their solution is to randomly assign each N to one of the other bases, a solution that some may find imperfect.

As the other posts have indicated, all of these apparent advantages are trumped by the computational issue: BLAST is simply too slow. Speed is the main driving force behind the recent proliferation of aligners. And many of the advantages of BLAST suggested above are gradually being addressed by next-gen aligners...
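To make the 2-bit point concrete, here is a minimal Python sketch of that kind of storage. It is not taken from any particular aligner; the function names, the bit assignments and the random replacement of N are all assumptions for illustration.

```python
import random

# 2 bits per base: only four symbols fit, so N has no code of its own.
# (This particular bit assignment is an arbitrary choice for the sketch.)
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_2bit(seq, rng):
    """Pack a DNA string into an int, 2 bits per base.

    Any non-ACGT character (e.g. N) is replaced by a random base,
    which is the workaround described in the post.
    """
    packed = 0
    for base in seq.upper():
        code = CODE.get(base)
        if code is None:
            code = rng.randrange(4)  # the N is lost at this point
        packed = (packed << 2) | code
    return packed


def unpack_2bit(packed, length):
    """Recover the base string from a packed int of known length."""
    out = []
    for i in range(length):
        shift = 2 * (length - 1 - i)
        out.append(BASES[(packed >> shift) & 0b11])
    return "".join(out)
```

Round-tripping `"ACNT"` through `pack_2bit` and `unpack_2bit` gives back a four-letter string with some ordinary base in place of the N, so downstream code can no longer tell that the position was ambiguous.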

**KevinLam** · 08-26-2010, 11:41 PM

Good summary!
Might I add that some of the limitations of short-read mappers can also be addressed after mapping, for example with GATK's local realigner:

http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels

**lh3** · 08-27-2010, 08:54 AM

Blast has other problems with short reads in addition to speed. Let's take 32bp reads as a somewhat extreme example (32bp reads are rarely produced nowadays). By default, blast uses exact 11-mer hits as seeds. If two mismatches happen to occur at the 11th and the 22nd positions, no exact 11-mer is left in the read and blast will not be able to find the hit. It cannot achieve the full sensitivity of eland/maq/bwa/soap2 (by default, bowtie does not guarantee full sensitivity). Although blast can find 3-, 4- and 5-mismatch hits by chance (again, not with full sensitivity), these hits are more likely to be artifacts, especially when 2-mismatch hits are not guaranteed to be found. A slightly modified eland can also find a fraction of 3-mismatch hits.
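The seed arithmetic above can be checked with a small Python sketch. It only counts the longest mismatch-free stretch of a read; the function names are made up for illustration, and it ignores gaps and everything else a real seeding step does.

```python
def longest_exact_run(read_len, mismatch_positions):
    """Length of the longest mismatch-free stretch in a read.

    Positions are 1-based, matching "the 11th and the 22nd" in the post.
    """
    mismatches = set(mismatch_positions)
    longest = 0
    run = 0
    for pos in range(1, read_len + 1):
        if pos in mismatches:
            run = 0
        else:
            run += 1
            longest = max(longest, run)
    return longest


def has_exact_seed(read_len, mismatch_positions, word_size=11):
    """True if at least one exact word of `word_size` survives the mismatches."""
    return longest_exact_run(read_len, mismatch_positions) >= word_size


# The case from the post: 32 bp read, mismatches at positions 11 and 22.
print(longest_exact_run(32, [11, 22]))  # 10
print(has_exact_seed(32, [11, 22]))     # False
```

The two mismatches cut the read into three exact stretches of 10 bases each, one short of the 11 needed for a seed. Move either mismatch by a single position and an 11-mer seed reappears, which is why such hits are found only "by chance".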

Another problem with blast lies right in its local alignment. Suppose a true mutation occurs at the 4th bp of a read. Blast will trim off the first 4bp in the alignment (by default, match=1 and mismatch=-3, so the three matching bases before the mismatch cannot outweigh it). Then you will see more reference bases mapped than alternate bases at that site. This is reference bias. Although global-local alignment as in eland has other problems (e.g. unalignable indels), it is less affected by this bias.
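A rough Python sketch of why the first 4bp get trimmed under match=1, mismatch=-3. It is a simplified, ungapped, one-directional extension, not blast's actual extension code; the function name and the rule that ties keep the shorter alignment are assumptions of the sketch.

```python
def clipped_prefix(read, ref, match=1, mismatch=-3):
    """Number of leading read bases that a local alignment leaves out.

    Extends leftward from the last base of the read and remembers the
    start position with the best score. Ties keep the shorter alignment
    (an assumption of this sketch).
    """
    best_score = 0
    best_start = len(read)  # 0-based index where the kept alignment starts
    score = 0
    for i in range(len(read) - 1, -1, -1):
        score += match if read[i] == ref[i] else mismatch
        if score > best_score:
            best_score = score
            best_start = i
    return best_start


ref = "ACGT" * 8                # 32 bp reference stretch (made-up sequence)
read = ref[:3] + "A" + ref[4:]  # true variant at the 4th bp (T -> A)

print(clipped_prefix(read, ref))               # 4
print(clipped_prefix(read, ref, mismatch=-1))  # 0
```

Extending through the variant costs 3 and the three matches before it earn only 3 back, so the score never improves and the first 4 bases, including the variant, are left out. With a milder penalty (the `mismatch=-1` call) the full read is kept and the alternate base is reported.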

Both problems are greatly alleviated by longer reads. For 100bp reads, I would guess the problems above are minor, but for 32bp reads, those short read aligners are better in almost all ways (faster, more sensitive and less biased). As to N, capable aligners (e.g. novoalign) do not have any problem with it. They may even take advantage of ambiguous bases like R. I do not know whether blast does.

If we build an index for the genome, the main inefficiency of blast comes from the fact that it loads only ONE read into memory, scans through the whole genome and then outputs the result. Most of the scan is purely a waste of time. A better way to use blast is to concatenate multiple short sequences into one. Speed can be dramatically improved, although it is still much slower than modern aligners. I think the blast group has already picked up this trick in blast+.
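The cost argument can be illustrated with a toy Python sketch that looks up exact words from many reads during a single pass over the genome, instead of one pass per read. This is not how blast or blast+ is implemented; the function names and the dictionary-based word index are assumptions made for illustration.

```python
from collections import defaultdict


def index_read_words(reads, word_size):
    """Map each word (k-mer) to the (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - word_size + 1):
            index[read[offset:offset + word_size]].append((read_id, offset))
    return index


def scan_once(genome, reads, word_size=11):
    """Find exact word hits for ALL reads in one pass over the genome.

    Returns (read_id, read_offset, genome_pos) tuples, i.e. candidate seeds.
    """
    index = index_read_words(reads, word_size)
    hits = []
    for pos in range(len(genome) - word_size + 1):
        word = genome[pos:pos + word_size]
        for read_id, offset in index.get(word, ()):
            hits.append((read_id, offset, pos))
    return hits
```

For example, `scan_once("AAAACGTAAAA", ["CGT", "TTT"], word_size=3)` walks the genome once and reports a seed for the first read only. The scan cost is paid once per batch rather than once per read, which is the same saving the concatenation trick aims for.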

Why do we use mapping programs instead of blast for mapping to a reference?
