Uninterpretable behavior of Hisat2 with '-k' option

sdriscoll

I like code

Join Date: Sep 2009

Posts: 436
- Share
- Tweet
#1

Uninterpretable behavior of Hisat2 with '-k' option

04-26-2017, 10:42 PM

I just wanted to put this out here. I'm doing some work with simulated reads and it happens to matter when a read doesn't align which led me to look closely at the output of Hisat2 today. First off, the '-k' parameter. The program's command line help defines it as:

-k <int> (default: 5) report up to <int> alns per read

and the webpage as:

It searches for at most <int> distinct, primary alignments for each read. Primary alignments mean alignments whose alignment score is equal or higher than any other alignments. The search terminates when it can't find more distinct valid alignments, or when it finds <int>, whichever happens first.

Additonally..

For reads that have more than <int> distinct, valid alignments, hisat2 does not guarantee that the <int> alignments reported are the best possible in terms of alignment score.

If interpreted literally this means by setting -k we will limit the search for valid, distinct alignments to this number. It also means that hisat2 will report <int> alignments for reads that have more possible alignments than <int>, though it would not know the details of those alignments since we have limited the search.

Seems straightforward, right? This appears to be the only reporting option which leaves me desiring the default option in bowtie2 which will discover multiple alignments and report the BEST - even when there are multiple equally good BEST in which case it assigns an appropriate MAPQ of 0 and picks one alignment to report. Anyways...

Here's a little experiment I did with the '-k' parameter which left me baffled. Starting from -k 1

Code:

hisat2 -x <ref> -U <reads> -k 1

on a set of 876 reads I get this result:

Code:

876 reads; of these: 876 (100.00%) were unpaired; of these: 560 (63.93%) aligned 0 times 313 (35.73%) aligned exactly 1 time 3 (0.34%) aligned >1 times 36.07% overall alignment rate

Now '-k 5'

Code:

876 reads; of these: 876 (100.00%) were unpaired; of these: 589 (67.24%) aligned 0 times 260 (29.68%) aligned exactly 1 time 27 (3.08%) aligned >1 times 32.76% overall alignment rate

'-k 20'

Code:

876 reads; of these: 876 (100.00%) were unpaired; of these: 85 (9.70%) aligned 0 times 267 (30.48%) aligned exactly 1 time 524 (59.82%) aligned >1 times 90.30% overall alignment rate

'-k 200'

Code:

876 reads; of these: 876 (100.00%) were unpaired; of these: 54 (6.16%) aligned 0 times 267 (30.48%) aligned exactly 1 time 555 (63.36%) aligned >1 times 93.84% overall alignment rate

There are several problems here. One is maybe obvious. Why does the '-k' setting seem to change how many reads 'aligned 0 times'? According to the documentation if a read aligns more times than the value of '-k' you'll only get '-k' alignments along with no guarantee that they will be the best possible alignments. Related is the report of how many reads "aligned >1 times". If I have '-k 5' or '-k 20', according to the documentation, I should see the exact same number of reads that map ">1 times".

The second problem is only in one place. Notice that when I went from '-k 1' to '-k 5' the number of reads that aligned "exactly 1 time" when down from 313 to 260. I'm sorry but that makes no sense. I can forgive this, however, since with the search supposedly terminating after finding a single read how could the program know if a read aligns exactly one time. So in this case it's the language of the STDERR message that's misleading and should be changed. The other thing is when I went from '-k 5' to '-k 20' the number of reads that aligned "exactly 1 time" increased by 7. How is that possible?

Why when going from '-k 1' to '-k 5' did the total alignment rate go DOWN? That makes absolutely zero sense.

The final one is maybe picky but...if I have '-k 1', meaning it will stop searching for alignments after 1 alignment is found, how could it possibly have found 3 reads that align more than 1 time?

There is something wrong here...do you agree?

It does not seem to me that the '-k' setting should change the number of reads that can align. If a read can align 1000 times and I have -k set to 5 I should still get 5 alignments for that read because the docs say I'm only changing how much searching the program will do not that it will make decisions to report reads or not based on their number of alignments. It's also mind boggling how a read could change from aligning more than 1 time to aligning exactly 1 time as in the change from '-k 5' to '-k 20'. Unless those 7 reads were ones that previously were unaligned. But that's maybe worse. Why would a read that aligns "exactly 1 time" have NOT been aligned when the program was allowed to search for up to 5 alignments? Why would changing that to 20 make any difference at all for a read that aligns exactly 1 time?

Let me know what you think and if you've run into anything like this with Hisat2 before.

p.s.
Bowtie2, with default options gives me this result:

Code:

876 reads; of these: 876 (100.00%) were unpaired; of these: 37 (4.22%) aligned 0 times 233 (26.60%) aligned exactly 1 time 606 (69.18%) aligned >1 times 95.78% overall alignment rate

THAT'S what I like to see. This gives me knowledge of how many reads could align and within the alignments I know which ones were more ambiguous than others. The only issue is I'm losing out on spliced alignments which is why I need something like Hisat2 to work in this exact same way. I assumed Hisat2's default behavior was like that of Bowtie2 but...

/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Tags: None

Previous template Next

Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM
Strategies for Sequencing Challenging Samples

by seqadmin

Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
- Channel: Articles
03-22-2024, 06:39 AM

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Uninterpretable behavior of Hisat2 with '-k' option

Latest Articles

ad_right_rmr

News