Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Uninterpretable behavior of Hisat2 with '-k' option

    I just wanted to put this out here. I'm doing some work with simulated reads and it happens to matter when a read doesn't align which led me to look closely at the output of Hisat2 today. First off, the '-k' parameter. The program's command line help defines it as:
    -k <int> (default: 5) report up to <int> alns per read
    and the webpage as:
    It searches for at most <int> distinct, primary alignments for each read. Primary alignments mean alignments whose alignment score is equal or higher than any other alignments. The search terminates when it can't find more distinct valid alignments, or when it finds <int>, whichever happens first.
    Additonally..
    For reads that have more than <int> distinct, valid alignments, hisat2 does not guarantee that the <int> alignments reported are the best possible in terms of alignment score.
    If interpreted literally this means by setting -k we will limit the search for valid, distinct alignments to this number. It also means that hisat2 will report <int> alignments for reads that have more possible alignments than <int>, though it would not know the details of those alignments since we have limited the search.

    Seems straightforward, right? This appears to be the only reporting option which leaves me desiring the default option in bowtie2 which will discover multiple alignments and report the BEST - even when there are multiple equally good BEST in which case it assigns an appropriate MAPQ of 0 and picks one alignment to report. Anyways...

    Here's a little experiment I did with the '-k' parameter which left me baffled. Starting from -k 1
    Code:
    hisat2 -x <ref> -U <reads> -k 1
    on a set of 876 reads I get this result:
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        560 (63.93%) aligned 0 times
        313 (35.73%) aligned exactly 1 time
        3 (0.34%) aligned >1 times
    36.07% overall alignment rate
    Now '-k 5'
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        589 (67.24%) aligned 0 times
        260 (29.68%) aligned exactly 1 time
        27 (3.08%) aligned >1 times
    32.76% overall alignment rate
    '-k 20'
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        85 (9.70%) aligned 0 times
        267 (30.48%) aligned exactly 1 time
        524 (59.82%) aligned >1 times
    90.30% overall alignment rate
    '-k 200'
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        54 (6.16%) aligned 0 times
        267 (30.48%) aligned exactly 1 time
        555 (63.36%) aligned >1 times
    93.84% overall alignment rate
    There are several problems here. One is maybe obvious. Why does the '-k' setting seem to change how many reads 'aligned 0 times'? According to the documentation if a read aligns more times than the value of '-k' you'll only get '-k' alignments along with no guarantee that they will be the best possible alignments. Related is the report of how many reads "aligned >1 times". If I have '-k 5' or '-k 20', according to the documentation, I should see the exact same number of reads that map ">1 times".

    The second problem is only in one place. Notice that when I went from '-k 1' to '-k 5' the number of reads that aligned "exactly 1 time" when down from 313 to 260. I'm sorry but that makes no sense. I can forgive this, however, since with the search supposedly terminating after finding a single read how could the program know if a read aligns exactly one time. So in this case it's the language of the STDERR message that's misleading and should be changed. The other thing is when I went from '-k 5' to '-k 20' the number of reads that aligned "exactly 1 time" increased by 7. How is that possible?

    Why when going from '-k 1' to '-k 5' did the total alignment rate go DOWN? That makes absolutely zero sense.

    The final one is maybe picky but...if I have '-k 1', meaning it will stop searching for alignments after 1 alignment is found, how could it possibly have found 3 reads that align more than 1 time?

    There is something wrong here...do you agree?

    It does not seem to me that the '-k' setting should change the number of reads that can align. If a read can align 1000 times and I have -k set to 5 I should still get 5 alignments for that read because the docs say I'm only changing how much searching the program will do not that it will make decisions to report reads or not based on their number of alignments. It's also mind boggling how a read could change from aligning more than 1 time to aligning exactly 1 time as in the change from '-k 5' to '-k 20'. Unless those 7 reads were ones that previously were unaligned. But that's maybe worse. Why would a read that aligns "exactly 1 time" have NOT been aligned when the program was allowed to search for up to 5 alignments? Why would changing that to 20 make any difference at all for a read that aligns exactly 1 time?

    Let me know what you think and if you've run into anything like this with Hisat2 before.

    p.s.
    Bowtie2, with default options gives me this result:
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        37 (4.22%) aligned 0 times
        233 (26.60%) aligned exactly 1 time
        606 (69.18%) aligned >1 times
    95.78% overall alignment rate
    THAT'S what I like to see. This gives me knowledge of how many reads could align and within the alignments I know which ones were more ambiguous than others. The only issue is I'm losing out on spliced alignments which is why I need something like Hisat2 to work in this exact same way. I assumed Hisat2's default behavior was like that of Bowtie2 but...
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

Latest Articles

Collapse

  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM
  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
31 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
32 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-04-2024, 09:00 AM
0 responses
53 views
0 likes
Last Post seqadmin  
Working...
X