Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Uninterpretable behavior of Hisat2 with '-k' option

    I just wanted to put this out here. I'm doing some work with simulated reads and it happens to matter when a read doesn't align which led me to look closely at the output of Hisat2 today. First off, the '-k' parameter. The program's command line help defines it as:
    -k <int> (default: 5) report up to <int> alns per read
    and the webpage as:
    It searches for at most <int> distinct, primary alignments for each read. Primary alignments mean alignments whose alignment score is equal or higher than any other alignments. The search terminates when it can't find more distinct valid alignments, or when it finds <int>, whichever happens first.
    Additonally..
    For reads that have more than <int> distinct, valid alignments, hisat2 does not guarantee that the <int> alignments reported are the best possible in terms of alignment score.
    If interpreted literally this means by setting -k we will limit the search for valid, distinct alignments to this number. It also means that hisat2 will report <int> alignments for reads that have more possible alignments than <int>, though it would not know the details of those alignments since we have limited the search.

    Seems straightforward, right? This appears to be the only reporting option which leaves me desiring the default option in bowtie2 which will discover multiple alignments and report the BEST - even when there are multiple equally good BEST in which case it assigns an appropriate MAPQ of 0 and picks one alignment to report. Anyways...

    Here's a little experiment I did with the '-k' parameter which left me baffled. Starting from -k 1
    Code:
    hisat2 -x <ref> -U <reads> -k 1
    on a set of 876 reads I get this result:
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        560 (63.93%) aligned 0 times
        313 (35.73%) aligned exactly 1 time
        3 (0.34%) aligned >1 times
    36.07% overall alignment rate
    Now '-k 5'
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        589 (67.24%) aligned 0 times
        260 (29.68%) aligned exactly 1 time
        27 (3.08%) aligned >1 times
    32.76% overall alignment rate
    '-k 20'
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        85 (9.70%) aligned 0 times
        267 (30.48%) aligned exactly 1 time
        524 (59.82%) aligned >1 times
    90.30% overall alignment rate
    '-k 200'
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        54 (6.16%) aligned 0 times
        267 (30.48%) aligned exactly 1 time
        555 (63.36%) aligned >1 times
    93.84% overall alignment rate
    There are several problems here. One is maybe obvious. Why does the '-k' setting seem to change how many reads 'aligned 0 times'? According to the documentation if a read aligns more times than the value of '-k' you'll only get '-k' alignments along with no guarantee that they will be the best possible alignments. Related is the report of how many reads "aligned >1 times". If I have '-k 5' or '-k 20', according to the documentation, I should see the exact same number of reads that map ">1 times".

    The second problem is only in one place. Notice that when I went from '-k 1' to '-k 5' the number of reads that aligned "exactly 1 time" when down from 313 to 260. I'm sorry but that makes no sense. I can forgive this, however, since with the search supposedly terminating after finding a single read how could the program know if a read aligns exactly one time. So in this case it's the language of the STDERR message that's misleading and should be changed. The other thing is when I went from '-k 5' to '-k 20' the number of reads that aligned "exactly 1 time" increased by 7. How is that possible?

    Why when going from '-k 1' to '-k 5' did the total alignment rate go DOWN? That makes absolutely zero sense.

    The final one is maybe picky but...if I have '-k 1', meaning it will stop searching for alignments after 1 alignment is found, how could it possibly have found 3 reads that align more than 1 time?

    There is something wrong here...do you agree?

    It does not seem to me that the '-k' setting should change the number of reads that can align. If a read can align 1000 times and I have -k set to 5 I should still get 5 alignments for that read because the docs say I'm only changing how much searching the program will do not that it will make decisions to report reads or not based on their number of alignments. It's also mind boggling how a read could change from aligning more than 1 time to aligning exactly 1 time as in the change from '-k 5' to '-k 20'. Unless those 7 reads were ones that previously were unaligned. But that's maybe worse. Why would a read that aligns "exactly 1 time" have NOT been aligned when the program was allowed to search for up to 5 alignments? Why would changing that to 20 make any difference at all for a read that aligns exactly 1 time?

    Let me know what you think and if you've run into anything like this with Hisat2 before.

    p.s.
    Bowtie2, with default options gives me this result:
    Code:
    876 reads; of these:
      876 (100.00%) were unpaired; of these:
        37 (4.22%) aligned 0 times
        233 (26.60%) aligned exactly 1 time
        606 (69.18%) aligned >1 times
    95.78% overall alignment rate
    THAT'S what I like to see. This gives me knowledge of how many reads could align and within the alignments I know which ones were more ambiguous than others. The only issue is I'm losing out on spliced alignments which is why I need something like Hisat2 to work in this exact same way. I assumed Hisat2's default behavior was like that of Bowtie2 but...
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
10 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
9 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
51 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
67 views
0 likes
Last Post seqadmin  
Working...
X