I've been puzzling over this issue for a few days, and I'm hoping that someone here can help me out. I have a molecular barcode deduplication process that obtains a reference contig alignment for a set of input reads, and then aligns the input reads back to the contig to determine snp/indel information. I'm using bwa mem to align my reads back to my contig sequence. I'm using the -a option to bwa to provide ALL alignments back to the reference sequences, since many of my reference sequences are very similar to each other (some are even identical).
What I've found is that bwa does NOT return all alignments when -a is specified - there's some sort of internal logic that is preventing the alignment I'm looking for from appearing in the output. Here's some sample data that reproduces my problem:
reference sequences:
read to align to references:
Now, note that the last reference sequence is the one I want my read to align to. I indexed the reference fasta with bwa index and made a simple bwa mem call with -a as the only flag. While there are 11 reference sequences, only 10 alignments are reported out. If I delete one of the reference sequences and re-align, the alignment I'm looking for appears.
Can anyone explain why this is happening? I was even getting different results depending on the order of the sequences in the reference.
What I've found is that bwa does NOT return all alignments when -a is specified - there's some sort of internal logic that is preventing the alignment I'm looking for from appearing in the output. Here's some sample data that reproduces my problem:
reference sequences:
Code:
>GCATACAGATGATGCCCTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >GCAAACAGATGATGACCTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >AAACAGAAGGTCAAAGCTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGTCAAAGCTGTTGATGTCCCAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >ACAAAGCCAGAAGATGACCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >ACTAACAGATGATGCCCTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >TAACAGAAGGTCAAAGCTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGTCAAAGCTGTTGATGTCCCAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >TCGCACCTCACTGGTTAACAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCACTGGTCAAAGCTGTTGATGTCCCAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >TAACAGTAGGTCAACGCTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGTCAAAGCTGTTGATGTCCCAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >GACATGAACACAGCTCCACAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCACCCCCACTGGTCAAAGCTGTTGATGTCCCAGATGATGCCCTGAACGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >GGCCGGACCTTGTGTTCATGCGGGCAAGAGTCCAGA NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGGTTGGGCGATTTCCTTCAAAGACCTGACCTTATGTGGCAGCAGCCTCTCAAGGTCCTCTGGACTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >TGGCACCTCGAAGATCTTCAACTCTATTGTGTTCAC NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGAAGATCTTGTGCTCATACATGGCCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Code:
@NS500180:28:H0YNMBGXX:2:22110:22781:12271_TGGCACCTCGAAGATCTTCAACTCTATTGTGTTCAC CGAAGATCTTGTGCTCATACATGGCGACCAAGGCTCCAAGCATGAATGGTGTGAGCTTGGTGAACACAATAGAGTTG + FFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFF<FFFFFFFFFFFFFFF7FFFFFFFFF<FFFFFFAFFFFFF
Can anyone explain why this is happening? I was even getting different results depending on the order of the sequences in the reference.
Comment