I have received a project and could use some advice.
My task is to find sequence matches in mRNA databases across 20+ taxa. I was presented an EPA memorandum that outlines a method I was requested to replicate. I am not sure it is the best method and the memorandum didn't go into much detail. I played around a bit and could use a bit of advice.
Setup:
I have an unspecified number (I haven't been told yet) of dsRNA segments approximately 300bp in length. I need to match these to the human transcriptome as well as over 20 other taxa.
Criteria:
For each 300bp dsRNA, I am to find mRNA in the taxa that have 14 or more matches within a 21bp window. Then I sort the data by taxon, transcripts matched, and the annotation for the matched mRNA.
Approach (this is where I have questions):
The EPA memorandum says the Burrows-Wheeler Aligner (BWA) was used to align a 21-mer sliding window along the target transcriptomes to look for matches of 14 or greater within the window. The PI said to create all 21-mers using a sliding window along the dsRNA sequence. Easy enough.
Here are my questions:
Sample Execution:
./bwa mem -k 5 -B 1 -O2 -T 5 ../ncbi_dataset/GCF000001405.40.rna.fna ../seqA.fasta | gzip -3 > ../bwa_results/aln_seqA.sam.gz
Bitwise Flag == 0?
seqA_332_353 16 XM_011510229.4 7276 0 7S14M * 0 0 TGATCGGTGTAAATCCCATAT * NM:i:0 MD:Z:14 AS:i:14 XS:i:14
seqA_333_354 0 XM_017008212.3 1680 0 7S14M * 0 0 TATGGGATTTACACCGATCAA * NM:i:0 MD:Z:14 AS:i:14 XS:i:13
seqA_334_355 0 XM_017008212.3 1680 0 6S15M * 0 0 ATGGGATTTACACCGATCAAC * NM:i:1 MD:Z:14A0 AS:i:14 XS:i:13
seqA_335_356 0 XM_017008212.3 1680 0 5S14M2S * 0 0 TGGGATTTACACCGATCAACT * NM:i:0 MD:Z:14 AS:i:14 XS:i:13
My task is to find sequence matches in mRNA databases across 20+ taxa. I was presented an EPA memorandum that outlines a method I was requested to replicate. I am not sure it is the best method and the memorandum didn't go into much detail. I played around a bit and could use a bit of advice.
Setup:
I have an unspecified number (I haven't been told yet) of dsRNA segments approximately 300bp in length. I need to match these to the human transcriptome as well as over 20 other taxa.
Criteria:
For each 300bp dsRNA, I am to find mRNA in the taxa that have 14 or more matches within a 21bp window. Then I sort the data by taxon, transcripts matched, and the annotation for the matched mRNA.
Approach (this is where I have questions):
The EPA memorandum says the Burrows-Wheeler Aligner (BWA) was used to align a 21-mer sliding window along the target transcriptomes to look for matches of 14 or greater within the window. The PI said to create all 21-mers using a sliding window along the dsRNA sequence. Easy enough.
Here are my questions:
- Is BWA the best approach to use? I've never used BWA MEM for anything so small. Is there a better approach?
- How should I set the parameters for the BWA for this case? The defaults are inadequate, but I'm just taking stabs in the dark to see what falls out. So far, I have adjusted:
- Minimum seed length (-k) down to 3
- band width (-w) down to 7
- ignore alignment scores lower than (-T) range from 1 to 21
- gap open penalty (-O) between 1 and 6
- mismatch penalty (-B) between 1 and 4
- Why do I see a Bitwise Flag of 0? In adjusting the parameters, the resulting SAM will contain matches where the Bitwise Flag is 0. This seems like nonsense to me, suggesting that I may be on the wrong track.
Sample Execution:
./bwa mem -k 5 -B 1 -O2 -T 5 ../ncbi_dataset/GCF000001405.40.rna.fna ../seqA.fasta | gzip -3 > ../bwa_results/aln_seqA.sam.gz
Bitwise Flag == 0?
seqA_332_353 16 XM_011510229.4 7276 0 7S14M * 0 0 TGATCGGTGTAAATCCCATAT * NM:i:0 MD:Z:14 AS:i:14 XS:i:14
seqA_333_354 0 XM_017008212.3 1680 0 7S14M * 0 0 TATGGGATTTACACCGATCAA * NM:i:0 MD:Z:14 AS:i:14 XS:i:13
seqA_334_355 0 XM_017008212.3 1680 0 6S15M * 0 0 ATGGGATTTACACCGATCAAC * NM:i:1 MD:Z:14A0 AS:i:14 XS:i:13
seqA_335_356 0 XM_017008212.3 1680 0 5S14M2S * 0 0 TGGGATTTACACCGATCAACT * NM:i:0 MD:Z:14 AS:i:14 XS:i:13