SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Wrong bwa alignment? (http://seqanswers.com/forums/showthread.php?t=91470)

radood 10-24-2019 08:19 AM

Wrong bwa alignment?
 
Hi,
Below is a supplementary alignment for one of my reads with a mapping quality score of 43.This should map to chr2: 29446797 in hg19

03012p9SX:030B 2064 1 29446797 43 36M241H -1 -1 36 CCCCACGTGAACGAGGGAGGGAGGGAGGGTTGGGTG array('B', [32, 37, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 37, 41, 41, 41, 41, 41, 41, 41, 41, 37, 41, 41]) [('SA', 'chr2,42479618,+,238M39S,60,2;'), ('MD', '26A9'), ('RG', 'SAMPLE.sample23'), ('NM', 1), ('AS', 31), ('XS', 23), ('fm', '3_0')]

Since the read is on the negative strand, below is the forward sequence:
CACCCAACCCTCCCTCCCTCCCTCGTTCACGTGGGG

However this sequence does not map to chr2: 29446797. I tried blasting but I think it is too short so I aligned it to the genomic sequence as you can see here
https://ibb.co/R7VHrQH
This is not a good alignment at all. Is BWA wrong? Am I missing something?
Thank you for your input in advance!

radood 10-28-2019 12:27 PM

Checking to see if folks have feedback on this. I read that "BWA actually follows the SAM spec and reports Phred scores as MAPQ values.* The calculation is based on the number of optimal (best) alignments found, as well as the number of sub-optimal alignments combined with the Phred scores of the bases which differ between the optimal and sub-optimal alignments."

Does that mean that the mapq score of a clipped read is a combination of the quality of the 'matched' alignment AND the quality of the alignment of the clipped bases? Perhaps this explains the strange finding I saw above.

If so, how does one extract the mapping quality of only the 'matched' portion of the read at this location (29446797)?

r.rosati 10-29-2019 09:06 AM

Hi! Yes you can BLAST it, but since it's a short sequence it's best if you use the `blastn` algorithm instead of the `megablast` one. You can also decrease the word size to 7 to increase sensitivity.

Here's your search, it'll be available until october 31:
https://blast.ncbi.nlm.nih.gov/Blast...ID=VGW43RPD014

According to the BLAST results, the sequence matches chr2:29446798-29446833 (hg19).

...PS anyways the CIGAR string says "36M" but the alignment to GRCh37.p13 shows one mismatch.

...PPS oh, I just noticed - I know where you got it wrong! Bit 16 in the SAM flag does not simply mean "the read is aligned to the reverse strand", it means "the read is represented as its reverse-complement because it aligned to the reverse strand". Sequences in SAM files are always represented on the forward strand. So if bit 16 is 1, as in this case, it means that "CCCC..." is ALREADY the reverse-complement of the actual read. So you don't have to reverse-complement it again. In fact, you can find your sequence in the same screenshot you posted.

radood 10-30-2019 01:37 PM

Wow I do see it now!!! Thank you so very much Rosati. Yes this is super helpful and it makes a lot of sense :) :) :)

I didn't realize that reads in the SAM file always represent the forward strand. How about cases when I check read.get_forward_sequence in pysam, sometimes the output is the reverse complement of the sequence in the read itself, like this example:

Read:
0a023638:0B0B 83 0 9999 0 144M 0 10005 144 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCC array('B', [12, 12, 41, 37, 32, 32, 32, 32, 41, 37, 37, 37, 22, 32, 27, 12, 37, 32, 27, 32, 27, 27, 41, 41, 27, 27, 32, 32, 32, 32, 32, 27, 37, 37, 41, 37, 37, 37, 41, 41, 41, 41, 41, 41, 37, 41, 41, 41, 41, 41, 41, 37, 41, 41, 37, 41, 41, 27, 37, 41, 41, 37, 41, 37, 41, 41, 41, 41, 37, 32, 41, 32, 41, 37, 32, 22, 41, 41, 41, 41, 41, 37, 41, 41, 41, 41, 37, 27, 41, 41, 27, 37, 37, 12, 41, 41, 41, 37, 41, 37, 41, 41, 41, 37, 37, 27, 41, 37, 22, 37, 37, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 37, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 37, 41, 37, 27]) [('MD', '0A143'), ('RG', 'HMMGHBBXX.lane0.2P_FMIEx_321'), ('NM', 1), ('AS', 143), ('XS', 137)]

But read.get_forward_sequence() is:
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG

Thank you so much!


All times are GMT -8. The time now is 11:28 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.