Hi Mykhaylo,

I just ran a few tests with your data, and it looks like the reason for the poor alignment rates is that your data is riddled with insertions between bases 120 and 150.

The general quality towards the 3' end is poor but not shocking (see the attached FastQC profile).

There are, however, lots and lots of insertions towards the 3' end (up to 80% for certain positions, see the attached BamQC plot), which is the reason for the poor mapping efficiency. I suspect that something weird might have happened during the run, or maybe it is just some kind of artefact due to the sequence composition? Just briefly looking over it, there are at least 10 CTTs and other CCCTTT repeats in the region in question... Alternatively, it could of course be the case that the reference genome in that very region is simply wrong.
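If you want to check the per-position insertion rate yourself without BamQC, something along these lines should give a rough picture. This is only a minimal sketch: it assumes you have already pulled the CIGAR strings out of the alignment file into a list, and the function names are made up for illustration. Positions are counted along the read as it is stored in the file, so for reverse-strand alignments they do not correspond directly to the sequencing cycle.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def insertion_positions(cigar):
    """Return the 1-based read positions covered by insertions in one CIGAR string."""
    positions = []
    read_pos = 0  # number of read bases consumed so far
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            positions.extend(range(read_pos + 1, read_pos + length + 1))
        # Only M, I, S, = and X consume bases of the read sequence.
        if op in "MIS=X":
            read_pos += length
    return positions

def insertion_rate(cigars):
    """Fraction of alignments that carry an inserted base at each read position."""
    counts = Counter()
    for cigar in cigars:
        counts.update(insertion_positions(cigar))
    total = len(cigars)
    return {pos: n / total for pos, n in sorted(counts.items())}
```

Calling `insertion_rate(["120M2I28M", "150M"])` would report that half of the alignments have an inserted base at read positions 121 and 122.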

Hard trimming the reads to 110 bp and running Bismark with its defaults (which are quite strict) already brought the mapping efficiency up to > 80%; allowing more InDels with --score_min L,0,-0.4 brought it up to almost 97%. Just allowing more mismatches on the file as you provided it (--score_min L,0,-0.6) also yielded 96% mapping efficiency.
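To give a feel for what these --score_min settings mean: in Bowtie 2's notation L,a,b is a linear function of the read length, so the minimum accepted alignment score is a + b * length (Bismark's default is L,0,-0.2). Below is a rough sketch of that arithmetic. The helper names are invented for illustration, and the mismatch estimate assumes Bowtie 2's default maximum mismatch penalty of 6 for high-quality bases, ignoring gaps and Ns, so treat the numbers as ballpark figures only.

```python
import math

def min_score(read_length, intercept=0.0, slope=-0.2):
    """Minimum alignment score for --score_min L,<intercept>,<slope>."""
    return intercept + slope * read_length

def max_mismatches(read_length, slope, mismatch_penalty=6):
    """Rough upper bound on high-quality mismatches that still pass the threshold.

    Assumes an intercept of 0 and a fixed penalty per mismatch (an assumption
    based on Bowtie 2's default maximum mismatch penalty); gaps are ignored.
    """
    budget = -min_score(read_length, 0.0, slope)
    # The small epsilon guards against floating-point rounding just below an integer.
    return math.floor((budget + 1e-9) / mismatch_penalty)

if __name__ == "__main__":
    for length, slope in [(150, -0.2), (110, -0.2), (110, -0.4), (150, -0.6)]:
        print(length, slope, min_score(length, 0.0, slope), max_mismatches(length, slope))
```

Gaps are penalised separately from mismatches (an open plus a per-base extension penalty), so a read full of insertions uses up this budget very quickly at the default setting.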

Switching tools is one thing, and that is fine (you can only hope that the data will be clipped), but you need to understand that the data provided (or potentially the genome for the region in question) is flawed.

Cheers, Felix
Attached Images
insertions.png (28.6 KB)
quals.png (72.1 KB)
