**dgscofield** · 12-15-2015, 04:59 AM

I consider the BWA parameters to be worth tweaking if that is what is required to get "reasonable" alignments of most reads, because of sequencing technology, library characteristics, etc. If mapping rates are reasonably high, then BWA is already considering several choices during alignment, and the mapping-quality value is meant to describe the probability you are seeking.

With the older aln/sampe pipeline, I wanted 'stringent' mappings like you are seeking, so I restricted seed mismatches, but I soon realised this is a biased view of stringency (reliance on a fixed seed creates biased mapping, and avoiding that is one of the advantages of mem). As you have already found by considering just these options, so many different cases could occur that you cannot address them all by tweaking parameter values to filter during mapping, and the parameter changes are likely to alter mapping in unexpected ways that introduce further complications.

Note that the presence of clips does not necessarily mean a read is poorly mapped.

If you only want uniquely-mapping reads with high mapping quality, that are properly paired with all mates mapped to the same scaffold, then standard samtools post-alignment filtering on mapping quality and on SAM flags should serve you well.

**Jane M** · 12-15-2015, 06:55 AM

Thank you for your answer.
If I do not change the alignment parameters, I can do post-alignment filtering as you suggested.

Mapping score
With samtools view -q 20

"Multireads"
Removal of reads with AS=XS (for those having mapping score>=20)
Example in one of my samples:

Code:

K00103:19:H2MFLBBXX:4:1101:3953:1598	83	chr9	67316168	24	29S46M	=	67316086	-128	GAAACATCCTTGTGAGGTGTGCACTGAAGTCACAGAGTTGAAACTGTCTTTTGATTCAGCAGTTTTGAATCTCTC	KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFAAA	NM:i:2	MD:Z:14A1A29	AS:i:36	XS:i:36	RG:Z:WES

Removal of all reads with "AS:i:X"=="XS:i:X" (command line)
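
A minimal sketch of what such a command line could look like, assuming the `AS` and `XS` tags appear as optional fields (column 12 onwards) the way BWA-MEM writes them. The function name `drop_as_eq_xs` is hypothetical. Header lines and records missing either tag are passed through unchanged.

```shell
# Hypothetical helper: drop SAM records whose AS and XS tags are equal.
# Reads SAM text on stdin and writes the surviving lines to stdout.
drop_as_eq_xs() {
  awk -F '\t' '
    /^@/ { print; next }              # keep header lines
    {
      as = ""; xs = ""
      for (i = 12; i <= NF; i++) {    # optional tags start at field 12
        if ($i ~ /^AS:i:/) as = substr($i, 6)
        else if ($i ~ /^XS:i:/) xs = substr($i, 6)
      }
      # Skip the record only when both tags exist and are numerically equal.
      if (as != "" && xs != "" && as + 0 == xs + 0) next
      print
    }'
}
```

It could be placed after the MAPQ filter, for example `samtools view -h -q 20 in.bam | drop_as_eq_xs | samtools view -b -o out.bam -`, where `in.bam` and `out.bam` are placeholder file names.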

Clipped reads
From what I understand, clipped reads come either from structural variations or from chimeric reads.
Since I won't study structural variations for the moment, I wonder if I can remove them all.
Removal of all reads containing "S" or "H" in field 6 of SAM file. (command line)
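
One possible way to write that filter, as a sketch: test the CIGAR string (field 6) for an `S` or `H` operation. The function name `drop_clipped` is hypothetical. Unmapped records, whose CIGAR is `*`, pass through, so they would need to be handled separately.

```shell
# Hypothetical helper: drop SAM records whose CIGAR (field 6) contains
# a soft-clip (S) or hard-clip (H) operation. Header lines are kept.
drop_clipped() {
  awk -F '\t' '/^@/ || $6 !~ /[SH]/'
}
```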

Pairs aligned on different locations
Removal of reads with pair on different locations in field 7 of SAM file. (command line)
Example in one of my samples:

Code:

[merlevede@login01 Sample_X]$ cat file.sam | cut -f 7 | sort | uniq -c
     27 
219456370 =
  85812 *
 116175 chr1
  80850 chr10
  73198 chr11
  74248 chr12
  37966 chr13
  47290 chr14
  47084 chr15
  52214 chr16
  55936 chr17
  31242 chr18
  50031 chr19
 128166 chr2
  32914 chr20
  20887 chr21
  22450 chr22
  96852 chr3
  86947 chr4
  81397 chr5
  80578 chr6
  78095 chr7
  65246 chr8
  63369 chr9
   3414 chrM
  39836 chrX
  23048 chrY

Keep only reads with "=" in field 7 of SAM file.
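
A sketch of that step, with the hypothetical function name `keep_same_reference_mate`. Header lines have no field 7, which may be what the 27 blank entries in the tally above correspond to, so they are passed through explicitly.

```shell
# Hypothetical helper: keep SAM records whose RNEXT (field 7) is "=",
# i.e. the mate is aligned to the same reference sequence.
# Header lines are kept so the output is still a valid SAM stream.
keep_same_reference_mate() {
  awk -F '\t' '/^@/ || $7 == "="'
}
```

Note that `=` only says the mate lies on the same reference sequence; it says nothing about orientation or insert size, which is what the proper-pair flag (0x2) reflects.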

What do you think about this strategy?
Jane

**dgscofield** · 12-15-2015, 12:17 PM

Seems reasonable if stringency is what you are after. You might be able to also make use of samtools view flags that include (-f) or exclude (-F) certain SAM flags. It's convenient to use https://broadinstitute.github.io/pic...ain-flags.html to figure out which values to use.
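
To illustrate how the include and exclude masks work, here is a sketch in plain awk that mimics `samtools view -q`, `-f` and `-F` on SAM text. The function `filter_sam` and the variables `REQUIRE` and `EXCLUDE` are hypothetical names, and the particular bits chosen are an assumption about what "stringent" means here, not a recommendation from this thread.

```shell
# Hypothetical helper: filter_sam MINQ REQUIRE EXCLUDE
# Keeps records with MAPQ >= MINQ, every REQUIRE bit set, no EXCLUDE bit set.
filter_sam() {
  awk -F '\t' -v minq="$1" -v req="$2" -v excl="$3" '
    # Bit tests use integer arithmetic so that any POSIX awk can run them.
    function any_bit(flag, mask,    bit) {
      for (bit = 1; bit <= mask; bit *= 2)
        if (int(mask / bit) % 2 && int(flag / bit) % 2) return 1
      return 0
    }
    function all_bits(flag, mask,    bit) {
      for (bit = 1; bit <= mask; bit *= 2)
        if (int(mask / bit) % 2 && !(int(flag / bit) % 2)) return 0
      return 1
    }
    /^@/ { print; next }              # keep header lines
    $5 + 0 >= minq + 0 && all_bits($2, req) && !any_bit($2, excl)
  '
}

# SAM flag bits: 0x2 properly paired, 0x4 unmapped, 0x8 mate unmapped,
# 0x100 secondary alignment, 0x800 supplementary alignment.
REQUIRE=$(( 0x2 ))
EXCLUDE=$(( 0x4 | 0x8 | 0x100 | 0x800 ))   # 2316
```

With these values, `filter_sam 20 "$REQUIRE" "$EXCLUDE" < file.sam` should select the same records as `samtools view -h -q 20 -f 2 -F 2316 file.sam`; in practice the samtools command is the one to use, and the awk version only shows what the masks do. The example read above has flag 83 (1 + 2 + 16 + 64), so it has the proper-pair bit set and none of the excluded bits.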

Clipped reads can also occur with biological causes, like indels, or from sequencing, e.g., incomplete adapter removal.

**Jane M** · 12-18-2015, 01:40 AM

Thank you for your answer.

I ran some tests to check the percentage of reads with:
- MAPQ<20: 2.5-2.7%
- either soft or hard clipping: 0.50-0.64%
- soft clipping: 0.49-0.62%
- hard clipping: 0.008-0.01%
- insertion: 0.18-0.29%
- deletion: 0.22-0.31%
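
Percentages like these could be tallied in one pass over the SAM text. The following is only a sketch: the function name `sam_stats` and its output labels are hypothetical, and it counts every alignment record (including unmapped ones) in the denominator, which may differ from how the figures above were computed.

```shell
# Hypothetical helper: sam_stats [MINQ]
# Prints the share of records below MINQ (default 20) and the share whose
# CIGAR contains soft clips, hard clips, insertions or deletions.
sam_stats() {
  awk -F '\t' -v minq="${1:-20}" '
    /^@/ { next }                     # skip header lines
    {
      n++
      if ($5 + 0 < minq + 0) lowq++
      if ($6 ~ /[SH]/) clip++
      if ($6 ~ /S/) soft++
      if ($6 ~ /H/) hard++
      if ($6 ~ /I/) ins++
      if ($6 ~ /D/) del++
    }
    END {
      if (n == 0) exit
      printf "records\t%d\n", n
      printf "mapq_below_threshold\t%.2f%%\n", 100 * lowq / n
      printf "any_clip\t%.2f%%\n", 100 * clip / n
      printf "soft_clip\t%.2f%%\n", 100 * soft / n
      printf "hard_clip\t%.2f%%\n", 100 * hard / n
      printf "insertion\t%.2f%%\n", 100 * ins / n
      printf "deletion\t%.2f%%\n", 100 * del / n
    }'
}
```

Running it once on the raw file and once on the MAPQ-filtered stream (for example `samtools view -h -q 20 in.bam | sam_stats`, with `in.bam` a placeholder) would show whether the clipping proportions change after filtering.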

The mapping quality seems to be the most important parameter. I am testing some lower MAPQ thresholds and checking whether the proportion of soft- and hard-clipped reads diminishes when low-MAPQ reads are removed.

**dgscofield** · 12-18-2015, 02:24 AM

So that shows characteristics of reads left after removing MAPQ<20? To me those look like reasonable ranges likely arising from biological and/or sequencing causes. As you decrease MAPQ I would expect relative proportions (e.g., in vs del, soft vs hard) to remain about the same.

To dig a bit deeper into clipping if you are particularly worried about that, I would check the distribution of clip lengths. I would expect them to be quite short, since this is after Trimmomatic.
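
One way to get that distribution, as a rough sketch: walk each CIGAR string and print the length of every `S` and `H` operation. The function name `clip_lengths` is hypothetical.

```shell
# Hypothetical helper: print one "OP<TAB>LENGTH" line for every soft-clip (S)
# or hard-clip (H) operation found in the CIGAR strings of a SAM stream.
clip_lengths() {
  awk -F '\t' '
    /^@/ || $6 == "*" { next }        # skip headers and records without a CIGAR
    {
      cigar = $6
      # Consume the CIGAR one operation at a time from the left.
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        op = substr(cigar, RLENGTH, 1)
        len = substr(cigar, 1, RLENGTH - 1) + 0
        if (op == "S" || op == "H") print op "\t" len
        cigar = substr(cigar, RLENGTH + 1)
      }
    }'
}
```

Piping the result through `sort -k1,1 -k2,2n | uniq -c` gives a count per operation and length. For the example read shown earlier in the thread (CIGAR `29S46M`) this reports a single soft clip of length 29.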

**Jane M** · 12-18-2015, 03:10 AM

Originally posted by dgscofield:

So that shows characteristics of reads left after removing MAPQ<20?

Those are the characteristics of the "raw" SAM file obtained after default alignment.
Currently I am removing reads with MAPQ<20 to check the other characteristics (clipping, indel and mapping). I should have the results by tomorrow for all my samples; I will let you know in case someone else is interested.

Confident mapping with BWA mem using -A, -C, -L, -U options

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News