Seqanswers Leaderboard Ad

**dariober** · 08-11-2015, 11:27 PM

Code:

samtools view -q 10 -b aln.bam > aln.uniq.bam

**Yamol** · 08-11-2015, 11:47 PM

MAPQ (mapping quality - describes the uniqueness of the alignment, 0=non-unique, >10 probably unique)

I hope it's seriously uniquely mapped read.

**dariober** · 08-12-2015, 12:07 AM

Keep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.

**Yamol** · 08-12-2015, 12:16 AM

Originally posted by dariober View Post

Keep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.

Thanks for your reply. But it seems that I can't say that the reads with high mapq are the reads only mapped to one location in the genome. In my study, I only need the reads that were just mapped to one location and my collaborators don't wanna see it elsewhere. My fastq files are pair-end ones.
I have checked many resources and I know it's hard to definite the 'unique' reads. So is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?

**dpryan** · 08-12-2015, 01:18 AM

You just have to define what you mean by "just mapped to one location". All reads/pairs are multimappers if you allow a high enough edit distance...

**Brian Bushnell** · 08-12-2015, 08:30 AM

Originally posted by Yamol View Post

So is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?

No, that will eliminate huge numbers of reads that have variations, or errors, or are mapped to somewhat repetitive areas, and generally eliminate coverage from a lot of places and give a severe ref-bias everywhere else, so the results will not be trustworthy.

If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.

**dariober** · 08-12-2015, 11:22 PM

Originally posted by Brian Bushnell View Post

No, that will eliminate huge numbers of reads that have variations, or errors, or are mapped to somewhat repetitive areas, and generally eliminate coverage from a lot of places and give a severe ref-bias everywhere else, so the results will not be trustworthy.

I agree that going above q30 is probably unnecessarily harsh. However there might be cases where you prefer to loose coverage in some regions rather than accepting uncertain read positions.

If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.

Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.

Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:

Code:

samtools view -c -q0 grm056_i1_KO_carcass.bam 18
8184447
samtools view -c -q1 grm056_i1_KO_carcass.bam 18
8039114
samtools view -c -q3 grm056_i1_KO_carcass.bam 18
7838063
samtools view -c -q5 grm056_i1_KO_carcass.bam 18
7816288
samtools view -c -q10 grm056_i1_KO_carcass.bam 18
7707885
samtools view -c -q20 grm056_i1_KO_carcass.bam 18
7635921
samtools view -c -q30 grm056_i1_KO_carcass.bam 18
7313124
samtools view -c -q40 grm056_i1_KO_carcass.bam 18
7106571

**Yamol** · 08-12-2015, 11:29 PM

Originally posted by dariober View Post

I agree that going above q30 is probably unnecessarily harsh. However there might be cases where you prefer to loose coverage in some regions rather than accepting uncertain read positions.

Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.

Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:

Code:

samtools view -c -q0 grm056_i1_KO_carcass.bam 18
8184447
samtools view -c -q1 grm056_i1_KO_carcass.bam 18
8039114
samtools view -c -q3 grm056_i1_KO_carcass.bam 18
7838063
samtools view -c -q5 grm056_i1_KO_carcass.bam 18
7816288
samtools view -c -q10 grm056_i1_KO_carcass.bam 18
7707885
samtools view -c -q20 grm056_i1_KO_carcass.bam 18
7635921
samtools view -c -q30 grm056_i1_KO_carcass.bam 18
7313124
samtools view -c -q40 grm056_i1_KO_carcass.bam 18
7106571

I think the reason why there's little difference is that your data quality and map quality are hyper good.

**Brian Bushnell** · 08-13-2015, 09:45 AM

Originally posted by dariober View Post

Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.

The question was specifically about multimapping, not mismapping. If a read has a QV>=3, the mapper thinks it has a >50% chance that the site is correct, therefore there cannot be any other sites to which the read maps equally well. It's true that you will eliminate more incorrectly mapped reads by setting the threshold higher, but there's no reason to do that if the goal is strictly to eliminate multimapped reads.

Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!

**dariober** · 08-13-2015, 11:18 PM

Originally posted by Brian Bushnell View Post

The question was specifically about multimapping, not mismapping.

Actually I interpreted the question as asking for surely mapped reads but anyway, let's leave this to the OP, hopefully s/he got the message.

Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!

I see your point and I agree that increasing mapq to much becomes detrimental, but I'm not sure that lowering the mapq threshold helps. As a thought experiment, imagine a region of 100bp with several SNP relative to the reference, say 20. In order to map a read there you need to lower the mapq score. This could mean that a read truly coming from that region could map even better somewhere else and reads truly mapping somewhere else could end up in this critical region. (I don't do SNP calling so I may get this wrong...)

**Brian Bushnell** · 08-14-2015, 09:47 AM

Originally posted by dariober View Post

As a thought experiment, imagine a region of 100bp with several SNP relative to the reference, say 20. In order to map a read there you need to lower the mapq score. This could mean that a read truly coming from that region could map even better somewhere else and reads truly mapping somewhere else could end up in this critical region.

That's true, but it should only be a problem if you use an incomplete reference. If the reference is complete the read should map to the correct location.

It's always possible to take a read from some region, apply a bunch of mutations to it, and have the resulting read map perfectly to somewhere else, while having too many differences to map to the correct location at all. These events are very unlikely and tend to be ignored; increasing the mapq threshold will not protect you from them. Of course, there's always some gray area in between. But generally, the higher the mapq threshold, the more reference-bias you will get, which greatly interferes with variant calling.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 37 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How to extract uniquely mapped reads from bam/sam produced by bwa-mem?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News