There's no 'XT:A:U' tag for bwa-mem outputs. How can I get the uniquely mapped reads?
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Keep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.
Comment
-
Originally posted by dariober View PostKeep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.
I have checked many resources and I know it's hard to definite the 'unique' reads. So is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?
Comment
-
Originally posted by Yamol View PostSo is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?
If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.
Comment
-
Originally posted by Brian Bushnell View PostNo, that will eliminate huge numbers of reads that have variations, or errors, or are mapped to somewhat repetitive areas, and generally eliminate coverage from a lot of places and give a severe ref-bias everywhere else, so the results will not be trustworthy.
If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.
Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:
Code:samtools view -c -q0 grm056_i1_KO_carcass.bam 18 8184447 samtools view -c -q1 grm056_i1_KO_carcass.bam 18 8039114 samtools view -c -q3 grm056_i1_KO_carcass.bam 18 7838063 samtools view -c -q5 grm056_i1_KO_carcass.bam 18 7816288 samtools view -c -q10 grm056_i1_KO_carcass.bam 18 7707885 samtools view -c -q20 grm056_i1_KO_carcass.bam 18 7635921 samtools view -c -q30 grm056_i1_KO_carcass.bam 18 7313124 samtools view -c -q40 grm056_i1_KO_carcass.bam 18 7106571
Comment
-
Originally posted by dariober View PostI agree that going above q30 is probably unnecessarily harsh. However there might be cases where you prefer to loose coverage in some regions rather than accepting uncertain read positions.
Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.
Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:
Code:samtools view -c -q0 grm056_i1_KO_carcass.bam 18 8184447 samtools view -c -q1 grm056_i1_KO_carcass.bam 18 8039114 samtools view -c -q3 grm056_i1_KO_carcass.bam 18 7838063 samtools view -c -q5 grm056_i1_KO_carcass.bam 18 7816288 samtools view -c -q10 grm056_i1_KO_carcass.bam 18 7707885 samtools view -c -q20 grm056_i1_KO_carcass.bam 18 7635921 samtools view -c -q30 grm056_i1_KO_carcass.bam 18 7313124 samtools view -c -q40 grm056_i1_KO_carcass.bam 18 7106571
Comment
-
Originally posted by dariober View PostIsn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.
Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!
Comment
-
Originally posted by Brian Bushnell View PostThe question was specifically about multimapping, not mismapping.
Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!
Comment
-
Originally posted by dariober View PostAs a thought experiment, imagine a region of 100bp with several SNP relative to the reference, say 20. In order to map a read there you need to lower the mapq score. This could mean that a read truly coming from that region could map even better somewhere else and reads truly mapping somewhere else could end up in this critical region.
It's always possible to take a read from some region, apply a bunch of mutations to it, and have the resulting read map perfectly to somewhere else, while having too many differences to map to the correct location at all. These events are very unlikely and tend to be ignored; increasing the mapq threshold will not protect you from them. Of course, there's always some gray area in between. But generally, the higher the mapq threshold, the more reference-bias you will get, which greatly interferes with variant calling.
Comment
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...-
Channel: Articles
Today, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
37 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
41 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
||
Started by seqadmin, 04-10-2024, 09:21 AM
|
0 responses
35 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 09:21 AM
|
||
Started by seqadmin, 04-04-2024, 09:00 AM
|
0 responses
54 views
0 likes
|
Last Post
by seqadmin
04-04-2024, 09:00 AM
|
Comment