
**Brian Bushnell** · 08-17-2017, 10:14 AM

Typically, ambiguously-mapped reads are either given a single random mapping location, or mapped to all locations via secondary alignments. Expression quantification tools understand this.

You can get rid of multimapping reads by filtering out everything with a MAPQ of 3 or less, but that will cause underrepresentation of the expression of homologous genes.
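To see which of the two behaviours an aligner used, one rough check is to count how many alignment records carry the secondary-alignment FLAG bit (0x100). The sketch below assumes SAM text on stdin (for example from `samtools view`); the function name `count_secondary` is made up for illustration.

```shell
# Count alignment records with and without the secondary-alignment flag.
# Reads SAM text on stdin; header lines (starting with @) are skipped.
count_secondary() {
    awk -F'\t' '
        /^@/ { next }
        {
            # FLAG is column 2; bit 0x100 (decimal 256) marks a secondary alignment.
            # Plain arithmetic is used so this works in any POSIX awk.
            if (int($2 / 256) % 2 == 1) secondary++
            else not_secondary++
        }
        END { printf "not_secondary\t%d\nsecondary\t%d\n", not_secondary, secondary }
    '
}
```

Typical use would be something like `samtools view your.bam | count_secondary`. A secondary count of zero suggests the aligner reported only one location per read, although it does not by itself say how that location was chosen.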

**ammoon** · 08-17-2017, 10:21 AM

thanks for the response

i tried setting the igv threshold for map quality to 50 and i am still getting reads that map to multiple places based on the "blat read sequence" readout table

some reads start in one histone gene and end in another one 90Kb away...?

how do i tell if my reads were given a single random location or mapped to all via secondary alignments?
thank you

**GenoMax** · 08-17-2017, 10:48 AM

It depends on what aligner you used and if it followed the consensus of giving a low MAPQ value to the multi-mapped reads. Brian was referring to default behavior of his BBMap aligner. You can either re-map the data (if you are not sure what your original aligner did) or use one of the suggestions in this Biostars thread to filter your BAM files.

**Brian Bushnell** · 08-17-2017, 11:38 AM

You can also look at the NH tag:

Code:

NH i Number of reported alignments that contain the query in the current record

That's an optional field, but when it is present it tells you how many alignments were reported for a read in the SAM file. There's also

Code:

H0 i Number of perfect hits

...but that is much less common.
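A quick way to see whether a file carries NH at all, and what values it takes, is to tabulate it. This is only a sketch: it assumes SAM text on stdin, and the function name `nh_histogram` is hypothetical. It counts alignment records, not distinct reads, so a read reported at three locations contributes three records with NH:i:3.

```shell
# Tabulate alignment records by NH value; records without NH are counted as "missing".
# Reads SAM text on stdin; header lines (starting with @) are skipped.
nh_histogram() {
    awk -F'\t' '
        /^@/ { next }
        {
            nh = "missing"                  # NH is optional, so it may be absent
            for (i = 12; i <= NF; i++)      # optional tags start at column 12
                if ($i ~ /^NH:i:/) { nh = substr($i, 6); break }
            count[nh]++
        }
        END { for (k in count) print k "\t" count[k] }
    ' | sort
}
```

For example, `samtools view your.bam | nh_histogram`. If every record lands in "missing", the aligner did not write the tag and MAPQ or FLAG has to be used instead.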

**aprice67** · 08-22-2017, 12:14 PM

Check out figure 12 and the area around it in this paper.

The quantitative impact of read mapping to non-native reference genomes in comparative RNA-Seq studies

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180904

Sequence read alignment to a reference genome is a fundamental step in many genomics studies. Accuracy in this fundamental step is crucial for correct interpretation of biological data. In cases where two or more closely related bacterial strains are being studied, a common approach is to simply map reads from all strains to a common reference genome, whether because there is no closed reference for some strains or for ease of comparison. The assumption is that the differences between bacterial strains are insignificant enough that the results of differential expression analysis will not be influenced by choice of reference. Genes that are common among the strains under study are used for differential expression analysis, while the remaining genes, which may fail to express in one sample or the other because they are simply absent, are analyzed separately. In this study, we investigate the practice of using a common reference in transcriptomic analysis. We analyze two multi-strain transcriptomic data sets that were initially presented in the literature as comparisons based on a common reference, but which have available closed genomic sequence for all strains, allowing a detailed examination of the impact of reference choice. We provide a method for identifying regions that are most affected by non-native alignments, leading to false positives in differential expression analysis, and perform an in depth analysis identifying the extent of expression loss. We also simulate several data sets to identify best practices for non-native reference use.

It might not be such a good idea to remove them.

**Markiyan** · 08-23-2017, 03:12 AM

Some commandline examples for mapping quality filtering.

Mapping score distribution depends on mapper program/version used.

First check the actual MAPQ score histogram for your BAM file (first 1M reads):

samtools view "$1" | head -n 1000000 | cut -f5 | sort -n | uniq -c

Then filter with the desired threshold using samtools and the -q parameter (assuming the input is test.bam):

#!/bin/sh
# Keep alignments with MAPQ >= MAP_QUAL; write the rest to a separate file (-U).
MAP_QUAL=30
STRAIN=test
samtools view -b -q "$MAP_QUAL" -o "${STRAIN}.Q${MAP_QUAL}.bam" -U "${STRAIN}.Q${MAP_QUAL}U.bam" "${STRAIN}.bam"

Reads that pass the MAPQ filter will end up in test.Q30.bam.

Reads that fail it will end up in test.Q30U.bam.
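Before writing any filtered files, it can help to see how many records a given threshold would keep. The sketch below mirrors the `-q` rule (a record passes when its MAPQ is greater than or equal to the threshold). It assumes SAM text on stdin, and the function name `mapq_split_counts` is made up for illustration.

```shell
# Count records that would pass or fail a MAPQ threshold (first argument).
# Reads SAM text on stdin; header lines (starting with @) are skipped.
mapq_split_counts() {
    awk -F'\t' -v min="$1" '
        /^@/ { next }
        {
            # MAPQ is column 5; force numeric comparison on both sides
            if ($5 + 0 >= min + 0) pass++
            else fail++
        }
        END { printf "pass\t%d\nfail\t%d\n", pass, fail }
    '
}
```

For example, `samtools view test.bam | mapq_split_counts 30` gives a rough idea of how many records would go to each output file.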

**aprice67** · 08-23-2017, 05:43 AM

Be very careful when dealing with MAPQ scores. It seems every aligner has a different way to determine them and they aren't standardized at all. See here:

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation) — ACGT

http://www.acgt.me/blog/2015/3/17/more-madness-with-mapq-scores-aka-why-bioinformaticians-hate-poor-and-incomplete-software-documentation

how to eliminate reads that map to multiple places in genome
