Seqanswers Leaderboard Ad

**thinkRNA** · 05-26-2010, 09:18 PM

Originally posted by jwfoley View Post

Don't remove duplicate reads because they're duplicates; that defeats the purpose of quantification. Filter them out by the posterior probability that their best alignment is correct. That's what the MAPQ field in SAM format is supposed to contain, although many mappers ignore the spec so you might have to calculate it yourself. If you do that, the "stacks" go away on their own.

Hi jwfoley, can you please explain how to calculate the posterior probability that the alignment is correct or not? What is the threshold to use and moreover, how does one interpret the MAPQ field in the SAM format. I will really appreciate your reply!

Thanks in advance.

**Michael.James.Clark** · 05-27-2010, 12:23 PM

Originally posted by dukevn View Post

I wonder if this should be applied for Illumina-produced data or not? Should I filter out the identical (repeated) reads before doing any analysis?

Thanks,

D.

It can definitely happen in Illumina data. PCR duplicates are an issue in both Illumina and SOLiD platforms.

That said, I think you have to consider your experiment before deciding if duplicate removal is for you.

Keep in mind that duplicates are generally thought to be due to PCR rather than independent observations of the same positions. This assumption is based on the state where coverage is much lower than read length. Consider the chances that your particular experiment will have duplicate reads by chance.

Two examples:

A 25-base single-end genomic library. Once you have 25x coverage, you're in a range of coverage where it's very likely to see duplicates that are independent observations, not due to PCR. I might apply a statistical model to the 25-base single-end library to figure out how likely duplicates are and then only filter if the number of duplicates is significantly outside that distribution.

A 50-base mate-pair genomic library. Each read here contains 100 bases and has 1200-1500 bases between each end. In this library, you can exceed 100x coverage and still be fairly confident that duplicates are generally not independent observations, but due to PCR. If I'm at 25X coverage, I feel confident that consolidating duplicate reads is not losing me a significant amount of relevant data--the vast majority of duplicates in this library are PCR duplicates.

As for the specific case of RNAseq here, I think pmiguel contributed some valuable information on how to avoid RNAseIII-related biases that's worth looking at (sorry, I do whole genome, generally, so my advice here is more general).

As for algorithms for removing duplicates, Picard MarkDuplicates is adequate and does the job thoroughly and correctly. It is pretty much "standard operating procedure" for our genomic analysis team.

Edit: Let me clarify something in case it's unclear. For the example 25-base single-end genomic library, I would not do blind duplicate removal. Even at, say, 10 or 15x coverage, there's a very good chance duplicates are independent observations in such a library, so removing all duplicates would be removing a lot of real, useful data.

**nilshomer** · 05-27-2010, 12:27 PM

Duplicate removal always reminds me of the Birthday problem. lh3 has a good white-paper on the theory behind duplicate removal.

**jwfoley** · 05-28-2010, 12:50 PM

Originally posted by thinkRNA View Post

Hi jwfoley, can you please explain how to calculate the posterior probability that the alignment is correct or not? What is the threshold to use and moreover, how does one interpret the MAPQ field in the SAM format. I will really appreciate your reply!

Thanks in advance.

Say a given tag aligns to a single site with zero mismatches, but aligns to five sites with one mismatch each. It fits the definition of a unique best hit, but there's actually a reasonable chance it came from any of those other five sites and had a single base-calling error. So the posterior probability that the best alignment is actually correct is P[zero base errors] / (P[zero base errors] + 5 * P[one base error]), if we ignore other possible alignments beyond one mismatch (and probabilities of multiple base errors become vanishingly small, so this is safe, though I usually carry it out to two mismatches since mappers tend to provide those).

The simplest way to model base errors is binomial: each base in your read has an equal, independent probability of being correct or incorrect. Then if your reads are 25 bases long and the error rate is 1%, the probability of a read with zero errors is 0.78, and the probability of a read with one error is 0.20, so for the example above, the posterior probability that the best alignment is correct would be 0.44, which is low.

Of course it would make even more sense to use base quality scores instead of treating all errors as equally likely, if you have that information available. I haven't tried that because I typically deal with Eland and Corona Lite output, which doesn't include the base qualities like SAM does. Likewise, it would be better to consider insertion and deletion errors (which BFAST provides), not just mismatches. But even using the simple binomial model, I find that a threshold of 0.9 removes the vast majority of these "stacks," at least in samples that aren't PCR-bottlenecked, because most of these are actually artifacts of mapping rather than PCR.

The SAM spec includes a "MAPQ" field that's supposed to contain a Phred-scale posterior probability score for each alignment, though it doesn't say how to calculate the score (nor should it). Assuming your mapper puts something useful in here, you can just parse it out, convert out of Phred scale, and filter on that (or convert your threshold into Phred scale). Of course, some mappers (e.g. Bowtie) essentially ignore the spec, so in that case you'll have to count multiple alignments of each read and calculate the probability yourself.

**JohnK** · 10-29-2010, 07:36 AM

why doesn't samtools rmdup fit the bill? Is this an rna-seq specific problem? Also, I didn't see any posts in regard to this: you have one tag mapping to multiple splice-junctions. Will any duplicate-removing program pick this up and/or would you keep it and consider it as a possible isoform (event) if it maps to the same gene. what if different genes? thanks.

**carmeyeii** · 10-04-2012, 01:52 PM

why doesn't samtools rmdup fit the bill? Is this an rna-seq specific problem?

I believe this is due to samtools rmdup only detecting duplicate reads within the same chromosome, not along the whole genome???

**pmiguel** · 10-05-2012, 05:41 AM

Just to reiterate: blithely reducing all reads that start at a given position in a transcript down to a single count is completely bogus. That is, if you are doing that you are presuming that the only way that a two reads map to exactly the same start is as a result of PCR duplication of a single cDNA fragment.

What about multiple original cDNA fragments that just happen to start at the same position in the transcript?

Upthread there is link to statistical methods to assess the numbers of PCR duplicates. Although an improvement over just removing all start-site-duplicates, I think there are still issues. There are probably 20 different ways to construct an RNAseq library. Each probably has a different start-site bias profile. For example, a large fraction of the methods would fragment the RNA into a useful size range. Typical RNA fragmentation methods include:

(1) Chemical -- usually a divalent cation (Mg++ or Zn++ are commonly used) + heat. Many variations as to temperature and time.

(2) Enzymatic. Sometimes highly biased enzymes are used (eg RNAseIII) or less biased (eg S1 nuclease).

(3) Physical. Some old school protocols sonicate the RNA.

Next, 1st and 2nd strand synthesis can add an another layer of bias. Infamously, Illumina TruSeq kits will show start site bias in the first 12 or so bases of a read end. (Strange because random 6-mers are used to prime synthesis of both strands.)

Sequence-specific differential adapter ligation efficiency might be another source of bias, although it appears to be minor compared to other sources of site bias. So one of the least start-site biased methods would be to construct full length double stranded cDNA followed by mechanical fragmentation (sonication). More arduous to construct libraries this way, however.

--
Phillip

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, Today, 06:35 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, Yesterday, 02:46 PM	0 responses 18 views 0 likes	Last Post by seqadmin Yesterday, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 17 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 18 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News