
**Simon Anders** · 04-07-2011, 01:04 AM

Often in RNA-Seq analyses, normalization is done by simply dividing by the total number of mapped reads from a library. The case that you describe is precisely the reason why this is not a good idea, and why we in the DESeq paper, and independently Oshlack and Robinson in their paper on normalization, advise against it. DESeq's normalization looks at each gene, calculates a normalization factor from this gene's count, and then takes the median of the factors from all the genes. Then, a single gene should have little influence, even if it is very strongly expressed.

Still, it might be interesting to double-check this. Remove the mitochondrial transcripts from your count table and re-run the analysis. I'd hope that your results for all the other genes won't change much.
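To make the difference between the two normalizations concrete, here is a minimal Python sketch of the median-of-ratios idea described above, next to a total-count normalization. It is not DESeq's own code (DESeq is an R package). The count table, the gene names and the function names are all made up for illustration, and the zero-count handling is a simplifying assumption.

```python
import math
from statistics import median


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def median_of_ratios(counts):
    """One size factor per sample: median over genes of count / gene-wise geometric mean."""
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) == 0:
            continue  # a zero count makes the geometric mean zero, so skip the gene
        ref = geometric_mean(gene_counts)
        for sample, count in enumerate(gene_counts):
            ratios[sample].append(count / ref)
    return [median(r) for r in ratios]


def total_count(counts):
    """One factor per sample from library totals, scaled to geometric mean 1."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(g[s] for g in counts.values()) for s in range(n_samples)]
    ref = geometric_mean(totals)
    return [t / ref for t in totals]


# Hypothetical count table: sample 2 is sequenced twice as deeply as sample 1,
# and one transcript ("mito") is additionally far more abundant in sample 2.
counts = {
    "geneA": [100, 200],
    "geneB": [200, 400],
    "geneC": [50, 100],
    "geneD": [400, 800],
    "geneE": [30, 60],
    "mito": [1000, 20000],
}
without_mito = {gene: c for gene, c in counts.items() if gene != "mito"}

mor_all = median_of_ratios(counts)
mor_trimmed = median_of_ratios(without_mito)
tc_all = total_count(counts)

print("median of ratios, all genes:   ", mor_all)
print("median of ratios, mito removed:", mor_trimmed)
print("total count, all genes:        ", tc_all)
```

On this toy table the median-of-ratios factors of the two samples differ by a factor of 2, with or without the abundant transcript, which matches the depth difference built into the ordinary genes. The total-count factors differ by more than tenfold, because the single abundant transcript dominates the library totals.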

**kirby** · 04-08-2011, 12:54 PM

I tried removing the offending abundant transcript and running DESeq, but the sample in which 50% of the reads came from that one gene was still an outlier with respect to the other replicates. I also tried removing the outlying sample altogether. I have just 3 replicates per condition.

Anyway, it looks like the normalization implemented by DESeq is pretty robust, because I got similar lists of differentially expressed genes regardless of whether I ran the analysis using all the replicates, after removing the outlying sample, or after removing the very abundant transcript.

**carmeyeii** · 12-21-2012, 08:25 AM

Originally posted by kirby:

I'm concerned that this transcript is going to skew the normalization procedure used by DESeq and I wonder if it would be best to remove the counts for this gene before running DESeq? How are people dealing with libraries that have unusually high levels of ribosomal RNA contamination?

Cheers

Kirby,

I have a very similar problem to yours.

I am analyzing some Illumina libraries that appear to have a lot of ribosomal RNA contamination.

I'm using Bowtie to align the reads only to a specific set of sequences, and because of the differing amount of rRNA contamination in each sample, each of them maps a different percentage of reads to the dataset (some half of what others map), ranging from 0.3% to 1%.

I wonder if the amount of rRNA contamination in the preparation of a library can have an impact on the apparent expression level of a gene -- even though one normalizes its counts against the total number of reads that mapped.

What's your opinion on this subject?

Carmen

**Simon Anders** · 12-21-2012, 09:21 AM

Originally posted by carmeyeii:

I wonder if the amount of rRNA contamination in the preparation of a library can have an impact on the apparent expression level of a gene -- even though one normalizes its counts against the total number of reads that mapped.

This is a very nice example where a normalization by total number of reads would lead to wrong results while using one of the normalization methods I mention in post #2 will take care of the issue.

**carmeyeii** · 12-21-2012, 09:51 AM

Thanks, Simon.

So the norm factors produced by default in DESeq are indeed calculated in the manner you described above, I assume?

Carmen

**Simon Anders** · 12-21-2012, 02:14 PM

Originally posted by carmeyeii:

So the norm factors produced by default in DESeq are indeed calculated in the manner you described above, I assume?

Of course.

**carmeyeii** · 12-28-2012, 02:17 PM

Hello again,

I've gone through with the normalization and differential expression analysis for my samples, but it seems I'm still having trouble with the very diverse amount of rRNA contamination, which I suspect may be obscuring DE effects due to very large differences in counts among replicates.

The percentage of reads mapped to the small index of interest was very different from sample to sample, ranging from 0.2% to 0.99%, presumably because of the great difference in rRNA content in each library. Because of this, the size factors were very diverse, ranging from 0.4 to 4 in one set of comparisons. Given the great difference in rRNA contamination, I did not want to normalize by library size, as advised above by the authors of DESeq.

I am also concerned that the normalization used (the default method in DESeq), because it estimates size factors based on the changes in counts of each feature, while assuming that most features are not differentially expressed, will be too conservative if it is the case that most of the features in the present dataset are indeed upregulated.
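A toy calculation can illustrate this concern. The Python sketch below uses a made-up count table in which most features are truly upregulated, computes median-of-ratios size factors in the way described earlier in the thread, and looks at the fold changes that remain after normalization. It is only an illustration of the assumption, not DESeq's code, and every name and number in it is hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a table {feature: [count per sample]}."""
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) == 0:
            continue  # skip features with a zero count (simplifying assumption)
        ref = math.exp(sum(math.log(c) for c in row) / len(row))  # geometric mean
        for sample, count in enumerate(row):
            ratios[sample].append(count / ref)
    return [median(r) for r in ratios]


# Hypothetical table with equal sequencing depth in both samples:
# 8 of 10 features are truly twofold higher in sample 2, 2 are unchanged.
counts = {f"up{i}": [100 * i, 200 * i] for i in range(1, 9)}
counts.update({"flat1": [500, 500], "flat2": [80, 80]})

sf = size_factors(counts)

# Divide each count by its sample's size factor, then compare the two samples.
normalized = {name: [c / f for c, f in zip(row, sf)] for name, row in counts.items()}
fold_change = {name: row[1] / row[0] for name, row in normalized.items()}

print("size factors:", sf)
for name, fc in fold_change.items():
    print(f"{name}: normalized fold change = {fc:.2f}")
```

In this constructed case the size factors absorb the twofold shift: the eight truly upregulated features end up with a normalized fold change of 1, and the two unchanged features appear twofold down. Whether a real dataset behaves like this depends on how many features actually change and in which direction.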

Unfortunately, I did not find any significantly differentially expressed TEs. Perhaps the library being so contaminated is an obstacle to finding this, or perhaps I could use another normalization method to even out the rRNA contamination among samples?

In short, there is a huge amount (and diversity) of rRNA contamination between samples and the possibility that most features being compared MIGHT be differentially expressed, complicating the analysis a bit.

Below is one of the size factor vectors obtained and a representative histogram of what I'm getting.

Any input on this matter would be greatly appreciated!

Carmen

    > cds = estimateSizeFactors( cds )
    > sizeFactors(cds)
            1         2         3         4
    0.7007070 0.4144263 0.7905694 3.9685978

**dietmar13** · 12-28-2012, 07:35 PM

PoissonSeq (SAMseq) normalization

You could try the normalization method provided in SAMseq (samr package). It can be used as a stand-alone function from the very similar PoissonSeq package (available from CRAN). The usage is simple:

PS.Est.Depth(n, iter=5, ct.sum=5, ct.mean=0.5)

and you could feed the result to DESeq...

Perhaps you can post the result here and say whether this method improved your results.

**carmeyeii** · 01-03-2013, 12:56 PM

I will try this, and post any changes to the results here. thanks dietmar!

**Marianna85** · 03-15-2013, 06:12 AM

Originally posted by carmeyeii:

I will try this, and post any changes to the results here. thanks dietmar!

Hi Carmen,
did you obtain better results with this second normalization?
I'm dealing with a similar problem...

Dealing with super abundant transcripts in RNAseq
