
**simonandrews** · 03-13-2011, 11:53 PM

Originally posted by pbseq

Maybe slightly off topic, but it is related to RNA-seq counting:
I always hear about RPKM, but counting gene expression by covered bases (and not by number of reads) looks more precise to me. Base counting instead of read counting is very easy (e.g. with SeqMonk), but it is so rarely mentioned that I'm wondering if it's OK for downstream applications.

If your reads are all the same length then counting reads or bases amounts to the same thing. The reason the base counting option in SeqMonk is useful is because in the initial analysis a quantitation is usually carried out for each exon. Spliced reads will be split between exons, so if you simply count reads then a spliced read will be counted twice and thus any spliced reads will exert an undue influence on the quantitation. Doing a base count allows a read to be split proportionally between the exons it overlaps and therefore gives every read the same weight in the quantitation.
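The proportional splitting described above can be sketched in a few lines. This is only an illustration of the idea, not SeqMonk's actual implementation; the exon coordinates, the read blocks, and the function names are made up for the example.

```python
def overlap(start_a, end_a, start_b, end_b):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def split_read_by_bases(read_blocks, exons):
    """Share one read among exons in proportion to the bases it covers in each."""
    covered = {
        name: sum(overlap(bs, be, es, ee) for bs, be in read_blocks)
        for name, (es, ee) in exons.items()
    }
    total = sum(covered.values())
    if total == 0:
        return {name: 0.0 for name in exons}
    return {name: n / total for name, n in covered.items()}


# Hypothetical exons and one spliced read with two aligned blocks.
exons = {"exon1": (100, 200), "exon2": (500, 600)}
spliced_read = [(170, 200), (500, 520)]  # 30 bases in exon1, 20 in exon2

# Base-proportional weights: the read contributes a total weight of 1.
weights = split_read_by_bases(spliced_read, exons)

# Naive per-exon read counting: the same read is counted once per exon it touches.
naive_read_count = sum(
    1
    for es, ee in exons.values()
    if any(overlap(bs, be, es, ee) > 0 for bs, be in spliced_read)
)
```

With read counting the spliced read adds 2 to the per-exon totals, while the base-proportional weights always sum to 1 per read.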

**Simon Anders** · 03-14-2011, 02:34 AM

Originally posted by pbseq

BTW, for differential expression purposes, I use SeqMonk for harvesting raw data as follows: I select probes of interest (e.g. genes, mRNA or intergenic regions), I count data by bases (I do not correct for the total number of reads or for gene length, and I don't log transform) and then feed the raw data to DESeq or edgeR. Up to now it looks fine to me (at least with my limited experience). Any warnings?

Yes, you get too many false positives. There are two principal sources of noise in RNA-Seq: (a) the actual variation in concentration of a transcript between samples, and (b) the shot noise. Imagine you have two samples with exactly the same concentration of a given transcript. Will you get the same counts? Of course not, because there is still a random element in how many transcripts actually get sequenced. This is called shot noise; it follows the Poisson distribution and constitutes the theoretical lower limit of the measurement precision in RNA-Seq. DESeq and edgeR compute the shot noise from the number of reads. (The more reads, the smaller the shot noise is relative to the count.) If you feed these tools the number of bases instead of the number of reads, they will severely underestimate the within-group variance and call too many hits.
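A back-of-the-envelope sketch of the size of this effect, using only the Poisson property that the relative standard deviation of a count with mean λ is 1/√λ. The read length and expected read count below are assumed values for illustration.

```python
import math


def poisson_relative_sd(expected_count):
    """Relative standard deviation (sd / mean) of a Poisson count."""
    return 1.0 / math.sqrt(expected_count)


read_length = 50      # assumed read length in bases
expected_reads = 100  # assumed expected read count for one gene

# The real shot noise is set by the number of reads that were sampled.
true_rel_sd = poisson_relative_sd(expected_reads)

# Base counts are roughly reads * read_length. Rescaling does not change the
# real relative noise, but a tool that treats the base counts as Poisson
# counts would assume this much smaller value instead:
assumed_rel_sd = poisson_relative_sd(expected_reads * read_length)

# Factor by which the shot noise is underestimated: sqrt(read_length).
underestimate_factor = true_rel_sd / assumed_rel_sd
```

Under these assumptions the shot noise is underestimated by a factor of √50, roughly 7, which is why base counts make the tests look far more certain than they are.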

**pbseq** · 03-15-2011, 06:22 AM

Simon Andrews and Simon Anders: many thanks for the answers, the issue is much clearer now. Indeed, many tutorials simply use the word "counts", which, as I now understand, is meant to refer solely to "read counts". I also wonder whether, in future, packages and software could implement the option of calling DEGs based on base coverage, which in some cases may prove more accurate!

**slny** · 03-16-2011, 09:59 AM

Could I simply summarize that RPKM should be used for changes with genes, but TMM in edgeR and the normalization (Sorry, don't remember the name) in DESeq for changes with samples?
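For reference, RPKM as it is usually defined (reads per kilobase of transcript per million mapped reads) can be written as a one-line function. The numbers in the example are made up.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2 kb long, in a library of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
```

Because it divides by gene length, RPKM is a descriptive measure for comparing genes with each other; it is not the raw integer input that DESeq and edgeR expect.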

**syambmed** · 04-10-2011, 09:17 PM

Confused newbie

Hi guys,

I have transcriptome data from Illumina and am using CLC Genomics Workbench for data analysis. I am not familiar with other programs for transcriptome analysis. The data are from 1 sample of control cells and 1 sample of treated cells (no replicates for either sample) and I am looking for differentially expressed genes.

The problem is the normalization step. The software offers 3 normalization methods: 1) scaling [option for normalization value = mean or median, baseline = median mean or median median] 2) quantile 3) total reads per 1 million.

I don't know which one to choose.. T_T Help me..

Then there are statistical tests on Gaussian data or on proportions. How do I know which test my data is suitable for? I read that most people use Baggerly's.

The thing with the Baggerly test (when I explore it in the software) is that the test output has a p-value and a false discovery rate (FDR) corrected p-value. Which one is used for determining differentially expressed genes?

Thank you.

**edue** · 11-03-2011, 03:17 AM

Hi,

I am currently working with SAGE data from multiple conditions. I want to analyze the data with the R package baySeq. Before that I want to normalize the data with TMM (edgeR). Do I need to divide the count data by the normalization factor, or can I just substitute the effective library size for the library size when using baySeq?

Thanks,
Elena

**Simon Anders** · 11-04-2011, 12:11 PM

Originally posted by syambmed

Hi guys,

I have transcriptome data from Illumina and am using CLC Genomics Workbench for data analysis. I am not familiar with other programs for transcriptome analysis. The data are from 1 sample of control cells and 1 sample of treated cells (no replicates for either sample) and I am looking for differentially expressed genes.

The problem is the normalization step. The software offers 3 normalization methods: 1) scaling [option for normalization value = mean or median, baseline = median mean or median median] 2) quantile 3) total reads per 1 million.

I don't know which one to choose.. T_T Help me..

Then there are statistical tests on Gaussian data or on proportions. How do I know which test my data is suitable for? I read that most people use Baggerly's.

The thing with the Baggerly test (when I explore it in the software) is that the test output has a p-value and a false discovery rate (FDR) corrected p-value. Which one is used for determining differentially expressed genes?

Thank you.

I don't know the Genomics Workbench, but your post illustrates precisely the issues I have with these software suites. They give you easy access to many different methods published in the literature and give you the illusion that you could perform a sound analysis without having to read all the papers describing, discussing and comparing these methods.

I do not know what you might mean by the "Baggerly test". Does your software reference the paper describing it?

The first that comes to my mind regarding a test suitable for RNA-Seq differential expression analysis that I associate with Keith Baggerly is the one described in the following 2003 paper:

Baggerly KA, Deng L, Morris JS, Aldaz CM. Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 19(12):1477-83, 8/2003.

Incidentally, this is one of the first papers to criticise that RNA-Seq (or back then, SAGE) assays routinely ignore the fact that an analysis without replicate samples cannot be used to derive reliable conclusions.

So, in the end, it does not matter what you do, as without replicates, you will not get far anyway. (See numerous posts in earlier threads on the matter of replicates.)

**luoye** · 12-27-2012, 07:00 PM

Hi everyone,
I want to use the TMM method for normalization, but I have a question: how can I get the normalized counts after TMM? Thank you very much.
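One common way to express TMM-normalized values is counts per million computed against the effective library size, i.e. the library size multiplied by the TMM normalization factor. The sketch below assumes the factors have already been estimated (in edgeR this is what `calcNormFactors` does); the counts, library sizes, factor values, and the helper name `tmm_normalized_cpm` are all made up for illustration.

```python
def tmm_normalized_cpm(counts, lib_sizes, norm_factors):
    """Counts per million using effective library sizes (lib_size * norm_factor).

    counts is a list of gene rows, each holding one raw count per sample.
    """
    effective = [size * factor for size, factor in zip(lib_sizes, norm_factors)]
    return [[c / eff * 1e6 for c, eff in zip(row, effective)] for row in counts]


counts = [[100, 300], [50, 120]]     # made-up raw counts, 2 genes x 2 samples
lib_sizes = [1_000_000, 2_000_000]   # total reads per sample
norm_factors = [1.25, 0.8]           # hypothetical TMM factors, assumed precomputed

cpm = tmm_normalized_cpm(counts, lib_sizes, norm_factors)
```

These values are meant for inspection and plotting; the differential expression test itself should still be given the raw integer counts.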

**luoye** · 12-27-2012, 07:07 PM

Originally posted by Simon Anders

The normalization methods in DESeq and edgeR are meant to be fed with raw, integer counts. Please do not divide by transcript length before the DE analysis; it will screw up the whole method. For visualization purposes, you may want to divide the normalized counts by transcript length afterwards. (In DESeq, you get normalized counts by dividing the raw counts by the appropriate size factor.) However, think carefully about what to use as transcript length. The original idea of using the sum of all exon lengths was not that good (see, e.g., the cufflinks paper).

Hi everyone,
I want to use the TMM method for normalization, but I have a question: how can I get the normalized counts after TMM? Thank you very much.
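The DESeq remark in the quote ("dividing the raw counts by the appropriate size factor") can be illustrated with a plain-Python sketch of median-of-ratios size factors, the estimator described in the DESeq paper. This is a simplified reimplementation for illustration, not the package's code; the count matrix is made up and the function names are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of gene rows."""
    n_samples = len(counts[0])
    # Only genes with nonzero counts in every sample have a usable geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalized_counts(counts, factors):
    """Divide each raw count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Made-up counts: the second sample was sequenced twice as deeply as the first.
counts = [[10, 20], [100, 200], [30, 60], [0, 5]]
sf = size_factors(counts)
norm = normalized_counts(counts, sf)
```

In this toy example the second size factor is twice the first, so the normalized counts of the first three genes agree between the two samples.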

**pengchy** · 05-11-2013, 09:30 PM

Originally posted by Simon Anders

No, it doesn't, because it doesn't need to.

This is why I asked what you want to do with your data.

If you want to test for differential expression, you want to compare the expression of the same gene in different samples. As the gene has the same length in all your samples, there is no point in dividing by the gene length. You only mask the information on how precise your measurement is.

If you want to compare a gene with another gene, then you may want to divide by gene length, but you should be aware that such a comparison opens a whole new can of worms.

Hi Simon,

Regarding the need for gene length normalization, the following two papers [1,2] give an explicit explanation. At the same expression level, a longer gene will produce more reads. Take two genes A and B, for example, with lengths of 1 kb and 2 kb respectively. Their counts in two samples are 100 and 200 for gene A, and 200 and 400 for gene B; obviously they are expressed at the same level. So the expected p value should be the same for these two genes across the two samples. But if you don't normalize the counts by gene length, the p value will be more significant for gene B, because it has more reads, although both genes have the same fold change. The same goes for library size differences: samples with a larger library size will have more counts, which will make the p value more significant. Does this make sense?

So, in my opinion, the count table fed to DESeq or edgeR should be normalized by gene length and library size.

1. Oshlack, A. and M. J. Wakefield (2009). "Transcript length bias in RNA-seq data confounds systems biology." Biol Direct 4: 14.
2. Gao, L., et al. (2011). "Length bias correction for RNA-seq data in gene set analyses." Bioinformatics 27(5): 662-669.

**Simon Anders** · 05-11-2013, 10:09 PM

Of course, the longer gene will have the lower p value. You seem to be under the impression that equal fold changes should lead to equal p values. However, the p value informs you about the strength of evidence against the null hypothesis of equal expression strength -- and for two genes with the same fold change, the evidence is stronger for the one with more counts, and hence the p value should be lower.

The papers you cite merely point out that a naive use of gene-set enrichment analysis methods on such p values give biased results. In essence, such methods expect, as input, a measure of effect strength, and if you give them something that is confounded with inferential power, you get problems. This was always clear, but in microarray times, nobody cared because there, inferential power does not depend as strongly on expression strength.

By the way, it's the dependence on read count that matters here, and read count is determined by both expression strength and gene length, with the latter being the smaller contribution. This is why I feel that the two papers focus a bit on the wrong aspect of the issue.
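The point about equal fold changes and unequal p values can be checked with pengchy's own numbers. The sketch below uses a simple exact binomial test, which assumes equal library sizes and pure Poisson shot noise with no replicates; it is not the negative binomial test that DESeq or edgeR perform, and the function name is made up for the example.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Two-sided exact p value for k out of n reads falling in sample 1,
    under the null hypothesis that both samples are equally likely (p = 0.5)."""
    lower = min(k, n - k)
    tail = sum(comb(n, i) for i in range(lower + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Same two-fold change, different numbers of reads.
p_gene_a = binomial_two_sided_p(100, 100 + 200)  # gene A: 100 vs 200 reads
p_gene_b = binomial_two_sided_p(200, 200 + 400)  # gene B: 200 vs 400 reads
```

Both genes show a two-fold change, but gene B gets the smaller p value simply because more reads provide more evidence against equal expression. Dividing the counts by gene length beforehand would hide that information rather than correct anything.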

**pengchy** · 05-11-2013, 10:42 PM

Hi Simon,

Thank you for your reply.

I agree with you that the p values should be lower for genes with more counts when they have the same expression level because they are longer. So the p values of genes with different lengths will not be comparable if the counts are not normalized by gene length. This is my point.

**pengchy** · 05-11-2013, 11:27 PM

Hence, the power to detect differential expression depends strongly on
the count, and the count in turn depends on two things, namely (i) the
expression strength (say, averaged over both conditions) and (ii) the
gene length (because longer genes give rise to more fragments at the
same expression level).

In a subsequent analysis looking, e.g., for enrichment in gene
categories, this causes bias. However, this bias should not and cannot
be dealt with by the method to test for differential expression. It
should, however, be taken into account by the enrichment test.

When adjusting such a test, I would suggest to use directly the count
level as input, and not the transcript length, as the latter is only
half of the story.

Hi Simon,

The quoted message is taken from your reply on the BioC mailing list to exactly the same question that I asked: https://stat.ethz.ch/pipermail/bioco...st/035137.html

I still cannot understand why gene length normalization is not needed at the differential expression detection step. If this bias indeed exists, the p values for differential expression will not be credible. How can this bias be reduced at the DE detection step, instead of being passed on to the enrichment analysis step?

**dpryan** · 05-12-2013, 01:12 AM

Originally posted by pengchy

So the p values of genes with different lengths will not be comparable if the counts are not normalized by gene length. This is my point.

You're misusing p-values, that's the source of the confusion.

**pengchy** · 05-12-2013, 06:13 AM

Hi dpryan,

Sorry, I don't catch your meaning.
The p value or adjusted p value will be used to detect differentially expressed genes. If the difference in p values is caused by gene length, the ordering of the p values will misguide the follow-up biology experiments.

Thank you.
