Originally posted by pbseq:
BTW, for differential expression purposes, I use SeqMonk for harvesting raw data as follows: I select probes of interest (e.g., genes, mRNA, or intergenic regions), I count data by bases (I do not correct for the total number of reads or for gene length, and I do not log-transform), and then I feed the raw data to DESeq or edgeR. Up to now, it looks fine to me (at least given my limited experience). Any warnings?
Comment
-
Simon Andrews and Simon Anders: many thanks for the answers, now the issue is much clearer. Indeed, many tutorials simply use the word "counts", which, as I now understand, is intended to refer solely to "read counts". I also wonder whether, in future, packages and software could implement the option of DEG calculation based on base coverage, which in some cases may prove more accurate!
Comment
-
Confused newbie
Hi guys,
I have transcriptome data from Illumina and am using CLC Genomics Workbench for data analysis. I am not familiar with other programs for transcriptome analysis. The data are from 1 sample of control cells and 1 sample of treated cells (no replicates for either sample), and I am looking for differentially expressed genes.
The problem is the normalization step. The software offers 3 normalization methods: 1) scaling (normalization value: mean or median; baseline: median mean or median median), 2) quantile, 3) total reads per 1 million.
I don't know which one to choose. Help me.
Then there are statistical tests on Gaussian data or on proportions. How do I know which test is suitable for my data? I read that most people use Baggerly's test.
The thing with Baggerly's test (when I explore it in the software) is that the test output has both a p value and a false discovery rate (FDR) corrected p value. Which one is used for determining differentially expressed genes?
Thank you.
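To make the difference between the two columns in the question concrete, here is a minimal sketch of an FDR correction. It assumes the software's "FDR p-value correction" is the Benjamini-Hochberg procedure, which the thread does not confirm; the function name and the p values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(pvalues)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so each adjusted value
    # is never larger than the one ranked after it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

The raw p value refers to one gene tested in isolation; the adjusted value accounts for testing many genes at once, so thresholding the adjusted column bounds the expected fraction of false positives in the resulting gene list.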
Comment
-
Hi,
I am currently working with SAGE data of multiple conditions. I want to analyze the data with the R package baySeq. Before that I want to normalize the data with TMM (edgeR). Do I need to divide the count data by the normalization factor or can I just substitute the library size by the effective library size for the use in baySeq?
Thanks,
Elena
Comment
-
Originally posted by syambmed:
Hi guys,
I have transcriptome data from Illumina and am using CLC Genomics Workbench for data analysis. I am not familiar with other programs for transcriptome analysis. The data are from 1 sample of control cells and 1 sample of treated cells (no replicates for either sample), and I am looking for differentially expressed genes.
The problem is the normalization step. The software offers 3 normalization methods: 1) scaling (normalization value: mean or median; baseline: median mean or median median), 2) quantile, 3) total reads per 1 million.
I don't know which one to choose. Help me.
Then there are statistical tests on Gaussian data or on proportions. How do I know which test is suitable for my data? I read that most people use Baggerly's test.
The thing with Baggerly's test (when I explore it in the software) is that the test output has both a p value and a false discovery rate (FDR) corrected p value. Which one is used for determining differentially expressed genes?
Thank you.
I do not know what you might mean by the "Baggerly test". Does your software reference the paper describing it?
The first that comes to my mind regarding a test suitable for RNA-Seq differential expression analysis that I associate with Keith Baggerly is the one described in the following 2003 paper:
Baggerly KA, Deng L, Morris JS, Aldaz CM. Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 19(12):1477-83, 8/2003.
Incidentally, this is one of the first papers to criticise the fact that RNA-Seq (or back then, SAGE) studies routinely ignore that an analysis without replicate samples cannot be used to derive reliable conclusions.
So, in the end, it does not matter what you do, as without replicates, you will not get far anyway. (See numerous posts in earlier threads on the matter of replicates.)
Comment
-
Originally posted by Simon Anders:
The normalization methods in DESeq and edgeR are meant to be fed with raw, integer counts. Please do not divide by transcript length before the DE analysis; it will screw up the whole method. For visualization purposes, you may want to divide the normalized counts by transcript length afterwards. (In DESeq, you get normalized counts by dividing the raw counts by the appropriate size factor.) However, think carefully about what to use as transcript length. The original idea of using the sum of all exon lengths was not that good (see, e.g., the cufflinks paper).
I want to use the TMM method for normalization, but I have a question: how can I get the normalized counts after TMM? Thank you very much.
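As a rough illustration of the question above: as far as I know, edgeR does not alter the counts with the TMM factors; it multiplies each library size by its factor to get an effective library size, and values for display are then computed against that. The sketch below does not compute TMM factors; it takes them as given, and the function name and all numbers are hypothetical.

```python
def cpm_with_effective_library_size(counts, lib_sizes, norm_factors):
    """Counts per million using effective library sizes.

    counts: list of rows (one per gene), each with one raw count per sample.
    lib_sizes: total read count per sample.
    norm_factors: TMM normalization factor per sample (taken as given).
    """
    # Effective library size = library size * normalization factor.
    effective = [size * factor for size, factor in zip(lib_sizes, norm_factors)]
    return [
        [count / eff * 1e6 for count, eff in zip(row, effective)]
        for row in counts
    ]


# Hypothetical example: two samples, two genes.
lib_sizes = [1_000_000, 2_000_000]
norm_factors = [1.25, 0.8]  # made-up factors whose product is 1
counts = [[125, 160], [250, 640]]

for row in cpm_with_effective_library_size(counts, lib_sizes, norm_factors):
    print([round(value, 2) for value in row])
```

The raw integer counts stay untouched for the test itself; scaled values like these are only meant for plotting or inspection.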
Comment
-
Originally posted by Simon Anders:
No, it doesn't, because it doesn't need to.
This is why I asked what you want to do with your data.
If you want to test for differential expression, you want to compare the expression of the same gene in different samples. As the gene has the same length in all your samples, there is no point in dividing by the gene length. You only mask the information on how precise your measurement is.
If you want to compare a gene with another gene, then you may want to divide by gene length, but you should be aware that such a comparison opens a whole new can of worms.
Regarding the necessity of gene length normalization, the following two papers [1,2] give an explicit explanation. At the same expression level, a longer gene will produce more reads. Take two genes, A and B, with lengths of 1 kb and 2 kb respectively. The counts in the two samples are 100 and 200 for gene A, and 200 and 400 for gene B, so obviously both genes change expression by the same factor. The expected p value across the two samples should therefore be the same for these two genes. But if you don't normalize the counts by gene length, the p value will be more significant for gene B, because it has more reads, although the two genes have the same fold change. The same applies to library size differences: genes in a larger library will have more counts, which will make the p value more significant. Does this make sense?
So, in my opinion, the count table fed to DESeq or edgeR should be normalized by gene length and library size.
1. Oshlack, A. and M. J. Wakefield (2009). "Transcript length bias in RNA-seq data confounds systems biology." Biol Direct 4: 14.
2. Gao, L., et al. (2011). "Length bias correction for RNA-seq data in gene set analyses." Bioinformatics 27(5): 662-669.
Last edited by pengchy; 05-11-2013, 09:33 PM.
Comment
-
Of course, the longer gene will have the lower p value. You seem to be under the impression that equal fold changes should lead to equal p values. However, the p value informs you about the strength of evidence against the null hypothesis of equal expression strength, and for two genes with the same fold change, the evidence is stronger for the one with more counts, hence its p value should be lower.
The papers you cite merely point out that a naive use of gene-set enrichment analysis methods on such p values gives biased results. In essence, such methods expect, as input, a measure of effect strength, and if you give them something that is confounded with inferential power, you get problems. This was always clear, but in microarray times, nobody cared because there, inferential power does not depend as strongly on expression strength.
By the way, it's the dependence on read count that matters here, and read count is determined by both expression strength and gene length, with the latter being the smaller contribution. This is why I feel that the two papers focus a bit on the wrong aspect of the issue.
Last edited by Simon Anders; 05-11-2013, 10:12 PM.
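The point about equal fold changes and unequal p values can be checked on the numbers from the gene A / gene B example (100 vs. 200 and 200 vs. 400). The sketch below is a deliberately simplified stand-in, not what DESeq or edgeR do: it assumes equal library sizes and pure Poisson counting noise with no biological variability, so that, conditional on a gene's total count, the count in the first sample is binomial with probability 0.5 under the null. The function name is made up.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p value for k out of n reads falling in sample 1,
    under the null hypothesis that each read is equally likely to come
    from either sample (probability 0.5)."""
    observed = comb(n, k)
    # With p = 0.5 every outcome i has probability comb(n, i) / 2**n, so
    # "at most as likely as observed" can be decided on exact integers.
    extreme = sum(
        comb(n, i) for i in range(n + 1) if comb(n, i) <= observed
    )
    return extreme / 2 ** n


p_gene_a = two_sided_binomial_p(100, 100 + 200)  # 1 kb gene: 100 vs. 200
p_gene_b = two_sided_binomial_p(200, 200 + 400)  # 2 kb gene: 200 vs. 400

print(f"gene A: p = {p_gene_a:.3g}")
print(f"gene B: p = {p_gene_b:.3g}")
```

Both genes show a two-fold change, but gene B's p value comes out smaller simply because twice as many reads support the same ratio. Dividing the counts by gene length before testing would discard exactly that information.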
Comment
-
Hi Simon,
Thank you for your reply.
I agree with you that the p values should be lower for the genes with more counts when they have the same expression level because they are longer. So the p values of genes with different lengths will not be comparable if the counts are not normalized by gene length. This is my point.
Comment
-
Hence, the power to detect differential expression depends strongly on
the count, and the count in turn depends on two things, namely (i) the
expression strength (say, averaged over both conditions) and (ii) the
gene length (because longer genes give rise to more fragments at the
same expression level).
In a subsequent analysis looking, e.g., for enrichment in gene
categories, this causes bias. However, this bias should not and cannot
be dealt with by the method to test for differential expression. It
should, however, be taken into account by the enrichment test.
When adjusting such a test, I would suggest to use directly the count
level as input, and not the transcript length, as the latter is only
half of the story.
The quoted message above is taken from your reply on the BioC mailing list to exactly the same question as the one I asked: https://stat.ethz.ch/pipermail/bioco...st/035137.html
I still cannot understand why gene length normalization is not needed at the differential expression detection step. If this bias indeed exists, the p values of the differential expression test will not be credible. How can this bias be reduced at the DE detection step, instead of being passed on to the enrichment analysis step?
Comment
-
Hi dpryan,
Sorry, I can't catch your meaning.
The p value or adjusted p value will be used to detect differentially expressed genes. If differences in p value are caused by gene length, the ordering of the p values will misguide the follow-up biological experiments.
Thank you.
Last edited by pengchy; 05-12-2013, 06:48 AM.
Comment