I am working with high-throughput sequencing data and use DESeq to search for differentially expressed genes.
The protocol is similar to RNA-Seq, but not identical (we use a custom paired-end protocol whose tags are more likely to originate from the start sites of transcripts).
I cluster the tags to define "genes", then map the tags back onto these clusters to get per-cluster counts in the different conditions, and feed those counts into DESeq.
From what I understand, DESeq will estimate the library sizes with "estimateSizeFactors". However, this can only take into account tags that have contributed to my cluster definitions! But there are also tags which mapped somewhere, but did not contribute to a cluster definition.
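To make clear what I mean, here is a minimal Python sketch of the median-of-ratios calculation as I understand it from the DESeq description. It is not DESeq's actual code; the function name and the toy counts are made up for illustration. The point is that only the rows of the count table (my clusters) enter the calculation.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """Hypothetical helper: counts is a list of rows (one per cluster),
    each row holding the raw count of that cluster in every library."""
    n_libs = len(counts[0])
    # Use only clusters with a non-zero count in every library,
    # so that the logarithm is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-cluster geometric mean across libraries.
    log_geo_means = [sum(math.log(c) for c in row) / n_libs for row in usable]
    factors = []
    for j in range(n_libs):
        # Ratio of each cluster's count to its geometric mean, on the log scale.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        # The size factor of library j is the median of these ratios.
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy table (made-up numbers): library 2 has twice the counts of library 1.
counts = [[10, 20], [100, 200], [50, 100], [5, 10]]
size_factors = median_of_ratios_size_factors(counts)
print(size_factors)
```

Tags that mapped outside every cluster never appear in `counts`, so they cannot influence the result.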
If I were using standard RNA-Seq and took RefSeq genes as the probes over which I determine the raw counts, I think the problem would be the same:
Shouldn't I rather normalize by the total number of tags aligned? I.e. all tags mapped, not only the ones that contributed to clusters?
Imagine this situation: there are two libraries with very different sequencing depths. The difference in depth might be caused mainly by sporadic tags outside the probes (genes, clusters), while inside the probes the total number of tags is more or less the same in both libraries. If I then normalize library size using only the tags that map within clusters, don't I get wrong results?
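Here is that scenario with made-up numbers, as a small Python sketch. It uses simple total-count scaling (not DESeq's median-of-ratios method) only to show how far the two choices of "library size" can diverge; all names and values are hypothetical.

```python
# Made-up counts: three clusters with identical counts in both libraries,
# but library B has many more tags mapped outside any cluster.
in_cluster = {"libA": [100, 200, 50], "libB": [100, 200, 50]}
outside = {"libA": 50, "libB": 5000}

def scale_factors(totals):
    """Scale each library's total by the mean total across libraries."""
    mean_total = sum(totals.values()) / len(totals)
    return {lib: t / mean_total for lib, t in totals.items()}

def normalize(factors):
    """Divide every cluster count by its library's scale factor."""
    return {lib: [c / factors[lib] for c in in_cluster[lib]] for lib in in_cluster}

# Option 1: library size = tags that fall inside clusters.
by_cluster_tags = scale_factors({lib: sum(c) for lib, c in in_cluster.items()})
# Option 2: library size = all mapped tags, inside and outside clusters.
by_all_mapped = scale_factors(
    {lib: sum(in_cluster[lib]) + outside[lib] for lib in in_cluster}
)

normalized_by_cluster = normalize(by_cluster_tags)
normalized_by_all = normalize(by_all_mapped)

print(by_cluster_tags, normalized_by_cluster)
print(by_all_mapped, normalized_by_all)
```

With these numbers the in-cluster factors are equal, so the normalized cluster counts stay identical, whereas scaling by all mapped tags makes the same clusters look strongly different between the libraries. Which of the two reflects the biology is exactly what I am unsure about.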