I have a question concerning the data normalisation used in DESeq. While I understand the need to normalise for differences in library size that originate from library preparation, I am unsure how to handle differences that originate from the biological variance between my samples.
I hope the following example makes my point clear:
RNA samples 1 and 2 were extracted from different tissues/treatments. Library prep for small RNAs was performed with the same amount of input RNA, and the resulting reads were mapped to the whole genome and annotated against miRBase.
Now let's assume Sample 1 has 20 million reads and Sample 2 has 17 million reads after sequencing. After annotation, however, I end up with 8 million reads for Sample 1 and 16 million reads for Sample 2 that represent miRNAs.
If I follow the standard DESeq procedure, it will normalise my read counts with respect to the miRNA library sizes of 8 and 16 million reads, respectively. While this shows me the differences I would see if both samples contained the same amount of miRNA, it does not reflect the differences that were actually caused by the treatment. For instance, if I plot the insert sizes after adapter trimming, I can see a significant shift between the two samples caused by the treatment.
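To make the concern concrete, here is a minimal sketch in plain Python using the read numbers from my example. The single miRNA and its counts (80,000 and 160,000) are hypothetical, and the per-million scaling is only a simple stand-in: DESeq estimates its size factors differently (median of ratios), so the numbers are illustrative rather than what DESeq would report.

```python
# Read totals from the example above.
mirna_reads = {"sample1": 8_000_000, "sample2": 16_000_000}   # annotated miRNA reads
total_reads = {"sample1": 20_000_000, "sample2": 17_000_000}  # all sequenced reads

# Hypothetical counts for one miRNA (assumption, for illustration only).
counts = {"sample1": 80_000, "sample2": 160_000}


def per_million(count, library_size):
    """Scale a raw count to reads per million of the given library size."""
    return count / library_size * 1_000_000


# Scale once by the miRNA-only totals and once by the full sequencing depth.
by_mirna = {s: per_million(counts[s], mirna_reads[s]) for s in counts}
by_total = {s: per_million(counts[s], total_reads[s]) for s in counts}

fold_mirna = by_mirna["sample2"] / by_mirna["sample1"]
fold_total = by_total["sample2"] / by_total["sample1"]

print(f"fold change, scaled by miRNA reads: {fold_mirna:.2f}")
print(f"fold change, scaled by total reads: {fold_total:.2f}")
```

Scaled by the miRNA totals, this miRNA looks unchanged, while scaled by the total sequenced reads it is roughly 2.35-fold higher in Sample 2. That gap is the treatment effect I am worried about losing.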
I thought about adding a virtual gene to each sample that would account for the starting library sizes. In my example, that would add 12 million reads to Sample 1 and only 1 million reads to Sample 2. However, I'm not sure how that would alter DESeq's variance estimation.
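As a rough check of the virtual-gene idea, the sketch below re-implements median-of-ratios size factors in plain Python, which is my understanding of how DESeq estimates them. It is not DESeq's own code; the miRNA names and counts are hypothetical toy values, and only the virtual gene's 12 and 1 million reads come from my example. It says nothing about the variance estimation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps a gene name to a list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Per sample: median ratio of its counts to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy counts for [sample 1, sample 2].
toy = {
    "miR-a": [100, 200],
    "miR-b": [500, 1000],
    "miR-c": [50, 100],
    "miR-d": [1000, 2000],
    "miR-e": [200, 800],
}

# Same table plus the virtual gene holding the non-miRNA reads.
toy_with_virtual = dict(toy)
toy_with_virtual["virtual"] = [12_000_000, 1_000_000]

without_virtual = size_factors(toy)
with_virtual = size_factors(toy_with_virtual)

print("size factors without virtual gene:", without_virtual)
print("size factors with virtual gene:   ", with_virtual)
```

In this toy table the size factors come out the same with and without the virtual gene, because a median is insensitive to a single extreme row. If that carries over to real data, one virtual gene on its own would not shift the normalisation in the way I intended.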
Any help would be appreciated.
Regards
Benedikt