Hi everyone,
I have read and searched a lot about this topic but cannot find a solution to my problem. Maybe you will be able to help me.
I am doing an internship in bioinformatics for my master's and I have to deal with RNA-seq data. I have 2 sets of experiments (A and B), each having 2 Illumina runs of two stages (1 and 2) of a plant. A and B were not done at the same time and the technology is a bit different, giving:
runs of about 30M reads for A,
runs of about 80M reads for B.
For a given stage, the log(RPKM) values of the replicates are very well correlated.
When I use edgeR to estimate a common dispersion from the counts of each run, looking for differentially expressed genes between the two stages, I obtain 0.86, which seems far too big given the correlation of the RPKM. Moreover, the number of differentially expressed genes is not consistent with what we know from our Affymetrix data (about 250 genes when we expected about 1000).
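To put the 0.86 in perspective, here is a small sketch of the arithmetic in plain Python (not my actual edgeR code). It assumes the usual negative binomial reading, where the variance is mean + dispersion * mean^2 and the square root of the dispersion is the biological coefficient of variation (BCV). The function names and the mean count of 1000 are only illustrative.

```python
import math


def bcv_from_dispersion(dispersion: float) -> float:
    """Biological coefficient of variation implied by an NB dispersion."""
    return math.sqrt(dispersion)


def nb_variance(mean: float, dispersion: float) -> float:
    """Negative binomial variance: Poisson part plus overdispersion part."""
    return mean + dispersion * mean ** 2


common_dispersion = 0.86  # value reported in this post
bcv = bcv_from_dispersion(common_dispersion)

# Illustrative gene with an expected count of 1000 (made-up number).
example_mean = 1000.0
example_sd = math.sqrt(nb_variance(example_mean, common_dispersion))

print(f"BCV = {bcv:.3f}")
print(f"SD of a count with mean {example_mean:.0f} = {example_sd:.1f}")
```

Under that reading, a dispersion of 0.86 corresponds to replicate-to-replicate variation of roughly 93% of the expression level, which is what looks inconsistent with the well-correlated RPKM values.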
I first thought about filtering out the genes that have a count per million below 1 in all conditions. I then obtain a dispersion of 0.76: still too high...
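To be explicit about the filter, this is a minimal sketch of the rule in plain Python rather than the R code I actually ran. The gene names, counts, and library sizes are made up for illustration, and the helper names `cpm` and `keep_gene` are hypothetical.

```python
def cpm(counts, library_sizes):
    """Counts per million for one gene across all samples."""
    return [c / n * 1e6 for c, n in zip(counts, library_sizes)]


def keep_gene(counts, library_sizes, min_cpm=1.0):
    """Keep a gene unless its CPM is below min_cpm in every sample."""
    return any(value >= min_cpm for value in cpm(counts, library_sizes))


# Made-up library sizes: two runs of about 30M reads, two of about 80M.
library_sizes = [30_000_000, 30_000_000, 80_000_000, 80_000_000]

# Made-up raw counts per gene, one value per sample.
raw_counts = {
    "geneA": [10, 12, 20, 25],      # below 1 CPM everywhere, so dropped
    "geneB": [300, 280, 900, 850],  # well above 1 CPM, so kept
}

kept = [gene for gene, counts in raw_counts.items()
        if keep_gene(counts, library_sizes)]
print(kept)
```

Note that the same raw count gives a very different CPM in a 30M-read run and an 80M-read run, which is why the threshold is applied on CPM and not on raw counts.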
I also thought about getting variance-stabilized data (with DESeq) to use with limma, but it does not make sense if the samples are not paired, does it?
I am wondering whether I am doing something wrong here and whether there is any filtering/computation that I should have done to obtain a more consistent common dispersion.
Any ideas would be really appreciated,
François