Hello all. I am simulating a 'fishing' experiment on existing data: I run a DE analysis using only one replicate, then check what proportion of the first 300 ranked genes are also found by the original analysis, which uses 3 replicates (yes, I know a single replicate provides no statistical power). I am comparing 6 different DE analysis pipelines here.
1. TopHat2 alignment -> HTSeq-count -> DESeq2
-27/300 - 24357 genes detected
2. TopHat2 alignment -> HTSeq-count -> DESeq2 (regularized log2 FC)
-32/300 - 24358 genes detected
3. HISAT2 alignment -> HTSeq-count -> DESeq2
-24/300 - 24357 genes detected
4. HISAT2 alignment -> HTSeq-count -> DESeq2 (regularized log2 FC)
-28/300 - 24357 genes detected
5. HISAT2 alignment -> GFOLD-count -> GFOLD-diff (0.1 significance cutoff)
-13/300 - 27803 genes detected
6. HISAT2 alignment -> GFOLD-count -> GFOLD-diff (0.005 significance cutoff)
-24/300 - 27099 genes detected
My question is: why do you think the results differ, and which pipeline would be the most accurate to go with? GFOLD specializes in this type of comparison, but DESeq2 seems to recover more of the genes found when using 3 replicates.
Comparing the first 300 genes from pipeline 6 against those from pipeline 4, 217 genes were the same! A big question is why GFOLD recovers more of the reference genes (24/300 vs. 13/300) when the significance cutoff is lower.
Also note that GFOLD used a .bed file to count features and treats paired-end files (used here) as 2 single-end files, which may have thrown things off a little.
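For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of a top-N overlap count between two ranked gene lists. It is only an illustration, not the exact script behind the numbers above: the function names `top_n` and `overlap` and the toy gene IDs are made up, and it assumes each pipeline's output has already been reduced to a list of gene IDs sorted from most to least significant.

```python
def top_n(ranked_genes, n=300):
    """Return the set of the first n gene IDs from a ranked list."""
    return set(ranked_genes[:n])


def overlap(ranked_a, ranked_b, n=300):
    """Count the genes shared by the top n of two ranked lists.

    Returns (shared_count, shared_count / n).
    """
    shared = top_n(ranked_a, n) & top_n(ranked_b, n)
    return len(shared), len(shared) / n


# Toy data (hypothetical gene IDs): the reference ranks gene0, gene1, ...
# while the single-replicate run ranks only the even-numbered genes.
reference = [f"gene{i}" for i in range(1000)]
single_rep = [f"gene{i}" for i in range(0, 2000, 2)]

count, fraction = overlap(single_rep, reference, n=300)
print(f"{count}/300 shared ({fraction:.0%})")
```

Note that the result depends on both lists being ranked by a comparable criterion (e.g. adjusted p-value vs. GFOLD value), and that the two counting methods must use the same gene identifiers for the set intersection to be meaningful.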