I am currently analyzing RNA-Seq data from a bacterial transcriptome, and I have several questions about gene expression quantification and differential expression analysis.
1. Partially Overlapping reads
If I use RPKM for quantification, how should I count reads that only partially overlap the annotated gene regions in the reference genomes? Should I count each such read as 1 regardless of how much of it overlaps the gene, weight it by the fraction of the read that overlaps, or discard it and count only reads that lie completely within a gene annotation?
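To make the three options concrete, here is a minimal sketch of how I understand them. The function names, the `mode` labels, and the 0-based half-open coordinates are my own assumptions for illustration, not taken from any particular tool, and each read is reduced to a single (start, end) interval.

```python
def overlap_length(read_start, read_end, gene_start, gene_end):
    """Number of bases shared by a read and a gene (0-based, half-open intervals)."""
    return max(0, min(read_end, gene_end) - max(read_start, gene_start))


def count_reads(reads, gene_start, gene_end, mode):
    """Count reads for one gene under one of three hypothetical schemes.

    mode="any":       every read that overlaps the gene at all counts as 1
    mode="weighted":  each read counts as the fraction of its length inside the gene
    mode="contained": only reads lying completely inside the gene count as 1
    """
    total = 0.0
    for start, end in reads:
        read_length = end - start
        shared = overlap_length(start, end, gene_start, gene_end)
        if shared == 0:
            continue  # read does not touch the gene
        if mode == "any":
            total += 1.0
        elif mode == "weighted":
            total += shared / read_length
        elif mode == "contained":
            if shared == read_length:
                total += 1.0
        else:
            raise ValueError(f"unknown mode: {mode}")
    return total


def rpkm(count, gene_length, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length * total_mapped_reads)


# Toy example: one 1000 bp gene and four 100 bp reads.
gene = (100, 1100)
reads = [(50, 150), (200, 300), (1050, 1150), (2000, 2100)]
for mode in ("any", "weighted", "contained"):
    n = count_reads(reads, gene[0], gene[1], mode)
    print(mode, n, rpkm(n, gene[1] - gene[0], len(reads)))
```

In this toy case the two reads hanging over the gene boundaries are what separate the three schemes, so the choice matters most for short genes, where boundary reads are a larger share of the total.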
2. Paired-End reads
Since my data are paired-end reads, should the gap between the two ends be counted as contributing to coverage? For background, my RNA-Seq data do not come from a pure bacterial culture. The sample may contain several closely related bacterial species whose genomes are very similar but not identical.
3. Differential Expression Analysis Methodology
a. I've seen some posts discussing DE methods. t-tests were recommended when there are "many" biological replicates. Should 5 vs. 5 be considered "many", or just an "OK amount" of replicates?
b. Suppose it is OK to use a t-test. Since it is not clear whether the t statistic actually follows a t-distribution for these data, would it be more appropriate to use a permutation method to generate the null distribution?
These questions have bothered me for a long time. I would appreciate any help!
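For reference, this is the kind of permutation test I have in mind for a single gene, as a minimal sketch. I am assuming the inputs are normalized (for example log-transformed) expression values for the two groups, and that a Welch-type t statistic is used; both choices and all function names are my own assumptions. With 5 vs. 5 there are only C(10, 5) = 252 ways to split the samples, so the sketch enumerates all of them instead of sampling random permutations.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two groups (assumes nonzero variance in at least one group)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def exact_permutation_pvalue(a, b):
    """Two-sided p-value from all relabelings of the pooled samples."""
    pooled = list(a) + list(b)
    observed = abs(welch_t(a, b))
    hits = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling itself is always counted.
        if abs(welch_t(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Toy example: the two groups are perfectly separated.
condition_a = [1, 2, 3, 4, 5]
condition_b = [6, 7, 8, 9, 10]
print(exact_permutation_pvalue(condition_a, condition_b))
```

One thing this makes visible is that with 5 vs. 5 the smallest achievable two-sided p-value is 2/252 (about 0.008), which seems like a real limitation once multiple-testing correction across thousands of genes is applied.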
Thanks in advance,
Dezhi