Hello,
For a class project, I am mapping reads and counting reads per gene with RSEM for five different published RNA-seq datasets located at GEO / SRA. Most of the datasets were from raw RNA-seq data. Then, I am taking the outputs of RSEM and using them as inputs for DESeq and EdgeR. The goal of the project is to simply compare EdgeR and DESeq results and try to understand why they are different.
My problem is that for 4/5 of the datasets, I am seeing that several genes are differentially expressed even after filtering low count data.
I applied some heavy-duty arbitrary non-specific filtering and it still did not help. In this filtration, I took the max number of reads for each sample (in columns) and sorted the genes (in rows) by their maxes from highest to lowest. Then, I removed the bottom 75% of the data. Additionally, from the remaining 25% of this data, I filtered out genes where the ranks (with rank #1 having the highest number of reads) of each sample were all under 1000. I did this secondary filter to remove genes with excessive numbers of reads. This secondary filter did not help at all.
Despite this filtering, I cannot seem to get a reasonable number of differentially expressed genes. I am getting too many genes as mentioned earlier. There are 2 biological replicates for two different conditions in each of my datasets. The 4 datasets that are causing a problem are comparing gene expression between 1) embryo and endosperm in arabidopsis seeds, 2) benign and malignant prostate cancer cell lines, 3) mouse embryonic stem cells and mouse embryonic fibroblasts, and 4) kidney and liver tissue in human. I downloaded these datasets from GEO / SRA in sra.lite format and converted them to fastq. They were all single-read data.
The RNA-Seq dataset that seems to be working is comparing gene expression in estrogen-treated and untreated breast cancer cell lines. I obtained about 500 to 1000 differentially expressed genes for adjusted pvalues between 0.01 and 0.1 by FDR (BH) for both packages. Also, for this dataset, only a bed file was available at GEO. So, I just wrote custom perl scripts to convert the bed file to a fasta file to input it into RSEM (I extracted 32 base reads from the start coordinates in the genome with the proper orientation). It is also important to note that I excluded reads that mapped to rRNA according to the bed file. The bedfile basically defined rRNA as a chromosome.
I thought it interesting to note that when I did not filter the low count data from the estrogen RNA-seq dataset, DESeq reported around 7000 genes for the pvalue range mentioned above whereas edgeR still reported around 500 to 1000 genes. In the unfiltered dataset there were ~27,000 genes.
I know this is a lot of information but if anyone has any suggestions or advice on how to deal with this issue, I would greatly appreciate it. My project is due Thursday, and I getting tied up in the preprocessing and not the analysis, which was the whole point of the project.
Thanks,
Clayton
Purdue University
Graduate Student
For a class project, I am mapping reads and counting reads per gene with RSEM for five different published RNA-seq datasets located at GEO / SRA. Most of the datasets were from raw RNA-seq data. Then, I am taking the outputs of RSEM and using them as inputs for DESeq and EdgeR. The goal of the project is to simply compare EdgeR and DESeq results and try to understand why they are different.
My problem is that for 4/5 of the datasets, I am seeing that several genes are differentially expressed even after filtering low count data.
I applied some heavy-duty arbitrary non-specific filtering and it still did not help. In this filtration, I took the max number of reads for each sample (in columns) and sorted the genes (in rows) by their maxes from highest to lowest. Then, I removed the bottom 75% of the data. Additionally, from the remaining 25% of this data, I filtered out genes where the ranks (with rank #1 having the highest number of reads) of each sample were all under 1000. I did this secondary filter to remove genes with excessive numbers of reads. This secondary filter did not help at all.
Despite this filtering, I cannot seem to get a reasonable number of differentially expressed genes. I am getting too many genes as mentioned earlier. There are 2 biological replicates for two different conditions in each of my datasets. The 4 datasets that are causing a problem are comparing gene expression between 1) embryo and endosperm in arabidopsis seeds, 2) benign and malignant prostate cancer cell lines, 3) mouse embryonic stem cells and mouse embryonic fibroblasts, and 4) kidney and liver tissue in human. I downloaded these datasets from GEO / SRA in sra.lite format and converted them to fastq. They were all single-read data.
The RNA-Seq dataset that seems to be working is comparing gene expression in estrogen-treated and untreated breast cancer cell lines. I obtained about 500 to 1000 differentially expressed genes for adjusted pvalues between 0.01 and 0.1 by FDR (BH) for both packages. Also, for this dataset, only a bed file was available at GEO. So, I just wrote custom perl scripts to convert the bed file to a fasta file to input it into RSEM (I extracted 32 base reads from the start coordinates in the genome with the proper orientation). It is also important to note that I excluded reads that mapped to rRNA according to the bed file. The bedfile basically defined rRNA as a chromosome.
I thought it interesting to note that when I did not filter the low count data from the estrogen RNA-seq dataset, DESeq reported around 7000 genes for the pvalue range mentioned above whereas edgeR still reported around 500 to 1000 genes. In the unfiltered dataset there were ~27,000 genes.
I know this is a lot of information but if anyone has any suggestions or advice on how to deal with this issue, I would greatly appreciate it. My project is due Thursday, and I getting tied up in the preprocessing and not the analysis, which was the whole point of the project.
Thanks,
Clayton
Purdue University
Graduate Student
Comment