I have been trying to process and analyze several RNA-seq data sets, but am having trouble with the mapping process. The data are from total RNA (not just mRNA) because we are interested in looking at non-coding RNAs in these samples. I've noticed that many of the tools out there, as well as a majority of the published analyses, are biased towards investigation of mRNA levels and/or differential expression.
I am using the Galaxy platform to process and analyze these data, but surprisingly few of the reads are being mapped to my reference genome. For example, I have approximately 48 million reads in one sample, of which only ~500,000 are mapped by Bowtie or BWA. Looking at the boxplots of the read quality statistics, only the first 3 bases of the reads have "low" scores; the rest are in the high 30s, with some as high as 41 (these are from Illumina sequencing with CASAVA 1.8, so the scores are back on the original Sanger quality scale).
I thought that such high quality scores at nearly every position would allow a majority of the reads to be mapped. I trimmed the reads, removing the first 3 bases, and then ran the alignments with those two tools. I used their default parameters, which I believe allow up to two base mismatches. I am not sure what to look at in the data to determine the cause of the low mapping rate. I'm using a "custom" reference genome, the most current version of the C. elegans genome (WS231), because it is not provided in Galaxy. I'm looking for suggestions on how to troubleshoot this problem, and possibly some links/references to help me figure out how to alter the default parameters in Bowtie or BWA (if they are causing my problem). As it is, having only ~1% of the data mapped doesn't allow me to do any analysis.
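In case it helps anyone reproduce the preprocessing, here is a minimal Python sketch of what the trimming step and the quality-scale check amount to. It is not what Galaxy runs internally; the function names are made up for illustration, and it assumes plain 4-line FASTQ records with Sanger (Phred+33) qualities.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_5prime(seq, qual, n=3):
    """Drop the first n bases and their quality characters (hypothetical helper)."""
    return seq[n:], qual[n:]


def phred33_scores(qual):
    """Decode a quality string, assuming the Sanger / Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]


def looks_like_phred33(quals):
    """Characters below ASCII 59 (';') cannot occur in the +64 encodings."""
    return any(min(map(ord, q)) < 59 for q in quals if q)


# Toy record: three low-quality leading bases followed by high-quality bases.
example = io.StringIO("@read1\nNNNACGTACGT\n+\n###IIIIJJJJ\n")
records = list(read_fastq(example))
header, seq, qual = records[0]
trimmed_seq, trimmed_qual = trim_5prime(seq, qual, n=3)
scores = phred33_scores(trimmed_qual)
print(trimmed_seq, scores)
```

On the toy record, the three leading `#` characters decode to a score of 2, and the remaining bases decode to 40 and 41, which matches the pattern I see in the boxplots.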