Dear all,
I have two RNA-Seq experiments and I am trying to understand the difference in their mapping rates. The experimental setups are quite different:
Experiment 1:
Samples: 19 primary tumors
RNA selection: TruSeq Stranded mRNA (PolyA)
Sequencing: Paired-end, 75 bp
Sequencer: NextSeq 500
Experiment 2:
Samples: 16 tumor cell lines
RNA selection: TruSeq Stranded Total RNA (Ribo-Zero)
Sequencing: Paired-end, 100 bp
Sequencer: HiSeq 4000
Below are the numbers of reads in each strand after preprocessing, along with the mapping statistics. Mapping was performed with TopHat v2.1.0 and Bowtie v2.2.5.0 against the hg19 reference genome. The average mapping rates are typical for recent experiments. Nevertheless, there is a difference of about 10 percentage points between the two experiments:
Experiment 1:
Mean number of reads after Trimmomatic (*_paired files): 67944508.54, SD=8215167.56, range=53282567-88949193
Overall read mapping rate (%): 97.71, SD=0.65, range=96-99
Concordant pair alignment rate (%): 94.41, SD=1.27, range=91.6-96.8
Experiment 2:
Mean number of reads after Trimmomatic (*_paired files): 90733572.69, SD=12822798.96, range=72880790-122816426
Overall read mapping rate (%): 87.12, SD=5.16, range=77.9-93.4
Concordant pair alignment rate (%): 82.74, SD=5.70, range=72.3-89.8
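For anyone who wants to reproduce this kind of summary on their own samples, here is a minimal sketch of how the mean, SD and range can be computed from per-sample rates. The function name `summarize` and the example values are hypothetical (they are not my real per-sample rates), and I am assuming the sample standard deviation (n-1 denominator) here.

```python
import statistics

def summarize(rates):
    """Return mean, sample SD (n-1 denominator) and (min, max) of per-sample rates in percent."""
    return {
        "mean": statistics.mean(rates),
        "sd": statistics.stdev(rates),  # assumption: sample SD, not population SD
        "range": (min(rates), max(rates)),
    }

# Hypothetical per-sample overall mapping rates (%), for illustration only
example_rates = [96.0, 97.5, 98.0, 99.0]
summary = summarize(example_rates)
print(summary)
```

The same function can be applied to the concordant pair alignment rates, one list per experiment.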
I was asked what causes this difference. I do not have a definite answer, only hypotheses. I would greatly appreciate your opinion on them:
1. Ribo-zero experiments commonly have a lower mapping rate because they capture non-coding elements that lie in less conserved (and therefore less well annotated) regions and/or in repeated regions.
I spent a couple of hours looking for mapping rates reported for similar recent experiments (published from 2014 onwards), without any success. Mapping rates no longer seem to be reported in publications, at least in those I checked.
2. Base quality is slightly lower in the Ribo-zero samples, especially around the 75th base (attached: one representative Ribo-zero sample and one representative PolyA sample after preprocessing). This could reduce the mapping rate.
Until now, I have used a base quality threshold of 20 for both experiments (HiSeq and NextSeq). I read yesterday that a threshold of 20 is too high for NextSeq reads. Can you confirm this? I am now trying a more stringent base quality threshold (20 -> 24) on the Ribo-zero samples during preprocessing.
3. Different sequencers. Has a higher mapping rate been reported for NextSeq experiments?
Finally, for 3 samples of the Ribo-zero experiment, I mapped only the first 75 bases of each read instead of the full 100 bases. This increased the mapping rate by 1 to 2.5%, which surprises me a little, since the mapping rate is supposed to increase with read length. On the other hand, there is this slight quality defect around 75 bp in these samples.
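To make the truncation test concrete, here is a minimal sketch of keeping only the first 75 bases of each read. It assumes plain 4-line FASTQ records (no wrapped sequences); the function name `truncate_fastq_records` is hypothetical and this is only an illustration of the idea, not necessarily the exact procedure used. Trimmomatic's CROP step does the same job.

```python
def truncate_fastq_records(lines, keep=75):
    """Yield FASTQ lines with sequence and quality strings cut to the first `keep` bases.

    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Lines 2 and 4 of each record (index 1 and 3 modulo 4) hold per-base data
        if i % 4 in (1, 3):
            line = line[:keep]
        yield line

# Tiny hypothetical record, truncated to 5 bases for readability
record = ["@read1", "ACGTACGT", "+", "IIIIIIII"]
for out_line in truncate_fastq_records(record, keep=5):
    print(out_line)
```

For paired-end data both mate files would need the same treatment, so that the pairs stay in sync.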
I would really appreciate any feedback on mapping rates with both RNA selections and sequencers, any links to similar analyses, or any tests to try.
Thank you in advance,
Jane