I originally posted this question on biostars but received little response. I'll be sure to update either post if I receive any more detail.
Until recently, we have used a poly(A) selection process to prepare our RNA-Seq libraries. In our last run we had to use a ribo-depletion approach instead, as we want to study some formalin-fixed (FF) material with degraded RNA. The facility uses Illumina's Ribo-Zero kit. We otherwise kept the same sequencing parameters: paired-end 75 bp, reverse stranded, on an Illumina HiSeq 4000.
Since we don't know how well the FF material represents the original tissue, we also sequenced a few frozen tissue samples, with the intention of comparing the two (though they are _not_ perfectly matched). In total we have 3 FF samples and 2 frozen samples.
Short version:
Both the frozen tissue and the FF results show a low proportion of reads being assigned to an exon: ~25% for FF and ~60% for frozen samples, which I did not expect. Is this an issue, and can I still compare the two after normalising for the different effective library sizes?
More detail:
I ran the reads through my usual pipeline:
- FastQC: everything looked OK. There were some highly duplicated sequences, probably rRNA-associated, but nothing too major.
- STAR alignment: ~90% of reads were uniquely mapped in all cases (similar to our poly(A) samples).
- Gene counts: I had STAR produce gene counts during alignment. The % of reads assigned to a (unique) gene differed from what I've typically seen in the poly(A) data:
  - Poly(A): we usually get 80-85%
  - Ribo-depleted FF samples: 24%, 24%, 26%
  - Ribo-depleted frozen samples: 58%, 59%
So in both cases the numbers assigned are far lower than for poly(A), and this is especially bad for the FF samples. Most of the reads that were not assigned belonged in the 'no feature' category, i.e. they didn't overlap with any exon.
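For anyone who wants to reproduce the percentages, here is a minimal sketch of the kind of calculation involved. It assumes the counts come from STAR's `ReadsPerGene.out.tab` (written when running with `--quantMode GeneCounts`), that the fourth column is the right one for reverse-stranded libraries, and that the denominator is the uniquely mapped reads (gene-assigned + no feature + ambiguous). The function name and the toy numbers are made up for illustration; they are not my real data.

```python
def gene_count_summary(lines, column=3):
    """Summarise the lines of a STAR ReadsPerGene.out.tab file.

    `column` is the 0-based field to read: 1 = unstranded,
    2 = forward stranded, 3 = reverse stranded (assumed here).
    Fractions are relative to uniquely mapped reads (an assumption).
    """
    special = {}   # the N_* summary rows at the top of the file
    assigned = 0   # reads counted towards some gene
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # skip blank or malformed lines
        name, count = fields[0], int(fields[column])
        if name.startswith("N_"):
            special[name] = count
        else:
            assigned += count

    no_feature = special.get("N_noFeature", 0)
    ambiguous = special.get("N_ambiguous", 0)
    unique_total = assigned + no_feature + ambiguous
    if unique_total == 0:
        raise ValueError("no uniquely mapped reads found")
    return {
        "assigned": assigned / unique_total,
        "no_feature": no_feature / unique_total,
        "ambiguous": ambiguous / unique_total,
    }


# Toy input in the ReadsPerGene.out.tab layout (made-up numbers).
toy = [
    "N_unmapped\t100\t100\t100",
    "N_multimapping\t50\t50\t50",
    "N_noFeature\t10\t700\t60",
    "N_ambiguous\t5\t0\t10",
    "geneA\t20\t5\t20",
    "geneB\t15\t0\t10",
]
summary = gene_count_summary(toy)
print(summary)
```

On a real file the same function can be given an open file handle, e.g. `with open("sample_ReadsPerGene.out.tab") as handle: gene_count_summary(handle)` (the path is a placeholder).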
It occurs to me that this difference is probably due to the larger variety of RNA species: poly(A) should enrich primarily for mRNA, while ribo-depletion leaves in ncRNA species, etc. Therefore fewer reads will be mRNA and fall within an exon for gene counting purposes. I ran ezBAMqc to check the distribution of the aligned reads in the BAM files:
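[Figure: ezBAMqc read distribution, FF sample]
[Figure: ezBAMqc read distribution, frozen sample]
I dug out a similar plot for one of our poly(A) samples (below). The proportion of intronic reads is indeed much lower there.
[Figure: ezBAMqc read distribution, poly(A) sample]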
My hypothesis: the FF library is dominated by species other than mRNA.
Does this sound like a reasonable explanation?
Is the very low proportion of exon-assigned counts a problem (other than being wasteful)?
Is it still reasonable to compare the gene counts of the FF and frozen samples? I would normalise for the total number of reads, but is that sufficient?
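To make the normalisation question concrete, this is roughly what I mean: a minimal sketch of counts-per-million scaling, where the library size is the number of reads assigned to genes rather than the total number sequenced. The function name and the counts are hypothetical. It only rescales each sample by a single factor, so it does nothing about differences in library composition, which is the part I'm unsure about.

```python
def counts_per_million(counts, library_size=None):
    """Scale raw gene counts to counts per million.

    If `library_size` is not given, the sum of the gene counts is used,
    i.e. the "effective" library size of reads assigned to genes.
    """
    if library_size is None:
        library_size = sum(counts.values())
    if library_size <= 0:
        raise ValueError("library size must be positive")
    return {gene: count * 1e6 / library_size for gene, count in counts.items()}


# Made-up counts for one FF and one frozen sample.
ff_counts = {"geneA": 25, "geneB": 75}
frozen_counts = {"geneA": 300, "geneB": 300}

ff_cpm = counts_per_million(ff_counts)
frozen_cpm = counts_per_million(frozen_counts)
print(ff_cpm["geneA"], frozen_cpm["geneA"])
```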
Thanks for any thoughts.