Hey all, I am analyzing several RNA-Seq datasets and have noticed an odd pattern. In all of the samples I've QC'd so far, there is a bias or "shoulder" towards reads with low GC content. I am wondering what causes this, whether it's a problem, and if so, what I should do about it.
I am working with 3 different RNA-Seq datasets from two different organisms: Drosophila melanogaster and Aedes aegypti. These datasets were produced by 3 different research groups (including my own) and 3 different sequencing companies, so I doubt it's an error in sample prep or sequencing. The sequencing was quite deep. The only common factor I can find is that all 3 groups used Illumina TruSeq kits for library construction. I know for a fact that this bias is not caused by the "random" hexamer priming issue or by low sequence quality: slicing the 5' end off and filtering out low-quality reads has no effect on the GC bias "shoulder".
Just curious what causes this phenomenon. I'm attaching one of the more extreme examples from before and after basic QC. My guess is that this bias won't really affect differential expression calling (since it's the same for all samples), but it's still weirding me out a bit.
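In case anyone wants to reproduce the check outside of a QC tool, here is a minimal sketch of how the per-read GC distribution can be tallied, with an optional 5' clip to mimic the trimming test described above. It assumes an uncompressed FASTQ with exactly 4 lines per record; the function names (`gc_percent`, `fastq_sequences`, `gc_histogram`) and the `trim5` parameter are made up for illustration, and this is not necessarily how any particular QC program computes its plot.

```python
import io
from collections import Counter


def gc_percent(seq):
    """Return the GC content of one read as a percentage (non-ACGT bases ignored)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def fastq_sequences(handle, trim5=0):
    """Yield read sequences from a 4-line-per-record FASTQ handle.

    trim5 (hypothetical parameter) clips that many bases from the 5' end.
    """
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of every record is the sequence
            seq = line.strip()[trim5:]
            if seq:
                yield seq


def gc_histogram(seqs):
    """Count reads per rounded GC percentage."""
    return Counter(round(gc_percent(s)) for s in seqs)


# Tiny in-memory FASTQ for demonstration; replace with open("reads.fastq").
demo = io.StringIO(
    "@r1\nATATATAT\n+\nIIIIIIII\n"
    "@r2\nGCGCATAT\n+\nIIIIIIII\n"
    "@r3\nGGCCGGCC\n+\nIIIIIIII\n"
)
hist = gc_histogram(fastq_sequences(demo))
for pct in sorted(hist):
    print(pct, hist[pct])
```

Running this with and without `trim5` on the same file gives two histograms that can be compared directly; a low-GC shoulder that survives the clip would match what I am seeing.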