I am doing an RNA-seq differential expression analysis, and am trying to use the cqn package with EdgeR to correct for GC content effects on estimation of expression. I have a few questions about using this package:
1. How should the GC content for each gene be calculated (in concept)? I guess the simplest way would be to calculate it for the exonic regions of the gene. However, this seems to me to be flawed, as different transcripts from the same gene may have different GC contents, and also different expression levels. Say for example that you have a gene with three exons and two transcripts, one (T1) with the first two exons and one (T2) with the second two. Say the GC content taking into account all the exons is 50%. Now say T2 is the predominant isoform, and hardly any T1 is expressed. Say the GC content for T2 is 60% - if you base the correction factor on all the exons in the gene your correction for the expression estimate of this gene is going to be way out. But considering transcript-level expression when calculating the correction factor seems like it would be very complicated, and somewhat circular, as the correction factor is needed to accurately estimate expression...
2. How should the GC content for each gene be calculated (in practice)? I have read this thread (https://support.bioconductor.org/p/58846/), but I can't see how to get the method suggested there to work in my case - half of my genes get excluded from the txdb due to "bad strand information" (I'm using a gtf file produced by Cufflinks and all the novel single exon genes predicted by Cufflinks have no strand information).
3. Same as 1 above but for gene lengths - transcript length may vary a lot between different isoforms of the same gene, leading to the same problem as for GC content.
4. Same as 2 above but for gene lengths.
5. Do people commonly do RNA-seq analysis in EdgeR without this correction? I'm not convinced it will make that much difference, and seems very complicated to implement!
1. How should the GC content for each gene be calculated (in concept)? I guess the simplest way would be to calculate it for the exonic regions of the gene. However, this seems to me to be flawed, as different transcripts from the same gene may have different GC contents, and also different expression levels. Say for example that you have a gene with three exons and two transcripts, one (T1) with the first two exons and one (T2) with the second two. Say the GC content taking into account all the exons is 50%. Now say T2 is the predominant isoform, and hardly any T1 is expressed. Say the GC content for T2 is 60% - if you base the correction factor on all the exons in the gene your correction for the expression estimate of this gene is going to be way out. But considering transcript-level expression when calculating the correction factor seems like it would be very complicated, and somewhat circular, as the correction factor is needed to accurately estimate expression...
2. How should the GC content for each gene be calculated (in practice)? I have read this thread (https://support.bioconductor.org/p/58846/), but I can't see how to get the method suggested there to work in my case - half of my genes get excluded from the txdb due to "bad strand information" (I'm using a gtf file produced by Cufflinks and all the novel single exon genes predicted by Cufflinks have no strand information).
3. Same as 1 above but for gene lengths - transcript length may vary a lot between different isoforms of the same gene, leading to the same problem as for GC content.
4. Same as 2 above but for gene lengths.
5. Do people commonly do RNA-seq analysis in EdgeR without this correction? I'm not convinced it will make that much difference, and seems very complicated to implement!