I have 101 bp Illumina GA IIx RNA-seq reads that display strange behavior in their FastQC "Per Base Sequence Content". I'm wondering if anyone knows what is causing this, or has seen similar behavior in their own data.
The FastQC "Per Base Sequence Content" metric measures the proportion of G, T, A, and C as a function of position along the read. For a random library, the %G, %T, %A, and %C lines should be roughly constant across all positions and should reflect the amount of these bases in the genome.
In the case of my 101 bp reads, the %G, %A, %T, and %C are roughly constant through most of the cycles. Then, as the quality drops toward the 3' ends of the reads, the %G and %A systematically rise while the %T and %C systematically drop.
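To make the metric concrete, here is a minimal sketch of the kind of per-position tally I understand it to be. This is not FastQC's actual code; the function name `per_base_content` and the choice to count N calls in the denominator are my own assumptions.

```python
from collections import Counter


def per_base_content(reads):
    """Return one dict per read position mapping each base to its percentage.

    `reads` is a list of sequence strings. Positions are 0-based, and reads
    may differ in length. Any non-GATC call (e.g. N) counts toward the
    denominator, which is an assumption of this sketch.
    """
    max_len = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(max_len)]

    # Tally the base observed at each position of every read.
    for read in reads:
        for pos, base in enumerate(read.upper()):
            counts[pos][base] += 1

    # Convert the tallies to percentages per position.
    content = []
    for c in counts:
        total = sum(c.values())
        content.append({b: 100.0 * c[b] / total for b in "GATC"})
    return content
```

Plotting the four percentages against position gives lines like those in the FastQC graph, flat for a random library and diverging wherever the composition shifts.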
This occurs for both of the lanes we ran, and for both reads of the paired-end libraries. It does not occur, however, in the PhiX lane (but the PhiX lane had better quality).
I attach the FastQC graph, and also the "Per base sequence quality" metric for comparison. (The odd behavior in the first 13 bases at the 5' end is, as far as I know, normal for a random-primed RNA-seq library. See Hansen, Brenner & Dudoit (2010), "Biases in Illumina transcriptome sequencing caused by random hexamer priming", http://nar.oxfordjournals.org/conten.../e131.abstract .)
Explanations for this odd behavior? Comments? Should I be trimming the 3' ends off where %G, %T, %A, and %C deviate from their constant values?
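To be specific about the trimming I have in mind, here is a rough sketch. It takes per-position base percentages (a list with one dict per 0-based position, mapping "G", "A", "T", "C" to a percentage), averages them over a baseline window, and reports the first later position where any base departs from that baseline by more than a tolerance. The function names, the window bounds, and the tolerance are all hypothetical choices of mine, and the sketch says nothing about whether trimming there is actually the right thing to do.

```python
def first_deviating_position(content, baseline_start, baseline_end, tolerance):
    """Return the first 0-based position at or after `baseline_end` where any
    base percentage differs from the baseline mean by more than `tolerance`
    percentage points, or None if no position does.

    The baseline is the mean over content[baseline_start:baseline_end].
    """
    bases = "GATC"
    window = content[baseline_start:baseline_end]
    baseline = {b: sum(p[b] for p in window) / len(window) for b in bases}

    for pos in range(baseline_end, len(content)):
        if any(abs(content[pos][b] - baseline[b]) > tolerance for b in bases):
            return pos
    return None


def trim_read(seq, qual, cut):
    """Keep only the first `cut` bases and quality characters of a read.

    If `cut` is None, the read is returned unchanged.
    """
    if cut is None:
        return seq, qual
    return seq[:cut], qual[:cut]
```

For my reads, the baseline window would start after the first 13 bases, so that the 5' priming bias is excluded from the average.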
Thanks for any help anyone can give! And thanks, too, to the FastQC developers for a very useful tool!