Nucleotide bias in RNASeq data (initial 12-13 bp)
Posted 11-18-2015, 07:30 PM, post #1
Senior Member, Location: Montreal, Join Date: May 2013, Posts: 367

It's a good question, and no one seems to have been able to come up with an entirely satisfactory answer.
Here is the answer from the Illumina FAQ, which states that twelve is "the length of two hexamers". That is not very helpful, since I can't see how two hexamers could be binding.
This document is no longer available on Illumina's website.
Luckily, the FAQ was archived on an older SEQanswers thread.

Q482. Why is GC high in the first few bases?
It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.
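If you want to see the uneven base composition the FAQ describes in your own data, a per-position tally of A/C/G/T over the first cycles is enough. This is only a minimal sketch: the function name `per_position_composition` and the toy reads are my own illustration, not part of any Illumina tool, and it assumes you have already loaded the read sequences as plain strings.

```python
from collections import Counter

def per_position_composition(reads, n_positions=20):
    """Fraction of A/C/G/T at each of the first n_positions read positions."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            if base in "ACGT":          # ignore N and other ambiguity codes
                counts[i][base] += 1
    composition = []
    for c in counts:
        total = sum(c.values())
        composition.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return composition

# Hypothetical toy reads, only to show the call; real input would be
# the sequence lines of a FASTQ file.
toy_reads = ["GGACTT", "GCACTA", "GGTCAA", "ACACTT"]
for pos, freqs in enumerate(per_position_composition(toy_reads, 6), start=1):
    print(pos, {b: round(f, 2) for b, f in freqs.items()})
```

With real mRNA-Seq reads, the frequencies should look uneven over roughly the first 12 positions and then flatten out, which is the same picture the per-base sequence content plot in a QC report gives you.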
The Hansen paper makes an attempt at answering your question more directly.
It is surprising that the pattern extends well beyond the hexamer primer, out to 13 bases. The length of the pattern could potentially be explained by a strong bias in the first 6 bases of the reads, coupled with dependencies between adjacent nucleotides in the transcriptome. Two observations contradict this explanation. First, the pattern in the nucleotide frequencies ends immediately upstream of the first base of the reads, indicating that the dependence between adjacent nucleotides in the transcriptome is weak (Figure 1a). Note that it is possible for a pattern to extend upstream of the reads, as seen with DNase I fragmentation (Figure 1c). Second, dinucleotide transition probabilities appear biased throughout all 13 initial bases (Supplementary Figure S5). The fact that the 5′ bias extends over 13 bases could be explained by the sequence specificity of the polymerase. Alternately, due to the end repair performed as part of the standard DNA sequencing protocol, the first sequenced base of a read may not be where the primer binds.
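The second observation in that passage is about dinucleotide transition probabilities at each of the initial positions. A rough way to look at the same quantity is to estimate, for every position i, the probability of the base at i+1 given the base at i. The sketch below is my own illustration under that reading, not the code used in the Hansen paper; the name `transition_probabilities` and the default of 13 positions are assumptions.

```python
from collections import Counter

def transition_probabilities(reads, n_positions=13):
    """For each position i, estimate P(base at i+1 | base at i) across reads."""
    pair_counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        read = read.upper()
        for i in range(min(n_positions, len(read) - 1)):
            a, b = read[i], read[i + 1]
            if a in "ACGT" and b in "ACGT":
                pair_counts[i][(a, b)] += 1
    tables = []
    for pc in pair_counts:
        table = {}
        for a in "ACGT":
            row_total = sum(pc[(a, b)] for b in "ACGT")
            if row_total:               # skip bases never seen at this position
                table[a] = {b: pc[(a, b)] / row_total for b in "ACGT"}
        tables.append(table)
    return tables
```

If the bias were confined to the hexamer, the tables for positions past the sixth base should look much like those from the middle of the reads. The paper reports that they do not.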
The author of this blog also makes a more amateurish attempt to explain the bias, but abandons his efforts in frustration.

So, none of the explanations are entirely satisfactory.
What is certain is that the overall results remain valid, despite this bias.
Otherwise, one would have to question the entire body of literature on RNA-Seq.
Trimming the bases is also clearly the wrong approach.

I suppose there might be material for another paper for anyone who can come up with a sound demonstration of why the bias extends all the way through the first 12 (or 13) bases.

Last edited by GenoMax; 11-19-2015 at 04:50 AM.
blancha
