  • Nucleotide bias in RNASeq data (initial 12-13 bp)

    It's a good question, and no one seems to have been able to come up with an entirely satisfactory answer.
    Here is the answer from the Illumina FAQ, which states that twelve is "the length of two hexamers". That is not very helpful, since I can't see how two hexamers could be binding.
    This document is no longer available on Illumina's website.
    Luckily, the FAQ was archived on an older seqanswers thread.

    Q482. Why is GC high in the first few bases?
    It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.
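
If you want to look at this pattern in your own reads rather than in the IVC plots, a rough sketch along these lines tabulates the base composition at each of the first few read positions. The function name `per_position_base_freqs` and the toy reads are made up for illustration, and the sketch assumes the reads are already loaded as plain strings (FASTQ parsing is left out).

```python
from collections import Counter

def per_position_base_freqs(reads, n_positions=20):
    """Fraction of A/C/G/T at each of the first n_positions read positions."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            if base in "ACGT":  # ignore N and other ambiguity codes
                counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs

# Toy input (hypothetical reads, not real data)
reads = ["ACGT", "ACGA", "TCGA", "AGGA"]
freqs = per_position_base_freqs(reads, n_positions=4)
for pos, f in enumerate(freqs, start=1):
    print(pos, f)
```

With real mRNA-Seq reads one would expect the fractions to fluctuate over roughly the first 12 positions and then flatten out, which is the effect described in the FAQ.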
    The Hansen paper makes an attempt at answering your question more directly.
    It is surprising that the pattern extends well beyond the hexamer primer, out to 13 bases. The length of the pattern could potentially be explained by a strong bias in the first 6 bases of the reads, coupled with dependencies between adjacent nucleotides in the transcriptome. Two observations contradict this explanation. First, the pattern in the nucleotide frequencies ends immediately upstream of the first base of the reads, indicating that the dependence between adjacent nucleotides in the transcriptome is weak (Figure 1a). Note that it is possible for a pattern to extend upstream of the reads, as seen with DNase I fragmentation (Figure 1c). Second, dinucleotide transition probabilities appear biased throughout all 13 initial bases (Supplementary Figure S5). The fact that the 5′ bias extends over 13 bases could be explained by the sequence specificity of the polymerase. Alternately, due to the end repair performed as part of the standard DNA sequencing protocol, the first sequenced base of a read may not be where the primer binds.
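
The "dinucleotide transition probabilities" mentioned in that passage can be estimated per position from the reads themselves. The following is only a sketch of the general idea, not the code used in the Hansen paper; `transition_probs` and the toy reads are hypothetical names and data.

```python
from collections import defaultdict

def transition_probs(reads, position):
    """Estimate P(base at position+1 | base at position), 0-based positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        read = read.upper()
        if len(read) > position + 1:
            a, b = read[position], read[position + 1]
            if a in "ACGT" and b in "ACGT":
                counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        probs[a] = {b: nxt.get(b, 0) / total for b in "ACGT"}
    return probs

# Toy input (hypothetical reads, not real data)
reads = ["ACGT", "ACGA", "TCGA", "AGGA"]
probs = transition_probs(reads, position=0)
print(probs)
```

Comparing these tables for positions inside the first 13 bases against positions further into the read is one way to see whether the transitions themselves are biased, as opposed to only the single-base frequencies.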
    The author of this blog also makes a more amateurish attempt to explain the bias more clearly, but abandons his efforts in frustration.


    So, none of the explanations are entirely satisfactory.
    What is certain is that the overall results remain valid, despite this bias.
    Otherwise, one would have to question the entire body of literature on RNA-Seq.
    Trimming the bases is also clearly the wrong approach.

    I suppose there might be material for another paper for anyone who can come up with a sound demonstration of why the bias extends all the way across the first 12 (or 13) bases.
    Last edited by GenoMax; 11-19-2015, 05:50 AM.
