SEQanswers





Old 11-18-2015, 04:11 PM   #1
acdan
Junior Member
 
Location: Asia

Join Date: Nov 2015
Posts: 8
Default Base distribution plot split issue

Hi, all. I am new to Illumina machines.
We have a HiSeq 2500 and did PE125 runs, but the results look odd to me.
Please see the uploaded images. The base distribution plots split at the very beginning of read 1 and in the late cycles of read 2 (technically, read 3). Is this caused by the libraries themselves, by operator error, or by a bad reagent lot? How can we fix it?
Attached Images
File Type: png 17_genomic_DNA.base.png (31.4 KB, 27 views)
File Type: jpg CatchC92A(11-19-08-47-13).jpg (75.2 KB, 23 views)

Last edited by acdan; 11-18-2015 at 04:24 PM.
Old 11-18-2015, 04:20 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Edit your post and go to "manage attachments" - the images did not show up.

But anyway - base composition divergence near the end can mean a couple things, either adapter contamination or a problem with the sequencer. Anomalies at the beginning of the read are often due to nonrandom shearing.

I suggest you try adapter-trimming and/or examining the insert size distribution to see if this is caused by adapter sequence.
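
For a quick first look before running a real trimmer, here is a minimal sketch that counts how many reads contain an adapter and where it starts. It assumes TruSeq-style adapters (read-through beginning with AGATCGGAAGAGC), exact matching only, and reads already loaded as plain strings. The function names are made up for illustration and are not from any tool.

```python
# Assumption: TruSeq-style adapters. Substitute the sequence for your own kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_start_positions(reads, adapter=ADAPTER_PREFIX):
    """Return the 0-based start of the first exact adapter match in each read.

    Reads without a match get None. Exact matching misses adapters that
    contain sequencing errors or that are cut off at the end of the read.
    """
    positions = []
    for seq in reads:
        idx = seq.upper().find(adapter)
        positions.append(idx if idx >= 0 else None)
    return positions


def adapter_fraction(reads, adapter=ADAPTER_PREFIX):
    """Fraction of reads that contain the adapter at least once."""
    if not reads:
        return 0.0
    hits = [p for p in adapter_start_positions(reads, adapter) if p is not None]
    return len(hits) / len(reads)
```

If a large fraction of reads hit, and the hits start well before the last cycle, the inserts are probably shorter than the read length, which would fit a composition shift in the late cycles.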
Old 11-18-2015, 04:27 PM   #3
acdan
Junior Member
 
Location: Asia

Join Date: Nov 2015
Posts: 8
Default

Thank you very much! I am not yet familiar with the tools.
The images are showing now.
Old 11-18-2015, 04:51 PM   #4
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

If I had a penny for every time someone asked this question, I'd be rich.

This pattern is caused by not-so-random hexamer priming.
It is normal and expected.
The first bases are biased towards sequences that prime more efficiently.

Do not trim off the first 13 bases.
You will just be cutting off good quality bases.

Every single one of my runs for the past 4 years has had this bias.

You can find more details about this bias in this widely cited article.
Note that they do propose a correction that no one that I know uses.

Biases in Illumina transcriptome sequencing caused by random hexamer priming
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/

This bias should really be documented more clearly by Illumina, to avoid people wasting too much time searching for the cause of a very well-known bias.
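
To see the bias in your own data, the per-cycle base composition can be computed in a few lines. This is a rough sketch, not how any particular QC tool does it: `base_composition` is a hypothetical helper, and the reads are assumed to be in memory as plain strings.

```python
from collections import Counter


def base_composition(reads, n_cycles=None):
    """Return one dict per cycle mapping each base to its fraction at that cycle.

    n_cycles defaults to the length of the longest read. Bases outside
    A, C, G, T, N still count towards the total of a cycle.
    """
    if n_cycles is None:
        n_cycles = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(n_cycles)]
    for seq in reads:
        for i, base in enumerate(seq[:n_cycles].upper()):
            counts[i][base] += 1
    composition = []
    for cycle_counts in counts:
        total = sum(cycle_counts.values())
        composition.append(
            {b: (cycle_counts[b] / total if total else 0.0) for b in "ACGTN"}
        )
    return composition
```

With random-hexamer-primed libraries, the fractions over roughly the first 13 cycles would be expected to wobble and then settle to flat lines.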
Old 11-18-2015, 05:06 PM   #5
acdan
Junior Member
 
Location: Asia

Join Date: Nov 2015
Posts: 8
Default

You'll be rich, dear. Thank you for the reference.
And in your experience, is the slight split at the late cycles of read 2, shown in the first image, usual?
Old 11-18-2015, 05:09 PM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

Short answer: yes. It will be library/sample dependent.

In addition to random priming, the Nextera transposase also shows a similar bias.

Last edited by GenoMax; 11-18-2015 at 05:30 PM.
Old 11-18-2015, 05:47 PM   #7
acdan
Junior Member
 
Location: Asia

Join Date: Nov 2015
Posts: 8
Default

There is one thing that still confuses me. Why does priming with random 6-mers produce a bias over 13 bases?
Old 11-18-2015, 07:30 PM   #8
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

It's a good question, and no one seems to have been able to come up with an entirely satisfactory answer.
Here is the answer from the Illumina FAQ, which says that the twelve biased bases are "the length of two hexamers". That is not very helpful, since I can't see how two hexamers could be binding.
This document is no longer available on Illumina's website.
Luckily, the FAQ was archived on an older seqanswers thread.

Quote:
Q482. Why is GC high in the first few bases?
It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.
http://seqanswers.com/forums/showthread.php?t=11843
The Hansen paper makes an attempt at answering your question more directly.
Quote:
It is surprising that the pattern extends well beyond the hexamer primer, out to 13 bases. The length of the pattern could potentially be explained by a strong bias in the first 6 bases of the reads, coupled with dependencies between adjacent nucleotides in the transcriptome. Two observations contradict this explanation. First, the pattern in the nucleotide frequencies ends immediately upstream of the first base of the reads, indicating that the dependence between adjacent nucleotides in the transcriptome is weak (Figure 1a). Note that it is possible for a pattern to extend upstream of the reads, as seen with DNase I fragmentation (Figure 1c). Second, dinucleotide transition probabilities appear biased throughout all 13 initial bases (Supplementary Figure S5). The fact that the 5′ bias extends over 13 bases could be explained by the sequence specificity of the polymerase. Alternately, due to the end repair performed as part of the standard DNA sequencing protocol, the first sequenced base of a read may not be where the primer binds.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/
The author of this blog also makes a more amateurish attempt to explain the bias more clearly, but abandons his efforts in frustration.
http://blog.malde.org/posts/illumina-RNA-bias.html

So, none of the explanations are entirely satisfactory.
What is certain is that the overall results remain valid, despite this bias.
Otherwise, one would have to question the entire body of literature on RNA-Seq.
Trimming the bases is also clearly the wrong approach.

I suppose there might be material for another paper for anyone who can come up with a sound demonstration of why the bias extends all the way across the first 12 (or 13) bases.

Last edited by blancha; 11-19-2015 at 05:57 AM.
Old 11-19-2015, 04:35 AM   #9
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,147
Default

blancha's post needs to be made a sticky, and every time there is a new post with "RNA-Seq bias" anywhere in the text someone can simply post a reply with a link to it.
Old 11-19-2015, 04:45 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

Quote:
Originally Posted by kmcarr View Post
blancha's post needs to be made a sticky, and every time there is a new post with "RNA-Seq bias" anywhere in the text someone can simply post a reply with a link to it.
Done - "Illumina/solexa" sub-forum.

@blancha: I picked a title to describe your post when I made it sticky. If you want alternate wording then send me a PM (or you may be able to edit it yourself).

Last edited by GenoMax; 11-19-2015 at 04:51 AM.
Old 11-19-2015, 05:28 PM   #11
acdan
Junior Member
 
Location: Asia

Join Date: Nov 2015
Posts: 8
Default

Has anyone counted how hexamer sequences are distributed in the transcriptome?
My guess is that a sequence such as AAAAAA occurs much more often than others, so it exhausts the "TTTTTT" hexamer primer the fastest, and extension is then blocked by other "TTTTTT" primers bound nearby. Such cDNA would be much shorter than the rest and would get lost in the following steps.
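
One way to check this on real data is simply to count hexamers across the transcript sequences. A minimal sketch, assuming the transcripts are already loaded as plain strings; the function names are hypothetical, and windows containing anything other than A, C, G, T are skipped.

```python
from collections import Counter


def kmer_counts(sequences, k=6):
    """Count every overlapping k-mer (hexamers by default) in the sequences."""
    valid = set("ACGT")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows with N or other ambiguity codes.
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts


def kmer_fractions(sequences, k=6):
    """Convert k-mer counts into fractions of all counted windows."""
    counts = kmer_counts(sequences, k)
    total = sum(counts.values())
    if not total:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Comparing the fraction for AAAAAA against 1/4096 (the value if all hexamers were equally frequent) would show how over-represented it is in a given transcriptome.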
Old 12-15-2017, 10:23 AM   #12
Andres_Ribone
Junior Member
 
Location: Argentina

Join Date: Apr 2017
Posts: 3
Default

Hi, just to be sure:

It is NOT necessary to clip the first 13 bases when doing de novo transcriptome assembly either?
Old 12-15-2017, 10:30 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,766
Default

Quote:
Originally Posted by Andres_Ribone View Post
Hi, just to be sure:

It is NOT necessary to clip the first 13 bases when doing de novo transcriptome assembly either?
Very likely no. See this blog post for more.
Old 12-15-2017, 12:45 PM   #14
Andres_Ribone
Junior Member
 
Location: Argentina

Join Date: Apr 2017
Posts: 3
Default

Hi! Thanks for the quick answer!
I checked the link, but it doesn't state explicitly whether clipping is necessary for de novo transcriptome assembly, nor could I find any paper that states it.

Right now I'm comparing clipped and unclipped versions of my data, but of course that alone won't be enough for a good answer.
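
For anyone wanting to run the same comparison, the clipped input can be produced with something like the sketch below. It assumes records are (name, sequence, quality) tuples already parsed from FASTQ. `clip_5prime` and `clip_records` are hypothetical helpers, and the default of 13 is just the bias length discussed in this thread.

```python
def clip_5prime(seq, qual, n=13):
    """Drop the first n bases and the matching quality values."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n:], qual[n:]


def clip_records(records, n=13):
    """Yield clipped (name, seq, qual) tuples, skipping reads left empty."""
    for name, seq, qual in records:
        clipped_seq, clipped_qual = clip_5prime(seq, qual, n)
        if clipped_seq:
            yield name, clipped_seq, clipped_qual
```

For paired-end data, a read dropped from one file would also need its mate dropped from the other, which this sketch does not handle.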

Have a nice day!

Tags
illumina hiseq
