SEQanswers

Old 07-11-2011, 08:17 AM   #1
lionelguy
Junior Member
 
Location: Sweden

Join Date: Apr 2009
Posts: 8
Default Periodic variation in nucleotide distribution along the read & other strange things

Hi,

I received the results from a run from our sequence provider. We sequenced 11 bacterial samples (whole-genome extract from a ~55% GC bacteria) on a HiSeq2000, with short-paired ends, 100nt each.

Since the DNA that we provided is whole-genome extracts, double-stranded, looking at the nucleotide called per cycle (x-axis), we should see four straight, horizontal lines, with A exactly over T and G exactly over C, but that is not what we see (see attached pdf, where there is one page for each end and one plot for each sample):
  • There is some apparently random variation over the first 7-10 bases, which could possibly be explained by some remaining tags.
  • Over the first 20-40 bases, the G+C increases and the A+T decreases by several percent. As far as I can tell, this can't be explained by trailing adapters.
  • Once the G+C has stabilized, there is still (in some samples, not in all) a significant difference between the numbers of As and Ts on one side, and of Gs and Cs on the other. The effect is stronger between T and A. For example in sample 6, read 2, it reaches a 2.5% difference, averaged over bases 41-60. There has to be some technical issue here, either with the chemistry or with the base calling: since our DNA is double-stranded to start with, it necessarily has G=C and T=A.
  • There is also a significant periodic variation with a 3-cycle period. The differences are strong, reaching up to 1.3% between points 1 and 2 of the period in the C count. I have already seen such a periodic variation in Illumina runs, but this is much stronger than anything I have seen previously. These are also necessarily technical errors, and cannot come from the DNA.

Any thoughts about that? I already contacted our service provider, but I wanted to have the opinion of the community about these points... Thanks for your help!
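
For anyone who wants to reproduce this kind of plot on their own data, here is a minimal sketch of the tally, not the script behind the attached PDF. The function names (`per_cycle_fractions`, `strand_asymmetry`, `fold_by_period`) are made up for illustration, and `reads` is assumed to be a list of read sequences already extracted from the FASTQ file.

```python
from collections import Counter

def per_cycle_fractions(reads):
    """Return one dict per cycle mapping A/C/G/T to its fraction of called bases."""
    reads = list(reads)
    n_cycles = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(n_cycles)]
    for read in reads:
        for cycle, base in enumerate(read.upper()):
            if base in "ACGT":  # skip N and other ambiguity codes
                counts[cycle][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return fractions

def strand_asymmetry(fractions):
    """Per-cycle (A - T, G - C); both should be close to 0 for double-stranded input."""
    return [(f["A"] - f["T"], f["G"] - f["C"]) for f in fractions]

def fold_by_period(values, period=3, skip=0):
    """Average a per-cycle series by phase (cycle index modulo `period`).

    `skip` drops the first cycles, e.g. the unstable start of the read.
    """
    sums = [0.0] * period
    n = [0] * period
    for i, v in enumerate(values):
        if i < skip:
            continue
        sums[i % period] += v
        n[i % period] += 1
    return [s / k if k else 0.0 for s, k in zip(sums, n)]
```

Folding the per-cycle C fraction with `fold_by_period([f["C"] for f in fractions], skip=40)` gives three phase averages; a spread between them of the order of a percent would correspond to the periodic variation described above.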
Attached Files
File Type: pdf stats_ends.pdf (89.0 KB, 151 views)
Old 07-11-2011, 09:54 AM   #2
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

These observations are very interesting. Nonetheless, I would not be so surprised to see all these.

1) Sometimes the first ~10 cycles are not stable. Do your reads have lower base quality there?

2&3) Unless things have changed, Illumina has this notorious cross-talk between A/C signals and G/T signals, and the level of cross-talk increases with increasing cycles. Illumina definitely has biases in the GC content before correcting errors.

4) The start positions of reads are not random, according to some work on de novo assembly. If this is true, it must be correlated with the sequence context. You are sequencing bacteria, whose genomes are mostly coding regions and thus have the period-3 effect (e.g. the GC content at the 3 codon phases may be different). This may lead to the period in your sequence data. Just a guess.
Old 07-11-2011, 10:55 PM   #3
lionelguy
Junior Member
 
Location: Sweden

Join Date: Apr 2009
Posts: 8
Default

Thanks for your help! Still, I think this run is not "normal" and something fishy happened...

1) The first 10 bases or so have a somewhat lower (~34) quality than the next 20 (40). But the difference in quality cannot account for the whole thing. I would expect qualities <20 there.

2&3) Here there is no increase along the read, but the cross-talk could indeed explain part of the difference. But a 2.5% difference between A and T seems like an awful lot of cross-talk: 1 in every 40 bases must be wrong (i.e. quality ~16), and this is not reflected in the quality values, which are sky-high there (~35).
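
The arithmetic behind that comparison is the standard Phred relation Q = -10 log10(p). A small sketch, with function names chosen here purely for illustration:

```python
import math

def error_prob_to_phred(p):
    """Phred quality score corresponding to an error probability p."""
    return -10 * math.log10(p)

def phred_to_error_prob(q):
    """Expected error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

# One wrong base in 40 corresponds to a quality of about 16.
print(round(error_prob_to_phred(1 / 40), 2))   # 16.02
# A reported quality of 35 claims roughly 3 errors per 10,000 bases.
print(phred_to_error_prob(35))
```

So a reported quality around 35 implies an error rate close to two orders of magnitude lower than the 1 in 40 that the A/T imbalance would require.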

4) I strongly doubt that codon usage could explain such a regularity. Even though a lot of sequences are coding regions in bacteria, I don't see how both direct and complementary strand would be read in the same frame, and this for all the reads... Seems more like a technical thing to me. Is there any step (washing, recalibrating, etc...) that occurs every third cycle in Illumina sequencing?
Old 07-12-2011, 04:23 AM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I think you also need to do the plot for some public hiseq data sets from SRA to make sure this is not a failed run. Quality alone is not always telling.
Old 07-12-2011, 04:39 AM   #5
lionelguy
Junior Member
 
Location: Sweden

Join Date: Apr 2009
Posts: 8
Default

I did that for some data from BGI: same machine, same design. The results are quite different. The periodic variation can be observed, but to a much lesser extent, and so can the instability over the first 10-20 nucleotides.

In contrast, the deviation from G=C and A=T is not observed, except at the very end of the read (and not to the same extent as in my samples).

And I agree, quality alone is not telling, but then the error models should be adapted, because if I have a Q value of 40 and I should expect 100 times more errors, I'd rather know it...

So you suspect a failed run in my case? And what about the periodic variation, which seems to be present in the BGI run as well? Any idea? I searched SeqAnswers and Google extensively, but as far as I can tell no one has observed that phenomenon...
Old 07-12-2011, 04:40 AM   #6
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

You probably have a few different things going on here. But the stuff going on at the beginning of your reads is probably adapter dimer sequence. You can probably read the peaks off by eye and you will likely see the sequence of whatever adapter was used. I was able to do this using a similar analysis on some SMART cDNAs we decided to throw on the instrument:

http://seqanswers.com/forums/showthr...0840#post40840

My guess is that the later periodicity would disappear if you removed lower quality bases (below 20 or so) from your data set. (Or at least your plot). At very low quality values the base calls are little more than "guesses". So I think that periodicity is just some bias towards certain bases in even vs. odd cycles.
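
One way to try this, as a rough sketch: mask every base below the quality cutoff as N before tallying the composition. The function name `mask_low_quality` is hypothetical, and the default `offset=33` assumes Sanger / Illumina 1.8+ quality encoding; older Illumina pipelines wrote Phred+64, in which case pass `offset=64`.

```python
def mask_low_quality(seq, qual, min_q=20, offset=33):
    """Replace every base whose Phred score is below min_q with 'N'.

    seq  : read sequence (FASTQ line 2)
    qual : quality string of the same length (FASTQ line 4)
    """
    return "".join(
        base if ord(q) - offset >= min_q else "N"
        for base, q in zip(seq, qual)
    )

# 'I' encodes Q40, '5' encodes Q20 and '#' encodes Q2 with offset 33.
print(mask_low_quality("ACGT", "I5#I"))            # ACNT
print(mask_low_quality("ACGT", "I5#I", min_q=30))  # ANNT
```

The masked reads can then go through any per-cycle tally that ignores N, so low-quality calls no longer contribute to the plot.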

--
Phillip
Old 07-12-2011, 05:17 AM   #7
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

My guess is that the starting coordinate of a read/pair is context dependent; for example, reads may preferentially start at certain oligonucleotides. Nonetheless, if you still see a period of 3 in human data, my hypothesis must be wrong.

The deviation from G=C and A=T in BGI data is expected: cross-talk is mainly an issue at the end of a read. I heard that a recent Illumina pipeline gives lower quality for the first ~10 cycles, so they are aware of the issue. But in general, it is hard to get a scoring model that works well under every artifact. All the HiSeq data I have seen have very low error rates and presumably do not have this G=C and A=T deviation.
Old 07-12-2011, 05:47 AM   #8
lionelguy
Junior Member
 
Location: Sweden

Join Date: Apr 2009
Posts: 8
Default

Quote:
Originally Posted by pmiguel View Post
My guess is that the later periodicity would disappear if you removed lower quality bases (below 20 or so) from your data set. (Or at least your plot). At very low quality values the base calls are little more than "guesses".
I will definitely try that.

Quote:
Originally Posted by pmiguel View Post
So I think that periodicity is just some bias towards certain bases in even vs. odd cycles.
Except that the period is 3, not 2, so it can't be even vs. odd...

I will also test it with data from eukaryotes, to see if the periodicity disappears.
Thanks!
Old 07-12-2011, 06:39 AM   #9
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by lionelguy View Post
I will definitely try that.
Please post your results, if you would...

Quote:
Originally Posted by lionelguy View Post
Except that the period is 3, not 2, so it can't be even vs. odd...

I will also test it with data from eukaryotes, to see if the periodicity disappears.
Thanks!
The chance that you're picking up a biological periodicity seems very low to me. It would require:
(1) That such a periodicity exists.
(2) That the amplicon construction process is, to some extent, period specific as well. That is, that the fragmentation/ligation process results in a bias in where the reads start in the period.

Neither of these seems likely to me.

--
Phillip
Old 07-14-2011, 01:28 AM   #10
lionelguy
Junior Member
 
Location: Sweden

Join Date: Apr 2009
Posts: 8
Default

The sequencing platform sent me an email, and they checked a couple more things. In short:
  • the bases at the beginning are of lower quality because Illumina changed the base calling software and is now calling the first bases with lower quality
  • they tried to look at nucleotide distribution ignoring all bases with Q < 30, but the periodic variation stays (I don't have their figure though)
  • they had no explanation for the G/C and A/T imbalance

They have contacted Illumina and are waiting for their response.

I checked whether I could find the periodic variation in a eukaryote sample, without success (I tested one single run from WGS of Sorex araneus, acc SRR099426). But I agree with Phillip: finding the codon structure there seems extremely unlikely. I would favor a technical artefact.
Old 07-14-2011, 05:46 AM   #11
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by lionelguy View Post
[...]

I checked whether I could find the periodic variation in a eukaryote sample, without success (I tested one single run from WGS of Sorex araneus, acc SRR099426). But I agree with Phillip: finding the codon structure there seems extremely unlikely. I would favor a technical artefact.
For "codon structure" sequence biases to be detectable in this manner it would require the adapter to be ligated at the same place in the cycle. That is, all the inserts would need to start at (for example) the peak of your cycle. That seems unlikely to me no matter what the source of DNA is (prokaryotic or eukaryotic).

--
Phillip
Old 07-14-2011, 06:00 AM   #12
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

We are seeing very weak periodicity, which can be caused by very subtle bias. It is known that the starting position of a read is not entirely random. I am not saying my speculation is a likely explanation. I just think we cannot exclude this possibility unless we see this pattern in vertebrates (one lane showing the periodicity is more than enough), or find a better explanation. Besides, a non-random read starting position is itself a "technical artifact".
Old 07-17-2011, 12:23 PM   #13
dmborek
Junior Member
 
Location: USA

Join Date: Jun 2010
Posts: 5
Default

Have you by any chance used a Nextera kit for preparing the libraries?
Old 08-02-2011, 08:14 PM   #14
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default

I read this thread with great interest because I have just discovered the same 3 bp periodicity artifact in some of my recent data, though it's not as strong as the OP's. I have microbial metagenomic libraries sequenced using an overlapping 100 bp PE approach on the HiSeq. I assembled the overlapping reads, quality filtered to require 90% of bases > Q30, and trimmed off leading inline barcodes (though not reverse ones in the graphs shown). The inline barcoded libraries were run 12 per lane with a spike-in of 5% phiX to create greater complexity in the first 4 cycles for improved cluster detection. What's interesting is that the phiX spike-in shows very very weak to no periodicity in GC content (file phix_spikein) while the libraries show a definite 3 bp periodicity in sequence composition (other two files). The phiX control lane showed an extremely subtle 3 bp periodicity in GC content (file phix_controllane_gc). So the periodicity seems to be present across lanes, though it is much stronger in the microbial libraries. I don't believe this to be a quality issue since it shows up in quality filtered, assembled reads and the overall cluster yield was very good.

Perhaps lh3 has a point about a biological basis for this 3 bp cycling though I have a hard time envisioning how that would work. Other lanes on the same flow cell as my reads had vertebrate sequence so I'll take a look at that next, but am very interested in any input.

PS libraries were prepped with Covaris shearing, Pippin prep size selection, and custom inline adapters. They were quantified with Kapa qPCR before pooling and loading.
Attached Images
File Type: jpg per_base_gc_content.jpg (83.7 KB, 26 views)
File Type: jpg per_base_sequence_content.jpg (92.2 KB, 23 views)
File Type: jpg phix_spikein.jpg (28.8 KB, 13 views)
File Type: jpg phix_controllane_gc.jpg (51.4 KB, 14 views)

Last edited by greigite; 08-02-2011 at 08:14 PM. Reason: add information
Old 08-03-2011, 08:07 AM   #15
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default no periodicity in xenopus data

Update on my last post: I do not see the 3 bp periodicity in someone else's Xenopus sequence capture libraries sequenced in another lane of the same flow cell as my microbial libraries, except perhaps for a little bit in the first few cycles. So it appears to be specific to, or exaggerated in, the microbial libraries.
Attached Images
File Type: jpg xenopus_gc.jpg (64.6 KB, 14 views)
Old 08-03-2011, 08:35 AM   #16
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Another way to check, if you have the reference genome: map your reads to the reference genome, collect the reads mapped to the forward strand and in coding regions, and count the number of reads starting at phase 1, 2 and 3 separately. If my hypothesis is correct, the 3 numbers will be statistically different and correlated with the GC% at the 3 phases.
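
A minimal sketch of that count, under simplifying assumptions that are not part of the suggestion itself: only forward-strand genes are considered, positions are 0-based, and gene intervals are half-open and begin at the first base of the start codon. The function names are hypothetical. Phases are numbered 0, 1, 2 here, corresponding to phases 1, 2, 3 above. For 2 degrees of freedom the chi-square survival function is exactly exp(-x/2), so no statistics library is needed.

```python
import math

def start_phase_counts(read_starts, genes):
    """Count forward-strand read starts by codon phase (0, 1, 2).

    read_starts : leftmost mapping positions of forward-strand reads
    genes       : (start, end) intervals of forward-strand coding sequences
    """
    counts = [0, 0, 0]
    for pos in read_starts:
        for g_start, g_end in genes:  # linear scan; fine for a sketch
            if g_start <= pos < g_end:
                counts[(pos - g_start) % 3] += 1
                break
    return counts

def uniformity_test(counts):
    """Chi-square test of three phase counts against equal expectation.

    Returns (chi2, p_value).
    """
    total = sum(counts)
    if total == 0:
        return 0.0, 1.0
    expected = total / 3
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2, math.exp(-chi2 / 2)  # exact for 2 degrees of freedom

counts = start_phase_counts([10, 11, 12, 13, 5, 40, 39], [(10, 40)])
print(counts)  # [2, 1, 2]
```

A small p-value from `uniformity_test` would only say that read starts are not uniform across phases; whether they track the GC% at each phase has to be checked separately against the reference.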
Old 08-03-2011, 09:04 AM   #17
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default

Quote:
Originally Posted by lh3 View Post
Another way to check, if you have the reference genome: map your reads to the reference genome, collect the reads mapped to the forward strand and in coding regions, and count the number of reads starting at phase 1, 2 and 3 separately. If my hypothesis is correct, the 3 numbers will be statistically different and correlated with the GC% at the 3 phases.
Great suggestion. Unfortunately, my libraries are metagenomes, so no reference is available. I hope someone else will be able to do this analysis and post the results!