
**lh3** · 07-11-2011, 09:54 AM

These observations are very interesting. Nonetheless, I would not be so surprised to see all these.

1) Sometimes the first ~10 cycles are not stable. Do your reads have lower base quality there?

2&3) Unless things have changed, Illumina has this notorious cross-talk between A/C signals and G/T signals, and the level of cross-talk increases with increasing cycles. Illumina data definitely show biases in GC content before error correction.

4) The start positions of reads are not random according to some works on de novo assembly. If this is true, it must be correlated to the sequence context. You are sequencing bacteria, most of which are coding regions and thus have the period 3 effect (e.g. the GC content at the 3 phases may be different). This may lead to the period in your sequence data. Just a guess.
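To make the "period 3 effect" in point 4 concrete, here is a minimal sketch that measures GC content separately at the three codon positions of an in-frame coding sequence. The function name and the toy sequence are made up for illustration; nothing here comes from the data discussed in this thread.

```python
def gc_by_codon_position(cds):
    """GC fraction at codon positions 1, 2 and 3 of an in-frame coding sequence."""
    cds = cds.upper()
    fractions = []
    for phase in range(3):
        # Every third base, starting at this codon position.
        bases = cds[phase::3]
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions

# Toy in-frame sequence (invented): every codon here ends in G or C.
toy_cds = "ATGAAGTTCGACATCAAG"
print(gc_by_codon_position(toy_cds))
```

If the three values differ in real genes, any bias in which codon position reads tend to start at would carry over into a per-cycle composition plot.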

**lionelguy** · 07-11-2011, 10:55 PM

Thanks for your help! Still, I think this run is not "normal" and something fishy happened...

1) The first 10 bases or so have a somewhat lower (~34) quality than the next 20 (40). But the difference in quality cannot account for the whole thing. I would expect qualities <20 there.

2&3) Here there is no increase along the read, but the cross-talk could indeed explain part of the difference. But a 2.5% difference between A and T seems like an awful lot of cross-talk: 1 in every 40 bases must be wrong (i.e. quality ~16), and this is not reflected in the quality scores, which are sky-high there (~35).

4) I strongly doubt that codon usage could explain such a regularity. Even though a lot of sequences are coding regions in bacteria, I don't see how both direct and complementary strand would be read in the same frame, and this for all the reads... Seems more like a technical thing to me. Is there any step (washing, recalibrating, etc...) that occurs every third cycle in Illumina sequencing?
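The quality arithmetic in point 2&3 follows from the standard Phred definition, Q = -10 * log10(p), where p is the error probability. A small sketch for checking such numbers (the helper names are mine):

```python
import math

def phred_from_error_prob(p):
    """Phred quality Q = -10 * log10(p) for an error probability p."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse: error probability p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# One wrong base in 40 corresponds to roughly Q16.
print(round(phred_from_error_prob(1 / 40), 1))
# A reported quality of 35 implies about one error in 3,000 bases.
print(round(1 / error_prob_from_phred(35)))
```

The same conversion shows that an error rate 100 times higher than Q40 claims would correspond to Q20.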

**lh3** · 07-12-2011, 04:23 AM

I think you also need to do the plot for some public hiseq data sets from SRA to make sure this is not a failed run. Quality alone is not always telling.

**lionelguy** · 07-12-2011, 04:39 AM

I did that for some data from BGI - same machine, same design. The results are quite different. The periodic variation can be observed, but to a much lower extent, as can the instability in the first 10-20 nucleotides.

On the contrary, the deviation from G=C and A=T is not observed, except at the very end of the read (and not to the same extent as in my samples).

And I agree, quality alone is not telling, but then the error models should be adapted, because if I have a Q value of 40 and I should expect 100 times more errors, I'd rather know it...

So you suspect a failed run in my case? And what about the periodic variation, which seems to be present in the BGI run as well? Any idea? I searched SeqAnswers and Google extensively, but as far as I can tell no one has observed that phenomenon...

**pmiguel** · 07-12-2011, 04:40 AM

You probably have a few different things going on here. But the stuff going on at the beginning of your reads is probably adapter dimer sequence. You can probably read the peaks off by eye and you will likely see the sequence of whatever adapter was used. I was able to do this using a similar analysis on some SMART cDNAs we decided to throw on the instrument:

Sequencing Analysis Viewer trick. - SEQanswers

http://seqanswers.com/forums/showthread.php?p=40840#post40840


My guess is that the later periodicity would disappear if you removed lower quality bases (below 20 or so) from your data set. (Or at least your plot). At very low quality values the base calls are little more than "guesses". So I think that periodicity is just some bias towards certain bases in even vs. odd cycles.
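A minimal sketch of that test: per-cycle base fractions, counting only bases at or above a quality cutoff. It assumes (sequence, quality string) pairs as found in FASTQ and a Phred+33 encoding. The offset is an assumption, since older Illumina pipelines used +64, so check it against the actual data.

```python
from collections import Counter

def masked_composition(records, min_q=20, offset=33):
    """Per-cycle base fractions, counting only bases with quality >= min_q.

    records: iterable of (sequence, quality_string) pairs, as in FASTQ.
    offset:  ASCII offset of the quality encoding (33 is assumed here).
    """
    counts = []
    for seq, qual in records:
        for cycle, (base, qchar) in enumerate(zip(seq.upper(), qual)):
            if cycle == len(counts):
                counts.append(Counter())
            # Skip base calls below the cutoff instead of dropping the read.
            if ord(qchar) - offset >= min_q:
                counts[cycle][base] += 1
    result = []
    for c in counts:
        # N calls and other symbols are left out of the denominator.
        total = sum(c[b] for b in "ACGT")
        result.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return result
```

Calling it with `min_q=0` gives the unfiltered composition, so the two plots can be compared directly.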

--
Phillip

**lh3** · 07-12-2011, 05:17 AM

My guess is that the starting coordinate of a read/pair is context dependent, for example preferentially falling at certain oligonucleotides. Nonetheless, if you still see a period of 3 in human data, my hypothesis must be wrong.

The deviation from G=C and A=T in BGI data is expected: cross-talk is mainly an issue at the end of a read. I heard that a recent Illumina pipeline gives lower quality for the first ~10 cycles - they are aware of the issue. But in general, it is hard to get a scoring model that works well under every artifact. All the hiseq data I have seen have a very low error rate and presumably do not have this G=C and A=T deviation.

**lionelguy** · 07-12-2011, 05:47 AM

Originally posted by pmiguel:

> My guess is that the later periodicity would disappear if you removed lower quality bases (below 20 or so) from your data set. (Or at least your plot). At very low quality values the base calls are little more than "guesses".

I will definitely try that.

Originally posted by pmiguel:

> So I think that periodicity is just some bias towards certain bases in even vs. odd cycles.

Except that the period is 3, not 2, so it can't be even vs. odd...

I will also test it with data from eukaryotes, to see if the periodicity disappears.
Thanks!

**pmiguel** · 07-12-2011, 06:39 AM

Originally posted by lionelguy:

> I will definitely try that.

Please post your results, if you would...

Originally posted by lionelguy:

> Except that the period is 3, not 2, so it can't be even vs. odd...
>
> I will also test it with data from eukaryotes, to see if the periodicity disappears.
> Thanks!

The chance that you're picking up a biological periodicity seems very low to me. It would require:
(1) That such a periodicity exists.
(2) That the amplicon construction process is, to some extent, period specific as well. That is, that the fragmentation/ligation process results in a bias in where the reads start in the period.

Neither of these seems likely to me.

--
Phillip

**lionelguy** · 07-14-2011, 01:28 AM

The sequencing platform sent me an email, and they checked a couple more things. In short:

the bases at the beginning are of lower quality because Illumina changed the base-calling software, which now assigns lower quality to the first bases

they tried to look at nucleotide distribution ignoring all bases with Q < 30, but the periodic variation stays (I don't have their figure though)

they had no explanation for the G/C and A/T imbalance

They have contacted Illumina and are waiting for their response.

I checked if I could find the periodic variation in a eukaryote sample, without success (I tested one single run from WGS of Sorex araneus, acc SRR099426). But I agree with Phillip, finding the codon structure there seems extremely unlikely. I would favor a technical artefact.

**pmiguel** · 07-14-2011, 05:46 AM

Originally posted by lionelguy:

> [...]
>
> I checked if I could find the periodic variation in a eukaryote sample, without success (I tested one single run from WGS of Sorex araneus, acc SRR099426). But I agree with Phillip, finding the codon structure there seems extremely unlikely. I would favor a technical artefact.

For "codon structure" sequence biases to be detectable in this manner it would require the adapter to be ligated at the same place in the cycle. That is, all the inserts would need to start at (for example) the peak of your cycle. That seems unlikely to me no matter what the source of DNA is (prokaryotic or eukaryotic).
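The phase argument can be made concrete with a small calculation. This is only a sketch under stated assumptions: the per-phase GC values and the start-phase weights below are invented, and reverse-strand reads and non-coding sequence are ignored. It computes the expected GC fraction at each cycle when a read starts at codon position s with a given probability.

```python
def expected_gc_per_cycle(gc_by_phase, start_weights, n_cycles):
    """Expected GC fraction at each cycle when reads start at codon
    position s with probability start_weights[s]."""
    profile = []
    for cycle in range(n_cycles):
        # A read that started at position s sits at (s + cycle) % 3 now.
        gc = sum(w * gc_by_phase[(s + cycle) % 3]
                 for s, w in enumerate(start_weights))
        profile.append(gc)
    return profile

gc_by_phase = [0.55, 0.40, 0.70]  # hypothetical values, not measured

uniform = expected_gc_per_cycle(gc_by_phase, [1 / 3, 1 / 3, 1 / 3], 6)
biased = expected_gc_per_cycle(gc_by_phase, [0.40, 0.30, 0.30], 6)
print([round(x, 3) for x in uniform])
print([round(x, 3) for x in biased])
```

With uniform start positions every cycle has the same expected GC content, so no periodicity survives. Any departure from uniform leaves a period-3 ripple whose size depends on how strong the start bias is.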

--
Phillip

**lh3** · 07-14-2011, 06:00 AM

We are seeing very weak periodicity, which can be caused by very subtle bias. It is known that the starting position of a read is not entirely random. I am not saying my speculation is a likely explanation. I just think we cannot exclude this possibility unless we see this pattern in vertebrates (one lane showing the periodicity is more than enough), or find a better explanation. Besides, a non-random read starting position is also a "technical artifact".
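One way to put a number on "very weak periodicity", so that lanes and organisms can be compared, is the amplitude of the period-3 Fourier component of a per-cycle profile (for example GC fraction per cycle). A minimal sketch; the function name is mine, and this is a suggestion rather than anything used in this thread:

```python
import cmath
import math

def period3_amplitude(profile):
    """Amplitude of the period-3 component of a per-cycle profile.

    Computes the discrete Fourier coefficient at a frequency of 1/3 per
    cycle after subtracting the mean, scaled so that a pure cosine of
    amplitude A (over a whole number of periods) returns A.
    """
    n = len(profile)
    mean = sum(profile) / n
    coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k / 3)
                for k, x in enumerate(profile))
    return 2 * abs(coeff) / n
```

A flat profile gives zero. To judge whether a small value is meaningful, it could be compared with the amplitudes at neighbouring frequencies or with the same statistic from a control lane.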

**dmborek** · 07-17-2011, 12:23 PM

Have you by any chance used a Nextera kit for preparing the libraries?

**greigite** · 08-02-2011, 08:14 PM

I read this thread with great interest because I have just discovered the same 3 bp periodicity artifact in some of my recent data, though it's not as strong as the OP's. I have microbial metagenomic libraries sequenced using an overlapping 100 bp PE approach on the HiSeq. I assembled the overlapping reads, quality filtered to require 90% of bases > Q30, and trimmed off leading inline barcodes (though not reverse ones in the graphs shown). The inline barcoded libraries were run 12 per lane with a spike-in of 5% phiX to create greater complexity in the first 4 cycles for improved cluster detection. What's interesting is that the phiX spike-in shows very very weak to no periodicity in GC content (file phix_spikein) while the libraries show a definite 3 bp periodicity in sequence composition (other two files). The phiX control lane showed an extremely subtle 3 bp periodicity in GC content (file phix_controllane_gc). So the periodicity seems to be present across lanes, though it is much stronger in the microbial libraries. I don't believe this to be a quality issue since it shows up in quality filtered, assembled reads and the overall cluster yield was very good.
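For anyone wanting to apply the same kind of filter, a read-level check of "90% of bases > Q30" could look like the sketch below. It assumes a FASTQ quality string with Phred+33 encoding (an assumption, adjust `offset` for other encodings) and is not the actual tool used for these libraries.

```python
def passes_quality_filter(qual, min_q=30, min_fraction=0.9, offset=33):
    """True if at least min_fraction of the bases have quality above min_q.

    qual:   FASTQ quality string.
    offset: ASCII offset of the quality encoding (33 is assumed here).
    """
    if not qual:
        return False
    # Strictly greater than min_q, matching "> Q30".
    good = sum(ord(ch) - offset > min_q for ch in qual)
    return good / len(qual) >= min_fraction
```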

Perhaps lh3 has a point about a biological basis for this 3 bp cycling though I have a hard time envisioning how that would work. Other lanes on the same flow cell as my reads had vertebrate sequence so I'll take a look at that next, but am very interested in any input.

PS libraries were prepped with Covaris shearing, Pippin prep size selection, and custom inline adapters. They were quantified with Kapa qPCR before pooling and loading.


**greigite** · 08-03-2011, 08:07 AM

No periodicity in Xenopus data

Update on my last post: I do not see the 3 bp periodicity in someone else's Xenopus sequence capture libraries sequenced in another lane of the same flow cell as my microbial libraries, except perhaps for a little bit in the first few cycles. So it appears to be specific to, or exaggerated in, the microbial libraries.

Attached Files

xenopus_gc.jpg (64.6 KB, 43 views)


Periodic variation in nucleotide distribution along the read & other strange things
