  • Periodic variation in nucleotide distribution along the read & other strange things

    Hi,

    I received the results of a run from our sequence provider. We sequenced 11 bacterial samples (whole-genome extracts from a ~55% GC bacterium) on a HiSeq2000, with short paired-end reads, 100 nt each.

    Since the DNA that we provided is whole-genome extracts, double-stranded, looking at the nucleotide called per cycle (x-axis), we should see four straight, horizontal lines, with A exactly over T and G exactly over C, but that is not what we see (see attached pdf, where there is one page for each end and one plot for each sample):
    • There is some apparently random variation over the first 7-10 bases, which could possibly be explained by some remaining tags.
    • Over the first 20-40 bases, the G+C increases, and the A+T decreases, by several percent. As far as I can tell, this can't be explained by trailing adapters.
    • Once the G+C has stabilized, there is still (in some samples, not in all) a significant difference between the numbers of As and Ts on one side, and of Gs and Cs on the other. The effect is stronger between T and A. For example, in sample 6, read 2, it reaches a 2.5% difference, averaged over bases 41-60. Here there has to be some technical issue with either the chemistry or the base calling: since our DNA is double-stranded to start with, it necessarily has G=C and T=A.
    • There is also a significant periodic variation with a 3-cycle period. The differences are strong, reaching up to 1.3 % between points 1 and 2 of the period, in the C count. I have already seen such a periodic variation for Illumina runs, but this is much stronger than anything seen previously. These are also necessarily technical errors, and cannot come from the DNA.
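
For anyone who wants to reproduce this kind of plot, here is a minimal sketch of how the per-cycle composition and a windowed A/T difference could be computed. It assumes the reads are already loaded as plain strings; the function names and the toy reads are made up for illustration and are not taken from any particular pipeline.

```python
from collections import Counter


def per_cycle_composition(reads, bases="ACGT"):
    """Return one dict per cycle giving the fraction of each base at that cycle."""
    if not reads:
        return []
    n_cycles = max(len(r) for r in reads)
    counts = [Counter() for _ in range(n_cycles)]
    for read in reads:
        for cycle, base in enumerate(read.upper()):
            if base in bases:  # skip N and other ambiguity codes
                counts[cycle][base] += 1
    composition = []
    for c in counts:
        total = sum(c.values())
        composition.append({b: (c[b] / total if total else 0.0) for b in bases})
    return composition


def mean_difference(composition, base_x, base_y, first_cycle, last_cycle):
    """Mean of (fraction of base_x - fraction of base_y) over 1-based cycles."""
    window = composition[first_cycle - 1:last_cycle]
    return sum(c[base_x] - c[base_y] for c in window) / len(window)


# Toy reads, purely illustrative.
reads = ["ACGT", "AGGT", "TCGA", "ACCA"]
comp = per_cycle_composition(reads)
print(comp[0])
print(mean_difference(comp, "T", "A", 1, 4))
```

On real data, `mean_difference(comp, "T", "A", 41, 60)` would give the kind of T-minus-A figure quoted for bases 41-60.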


    Any thoughts about that? I already contacted our service provider, but I wanted to have the opinion of the community about these points... Thanks for your help!
    Attached Files

  • #2
    These observations are very interesting. Nonetheless, I would not be so surprised to see all these.

    1) Sometimes the first ~10 cycles are not stable. Do your reads have lower base quality there?

    2&3) Unless things have changed, Illumina has this notorious cross-talk between A/C signals and G/T signals, and the level of cross-talk increases with increasing cycles. Illumina definitely has biases in the GC content before correcting errors.

    4) The start positions of reads are not random according to some works on de novo assembly. If this is true, it must be correlated to the sequence context. You are sequencing bacteria, most of which are coding regions and thus have the period 3 effect (e.g. the GC content at the 3 phases may be different). This may lead to the period in your sequence data. Just a guess.
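
To check whether the three codon phases really differ in GC content for a given genome, something like the sketch below could be run on an annotated coding sequence. It assumes the sequence is in frame and starts at codon position 1; the function name and the short example sequence are hypothetical.

```python
def gc_by_codon_position(cds):
    """GC fraction at codon positions 1, 2 and 3 of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    gc = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            gc[i % 3] += 1
    n_codons = usable // 3
    return [g / n_codons for g in gc] if n_codons else [0.0, 0.0, 0.0]


# Toy coding sequence (ATG GCC AAG GCG TAA), purely illustrative.
print(gc_by_codon_position("ATGGCCAAGGCGTAA"))
```

Averaged over all annotated genes of a genome, the three values would show how strong the period-3 signal is in the DNA itself, before any sequencing effect.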



    • #3
      Thanks for your help! Still, I think this run is not "normal" and something fishy happened...

      1) The first 10 bases or so have a somewhat lower (~34) quality than the next 20 (40). But the difference in quality cannot account for the whole thing. I would expect qualities <20 there.

      2&3) Here there is no increase along the read, but cross-talk could indeed explain part of the difference. Still, a 2.5% difference between A and T seems like an awful lot of cross-talk: 1 base in every 40 must be wrong (i.e. quality ~15), and this is not reflected in the quality scores, which are sky-high there (~35).
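
The arithmetic behind that comparison is the standard Phred relation Q = -10 log10(p). A small sketch (the function names are mine):

```python
import math


def error_prob_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q):
    """Error probability corresponding to a Phred quality q."""
    return 10.0 ** (-q / 10.0)


# One wrong base in 40 corresponds to a quality of about 16.
print(round(error_prob_to_phred(1 / 40), 1))
# A reported quality of 35 claims roughly one error in 3,000 bases.
print(phred_to_error_prob(35))
```

So an error rate of 1 in 40 and a reported quality of ~35 differ by nearly two orders of magnitude in the implied error probability.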

      4) I strongly doubt that codon usage could explain such a regularity. Even though a lot of sequences are coding regions in bacteria, I don't see how both direct and complementary strand would be read in the same frame, and this for all the reads... Seems more like a technical thing to me. Is there any step (washing, recalibrating, etc...) that occurs every third cycle in Illumina sequencing?



      • #4
        I think you also need to do the plot for some public hiseq data sets from SRA to make sure this is not a failed run. Quality alone is not always telling.



        • #5
          I did that for some data from BGI - same machine, same design. The results are quite different. The periodic variation can be observed, but to a much lower extent, and the same goes for the instability over the first 10-20 nucleotides.

          In contrast, the deviation from G=C and A=T is not observed, except at the very end of the read (and not to the same extent as in my samples).

          And I agree, quality alone is not telling, but then the error models should be adapted, because if I have a Q value of 40 and I should expect 100 times more errors, I'd rather know it...

          So you suspect a failed run in my case? And what about the periodic variation, which seems to be present in the BGI run as well? Any idea? I searched SeqAnswers and Google extensively, but as far as I can tell no one has observed that phenomenon...



          • #6
            You probably have a few different things going on here. But the stuff going on at the beginning of your reads is probably adapter dimer sequence. You can probably read the peaks off by eye and you will likely see the sequence of whatever adapter was used. I was able to do this using a similar analysis on some SMART cDNAs we decided to throw on the instrument:

            Bridged amplification & clustering followed by sequencing by synthesis. (Genome Analyzer / HiSeq / MiSeq)


            My guess is that the later periodicity would disappear if you removed lower quality bases (below 20 or so) from your data set. (Or at least your plot). At very low quality values the base calls are little more than "guesses". So I think that periodicity is just some bias towards certain bases in even vs. odd cycles.
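
One way to try this is to mask low-quality calls before counting bases per cycle. The sketch below assumes Phred+33 quality strings (older Illumina pipelines used a +64 offset, so check your files); the function name and the example strings are made up.

```python
def mask_low_quality(seq, qual, min_q=20, offset=33):
    """Replace bases whose Phred score is below min_q with 'N'."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        base if ord(q) - offset >= min_q else "N"
        for base, q in zip(seq, qual)
    )


# 'I' encodes Q40, '5' encodes Q20 and '#' encodes Q2 under Phred+33.
print(mask_low_quality("ACGT", "I5#I"))
```

Masked positions can then be skipped when tallying the composition, so each cycle's percentages are computed only over bases at or above the threshold.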

            --
            Phillip



            • #7
              My guess is that the starting coordinate of a read/pair is context dependent, for example starting preferentially at certain oligonucleotides. Nonetheless, if you still see a period of 3 in human data, my hypothesis must be wrong.

              The deviation from G=C and A=T in BGI data is expected: cross-talk is mainly an issue at the end of a read. I heard that a recent Illumina pipeline gives lower quality for the first ~10 cycles - they are aware of the issue. But in general, it is hard to get a scoring model that works well under every artifact. All the HiSeq data I have seen have a very low error rate and presumably do not have this G=C and A=T deviation.



              • #8
                Originally posted by pmiguel
                My guess is that the later periodicity would disappear if you removed lower quality bases (below 20 or so) from your data set. (Or at least your plot). At very low quality values the base calls are little more than "guesses".
                I will definitely try that.

                Originally posted by pmiguel
                So I think that periodicity is just some bias towards certain bases in even vs. odd cycles.
                Except that the period is 3, not 2, so it can't be even vs. odd...
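
A simple way to tell a period of 3 from a period of 2 is to fold the per-cycle values by a candidate period and compare the phase means. This is only a sketch: the function names are mine and the C fractions below are invented numbers, not values from the run.

```python
def fold_by_period(values, period):
    """Mean of the per-cycle values at each phase (cycle index modulo period)."""
    sums = [0.0] * period
    counts = [0] * period
    for i, v in enumerate(values):
        sums[i % period] += v
        counts[i % period] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def phase_spread(values, period):
    """Difference between the highest and lowest phase mean."""
    means = fold_by_period(values, period)
    return max(means) - min(means)


# Invented per-cycle C fractions with a built-in 3-cycle pattern.
c_fraction = [0.27, 0.28, 0.26] * 10
for period in (2, 3, 4):
    print(period, round(phase_spread(c_fraction, period), 4))
```

On this toy series the spread is largest when folding by 3 and vanishes when folding by 2, which is the signature to look for in the real per-cycle counts.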

                I will also test it with data from eukaryotes, to see if the periodicity disappears.
                Thanks!



                • #9
                  Originally posted by lionelguy
                  I will definitely try that.
                  Please post your results, if you would...

                  Originally posted by lionelguy
                  Except that the period is 3, not 2, so it can't be even vs. odd...

                  I will also test it with data from eukaryotes, to see if the periodicity disappears.
                  Thanks!
                  The chance that you're picking up a biological periodicity seems very low to me. It would require:
                  (1) That such a periodicity exists.
                  (2) That the amplicon construction process is, to some extent, period specific as well. That is, that the fragmentation/ligation process results in a bias in where the reads start in the period.

                  Neither of these seems likely to me.
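
The two conditions can be illustrated with a small expected-value calculation. This is a deliberately simplified sketch: it considers reads from the coding strand only, and the codon-position frequencies and start-phase probabilities are hypothetical numbers chosen for illustration.

```python
def expected_per_cycle(freq_by_codon_pos, start_phase_probs, n_cycles):
    """Expected base frequency at each cycle.

    freq_by_codon_pos[k] is the frequency of the base at codon position k,
    start_phase_probs[k] is the probability that a read starts at position k.
    """
    period = len(freq_by_codon_pos)
    return [
        sum(p * freq_by_codon_pos[(k + cycle) % period]
            for k, p in enumerate(start_phase_probs))
        for cycle in range(n_cycles)
    ]


# Hypothetical C frequencies at codon positions 1, 2 and 3.
c_freq = [0.25, 0.20, 0.35]
# Reads starting at any codon position with equal probability.
uniform = expected_per_cycle(c_freq, [1 / 3, 1 / 3, 1 / 3], 6)
# Reads starting slightly more often at codon position 1.
biased = expected_per_cycle(c_freq, [0.40, 0.30, 0.30], 6)
print([round(x, 4) for x in uniform])
print([round(x, 4) for x in biased])
```

With uniform start phases the expected profile is flat even though the codon positions differ; a 3-cycle pattern appears only once the start phases are unequal, which is condition (2).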

                  --
                  Phillip



                  • #10
                    The sequencing platform sent me an email after checking a couple more things. In short:
                    • the bases at the beginning are of lower quality because Illumina changed the base calling software and are now calling first bases with lower quality
                    • they tried to look at nucleotide distribution ignoring all bases with Q < 30, but the periodic variation stays (I don't have their figure though)
                    • they had no explanation for the G/C and A/T imbalance


                    They have contacted Illumina and are waiting for their response.

                    I checked whether I could find the periodic variation in a eukaryote sample, without success (I tested one single run from WGS of Sorex araneus, acc SRR099426). But I agree with Phillip: finding the codon structure there seems extremely unlikely. I would favor a technical artefact.



                    • #11
                      Originally posted by lionelguy
                      [...]

                      I checked whether I could find the periodic variation in a eukaryote sample, without success (I tested one single run from WGS of Sorex araneus, acc SRR099426). But I agree with Phillip: finding the codon structure there seems extremely unlikely. I would favor a technical artefact.
                      For "codon structure" sequence biases to be detectable in this manner it would require the adapter to be ligated at the same place in the cycle. That is, all the inserts would need to start at (for example) the peak of your cycle. That seems unlikely to me no matter what the source of DNA is (prokaryotic or eukaryotic).

                      --
                      Phillip



                      • #12
                        We are seeing very weak periodicity, which can be caused by very subtle bias. It is known that the starting position of a read is not entirely random. I am not saying my speculation is a likely explanation. I just think we cannot exclude this possibility unless we see this pattern in vertebrates (one lane showing the periodicity is more than enough), or find a better explanation. Also, a non-random read starting position is itself a "technical artifact".



                        • #13
                          Have you by any chance used a Nextera kit for preparing the libraries?
                          Dominika Borek, Ph.D.
                          UT Southwestern Medical Center at Dallas
                          5323 Harry Hines Blvd.
                          Dallas, TX 75390
                          Tel. 214-645-6378
                          Fax. 214-645-6453



                          • #14
                            I read this thread with great interest because I have just discovered the same 3 bp periodicity artifact in some of my recent data, though it's not as strong as the OP's. I have microbial metagenomic libraries sequenced using an overlapping 100 bp PE approach on the HiSeq. I assembled the overlapping reads, quality filtered to require 90% of bases > Q30, and trimmed off leading inline barcodes (though not reverse ones in the graphs shown). The inline barcoded libraries were run 12 per lane with a spike-in of 5% phiX to create greater complexity in the first 4 cycles for improved cluster detection. What's interesting is that the phiX spike-in shows very very weak to no periodicity in GC content (file phix_spikein) while the libraries show a definite 3 bp periodicity in sequence composition (other two files). The phiX control lane showed an extremely subtle 3 bp periodicity in GC content (file phix_controllane_gc). So the periodicity seems to be present across lanes, though it is much stronger in the microbial libraries. I don't believe this to be a quality issue since it shows up in quality filtered, assembled reads and the overall cluster yield was very good.
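
For reference, a read-level filter of the kind described (at least 90% of bases above Q30) could look like the sketch below. It assumes Phred+33 quality strings; the function name and defaults are illustrative and not the exact code used for these libraries.

```python
def passes_quality_filter(qual, min_q=30, min_fraction=0.9, offset=33):
    """True if at least min_fraction of the bases have a Phred score above min_q."""
    if not qual:
        return False
    good = sum(1 for q in qual if ord(q) - offset > min_q)
    return good / len(qual) >= min_fraction


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(passes_quality_filter("IIIIIIIII#"))  # 9 of 10 bases above Q30
print(passes_quality_filter("IIIIIIII##"))  # 8 of 10 bases above Q30
```

Unlike per-base masking, this keeps or drops whole reads, so a periodicity that survives it is present in reads whose calls are almost all high quality.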

                            Perhaps lh3 has a point about a biological basis for this 3 bp cycling though I have a hard time envisioning how that would work. Other lanes on the same flow cell as my reads had vertebrate sequence so I'll take a look at that next, but am very interested in any input.

                            PS libraries were prepped with Covaris shearing, Pippin prep size selection, and custom inline adapters. They were quantified with Kapa qPCR before pooling and loading.
                            Attached Files
                            Last edited by greigite; 08-02-2011, 08:14 PM. Reason: add information



                            • #15
                              no periodicity in xenopus data

                              Update on my last post: I do not see the 3 bp periodicity in someone else's Xenopus sequence capture libraries sequenced in another lane of the same flow cell as my microbial libraries, except perhaps for a little bit in the first few cycles. So it appears to be specific to, or exaggerated in, the microbial libraries.
                              Attached Files
