  • Intermittent inclusion of the beginning of read 1 in read 2

    So this is probably the weirdest problem I've ever seen. We have one run from a HiSeq 2500 where it appears that the first 8-15 bases of read #1 will often (but not always) appear at the beginning of read #2. In other words, we have something like the following:

    Code:
    @read1
    ACTGACTGACatgctacatcgatgtcat
    @read2
    ACTGACTGACtgacgtagctgtaaatcg
    The duplicated part is in upper case and differs between fragments. The lower case part differs between reads in a pair (as one would expect).
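
    One way to quantify how often this happens is to compare the start of each read 1 with the start of its mate. The following is only a minimal sketch, assuming uncompressed, 4-line-per-record FASTQ files with mates in the same order; the function names and the `min_len=8` threshold are illustrative choices, not part of any existing tool.

```python
def shared_prefix_len(seq1, seq2):
    """Return the number of leading bases that are identical in both reads."""
    n = 0
    for a, b in zip(seq1, seq2):
        if a != b:
            break
        n += 1
    return n


def fastq_seqs(lines):
    """Yield the sequence line of every 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def count_duplicated_starts(r1_lines, r2_lines, min_len=8):
    """Count pairs whose read 2 begins with the first min_len bases of read 1.

    Returns (flagged_pairs, total_pairs). Both inputs are iterables of
    lines, e.g. open file handles; mates must be in the same order.
    """
    flagged = total = 0
    for s1, s2 in zip(fastq_seqs(r1_lines), fastq_seqs(r2_lines)):
        total += 1
        if shared_prefix_len(s1, s2) >= min_len:
            flagged += 1
    return flagged, total
```

    With unrelated sequence, two reads share their first 8 bases by chance roughly once in 4^8 (about 65,000) pairs, so anything well above that rate points to the effect described here.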

    This is happening in a rather large portion of the reads from multiple samples from the same group (all run on the same flow cell and together these samples occupied the entire flow cell). This was noticed because this is a ChIPseq dataset and soft-clipping wasn't initially used in the mapping. Consequently, the paired-end alignment rate was abysmally low (30-60% and this is mouse ChIPseq...).

    Once this was brought to my attention I had a look at the data and aligned it with STAR. The alignment rate was then much better (90-95%), with STAR soft-clipping the "ACTGACTGAC" (for example) in read #2. In every case that I've seen, read #1 aligns fully (no soft-clipping, mismatches, or indels) to the genome and read #2 (except for the beginning duplicated sequence that gets soft-clipped) aligns with an appropriate insert size.
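
    The soft-clipping at the start of read #2 can be tallied straight from the aligner's SAM output. This is a hedged sketch rather than the exact check that was run: it assumes plain-text SAM lines, looks only at primary alignments, and uses made-up function names. Note that for reverse-strand alignments the start of the read as sequenced corresponds to the last CIGAR operation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def softclip_at_read_start(flag, cigar):
    """Soft-clipped bases at the 5' end of the read as it was sequenced."""
    # Hard clips carry no sequence, so ignore them when finding the end op.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return 0  # covers the "*" CIGAR of unaligned records
    # 0x10 = read aligned to the reverse strand: read start is the CIGAR end.
    length, op = ops[-1] if flag & 0x10 else ops[0]
    return length if op == "S" else 0


def read2_start_clips(sam_lines):
    """Yield the 5' soft-clip length of every primary read-2 alignment."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x900:
            continue
        if flag & 0x80:  # second read in the pair
            yield softclip_at_read_start(flag, fields[5])
```

    A histogram of the yielded values should show a bump at 8-15 bases if the duplicated sequence is what is being clipped.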

    I've confirmed that this isn't some weird error that happened during demultiplexing (I wrote a bcl parser this afternoon and parsed matching sequences out of the original bcl files). Further, the library prep was done by our core-facility people, who do a LOT of library prep and haven't seen this sort of thing either before this or since, so it's rather unlikely that something really crazy happened there. My only guess at this point is that something really really weird happened either during the ChIP itself or on the HiSeq. Has anyone seen anything like this before and, if so, were you able to figure out what happened?
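
    For anyone wanting to repeat the bcl check, here is a rough sketch (not the parser actually used above). It assumes the classic per-cycle .bcl layout: a 4-byte little-endian cluster count followed by one byte per cluster, with the base in the low 2 bits, the quality in the upper 6 bits, and an all-zero byte meaning no call. Filter files, tile layout, and demultiplexing are deliberately ignored, and the function names are hypothetical.

```python
import struct

BASES = "ACGT"


def parse_bcl(data):
    """Decode one cycle's .bcl content into a list of (base, quality)."""
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    body = data[4:4 + n_clusters]
    if len(body) != n_clusters:
        raise ValueError("truncated bcl data")
    calls = []
    for byte in body:
        if byte == 0:
            calls.append(("N", 0))  # all bits zero means no call
        else:
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls


def sequences_from_cycles(cycle_blobs):
    """Join per-cycle base calls into one sequence string per cluster.

    cycle_blobs holds the raw bytes of each cycle's .bcl file for a
    single tile, in cycle order.
    """
    per_cycle = [parse_bcl(blob) for blob in cycle_blobs]
    return ["".join(base for base, _ in cluster) for cluster in zip(*per_cycle)]
```

    If the files are gzip-compressed (.bcl.gz), reading them with `gzip.open(path, "rb").read()` before calling `parse_bcl` should be enough.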

  • #2
    Which library construction kit was used? Some now include methods to add at each end of an insert some random sequence of a known length. Bioo, for instance, uses this to reduce ligation site bias. But that would produce different sequence at either end.
    I think there are kits that add the same tag on both ends -- which could be used to eliminate chimeric clones. (Although I wouldn't think this would be a big issue for ChIP libraries...)

    --
    Phillip


    • #3
      Some sort of NEB kit, from what I've been told. It's the same kit that's used to construct all of the other ChIPseq libraries, none of which have produced this sort of effect (either prior to this run or since).


      • #4
        Were the libraries prepared using tagmentation?


        • #5
          No, this was your standard ChIPseq sort of library prep, no tagmentation.


          • #6
            just to be sure I understood: the upper case sequence is on the genome (meaning, it's really present in read1), while the problem concerns only the read2, so it's kind of inverted repeat on the genome, but you don't find this repeat on the genome.


            • #7
              Originally posted by SylvainL
              just to be sure I understood: the upper case sequence is on the genome (meaning, it's really present in read1), while the problem concerns only the read2, so it's kind of inverted repeat on the genome, but you don't find this repeat on the genome.
              Yes, exactly.


              • #8

                I would be interested in the explanation, then.

                I thought it could be tagmentation followed by a Klenow repair, which would keep the transposase "signature", but even then you wouldn't expect to get exactly the same sequence in both reads of each pair...


                • #9
                  Originally posted by dpryan
                  This is happening in a rather large portion of the reads from multiple samples from the same group (all run on the same flow cell and together these samples occupied the entire flow cell).
                  Don't want to be a conspiracy theorist but perhaps there is an explanation hidden in whatever the group is doing to prep the samples. Since you are experienced on both sides of world perhaps talking with whoever made the preps/libraries may root a cause out.

                  Is this n=1 (even though for multiple samples) and/or a repeated observation across multiple runs? You could also make Illumina aware by submitting a ticket. Perhaps someone else has reported something to them before.
                  Last edited by GenoMax; 01-26-2017, 07:52 AM.


                  • #10
                    Originally posted by GenoMax
                    Don't want to be a conspiracy theorist but perhaps there is an explanation hidden in whatever the group is doing to prep the samples. Since you are experienced on both sides of world perhaps talking with whoever made the preps/libraries may root a cause out.

                    Is this n=1 (even though for multiple samples) and/or a repeated observation across multiple runs? You could also make Illumina aware by submitting a ticket. Perhaps someone else has reported something to them before.
                    Yeah, one of our guesses would be that something went weird when the group did its IP, but we'll have to wait until the post-doc who did that is back from vacation to ask. Having said that, I'm not even sure how one could get this to happen during an IP (granted, the post-docs do enjoy coming up with new and creative ways of causing problems...).

                    This was an n=1 occurrence; we've had a few other (unproblematic) projects from this particular post-doc (and many, many more from his lab).


                    • #11
                      If you Google your capitalized sequence, it comes up as a motif that matches "Pbx3(Homeobox)/GM12878-PBX3-ChIP-Seq/Homer." That means nothing to me, but maybe it does to you or someone else?


                      • #12
                        The example is just random sequence that I typed in. In the real dataset, it varies by read. It's all mouse DNA and matches wherever read 1 aligns.


                        • #13
                          So far, the information in this thread can be summarised as follows:
                          1- The initial 8-15 bases of Read2 in some pairs are identical to the start of Read1
                          2- These sequences come from the genome: Read1 maps perfectly to the reference directly, Read2 does so after soft-clipping, and the distance between them matches the library insert size
                          3- It is not the result of the bcl2fastq software settings


                          Possible explanations:
                          1- Sequences are present in the library fragments (not known)
                          2- Sequences were added during sequencing steps (not known)
                          3- Sequences were generated by RTA software (not known)
                          4- Sequences were generated by bcl2fastq (ruled out)

                          I would be interested to know the run setup (read and index cycles). This seems unexplainable, and I would suggest spiking (at 5%) a couple of the libraries with the highest incidence of this observation into an unrelated library run to check whether the result is reproducible.


                          • #14
                            I like the idea of spiking the problem libraries and re-sequencing with a random pool to verify the result.

                            Sanger sequencing to confirm presence of those bases?
                            Last edited by GenoMax; 01-28-2017, 07:42 AM.


                            • #15
                              We all agreed to do a spike-in of the worst sample on an upcoming run. I'm curious to see what happens. I'll post back when I get some results.
