So this is probably the weirdest problem I've ever seen. We have one run from a HiSeq 2500 where it appears that the first 8-15 bases of read #1 will often (but not always) appear at the beginning of read #2. In other words, we have something like the following:
In the example below, the duplicated part is in upper case and differs between fragments. The lower-case part differs between the reads in a pair (as one would expect).
Code:
@read1
ACTGACTGACatgctacatcgatgtcat
@read2
ACTGACTGACtgacgtagctgtaaatcg
This is happening in a rather large portion of the reads from multiple samples from the same group (all run on the same flow cell and together these samples occupied the entire flow cell). This was noticed because this is a ChIPseq dataset and soft-clipping wasn't initially used in the mapping. Consequently, the paired-end alignment rate was abysmally low (30-60% and this is mouse ChIPseq...).
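For anyone who wants to check their own run for the same thing, here is a rough sketch of how one might quantify it: measure how many leading bases read #2 shares with read #1 in each pair and tabulate the lengths. The function names and the `min_len=8` cutoff are my own choices for illustration (8 is just the low end of the range described above), and the second pair in the toy input is made up. By chance alone, two unrelated reads share 8 or more leading bases only about once in 4^8 = 65,536 pairs, so a large count at 8+ is not noise.

Code:
```python
from collections import Counter

def shared_prefix_len(seq1, seq2):
    """Number of leading bases that are identical in both sequences."""
    n = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a != b:
            break
        n += 1
    return n

def prefix_length_histogram(pairs):
    """Count read pairs by the length of the prefix read 2 shares with read 1."""
    return Counter(shared_prefix_len(r1, r2) for r1, r2 in pairs)

def fraction_affected(hist, min_len=8):
    """Fraction of pairs whose shared prefix is at least min_len bases.
    min_len=8 is an assumed cutoff, not a calibrated one."""
    total = sum(hist.values())
    hits = sum(count for length, count in hist.items() if length >= min_len)
    return hits / total if total else 0.0

# Toy input: the first pair is the example from this post,
# the second is an invented pair with no shared prefix.
pairs = [
    ("ACTGACTGACatgctacatcgatgtcat", "ACTGACTGACtgacgtagctgtaaatcg"),
    ("GGCATTACGGATCCGATTACAGGCATTA", "TTAGCCGATACGGATTACCAGGATACCA"),
]
hist = prefix_length_histogram(pairs)
print(sorted(hist.items()))
print(fraction_affected(hist))
```

In practice `pairs` would come from iterating the two FASTQ files in lockstep rather than from a hard-coded list.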
Once this was brought to my attention I had a look at the data and aligned it with STAR. The alignment rate was then much better (90-95%), with STAR soft-clipping the "ACTGACTGAC" (for example) in read #2. In every case that I've seen, read #1 aligns fully (no soft-clipping, mismatches, or indels) to the genome and read #2 (except for the beginning duplicated sequence that gets soft-clipped) aligns with an appropriate insert size.
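If an aligner that doesn't soft-clip has to be used, one possible workaround (not what I did, I just let STAR soft-clip) would be to trim the duplicated prefix from read #2 before mapping. The sketch below assumes the duplicated stretch is exactly the run of leading bases that read #2 shares with read #1, and only trims when that run is at least `min_len` bases; the function name and the threshold are hypothetical.

Code:
```python
def trim_duplicated_prefix(r1_seq, r2_seq, r2_qual, min_len=8):
    """Return (sequence, qualities) for read 2 with the leading bases it
    shares with read 1 removed, if at least min_len bases are shared.

    Assumption: the shared run IS the duplicated stretch. A chance match
    just past the duplication can over-trim by a base or two.
    """
    shared = 0
    for a, b in zip(r1_seq.upper(), r2_seq.upper()):
        if a != b:
            break
        shared += 1
    if shared < min_len:
        # Too short to tell apart from coincidence: leave the read alone.
        return r2_seq, r2_qual
    # Trim sequence and quality string by the same amount so they stay in sync.
    return r2_seq[shared:], r2_qual[shared:]

r1 = "ACTGACTGACatgctacatcgatgtcat"
r2 = "ACTGACTGACtgacgtagctgtaaatcg"
q2 = "I" * len(r2)  # placeholder quality string for the example
trimmed_seq, trimmed_qual = trim_duplicated_prefix(r1, r2, q2)
print(trimmed_seq)
```

This only hides the symptom, of course; it says nothing about where the duplicated bases came from.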
I've confirmed that this isn't some weird error that happened during demultiplexing (I wrote a bcl parser this afternoon and parsed matching sequences out of the original bcl files). Further, the library prep was done by our core-facility people, who do a LOT of library prep and haven't seen this sort of thing either before this or since, so it's rather unlikely that something really crazy happened there. My only guess at this point is that something really really weird happened either during the ChIP itself or on the HiSeq. Has anyone seen anything like this before and, if so, were you able to figure out what happened?