SEQanswers

Old 01-25-2017, 11:26 AM   #1
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470
Intermittent inclusion of the beginning of read 1 in read 2

So this is probably the weirdest problem I've ever seen. We have one run from a HiSeq 2500 where it appears that the first 8-15 bases of read #1 will often (but not always) appear at the beginning of read #2. In other words, we have something like the following:

Code:
@read1
ACTGACTGACatgctacatcgatgtcat
@read2
ACTGACTGACtgacgtagctgtaaatcg
The duplicated part is in upper case and differs between fragments. The lower-case part differs between the reads in a pair (as one would expect).
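
For anyone who wants to screen their own data for the same pattern, here is a minimal Python sketch that measures how many leading bases two mates share. The function names and the 8-base cutoff are illustrative assumptions (the cutoff just mirrors the low end of the 8-15 base range described above), and it works on plain sequence strings rather than full FASTQ records.

```python
def shared_prefix_length(read1: str, read2: str) -> int:
    """Return the number of leading bases that are identical in both reads."""
    n = 0
    for a, b in zip(read1.upper(), read2.upper()):
        if a != b:
            break
        n += 1
    return n


def fraction_affected(pairs, min_len=8):
    """Fraction of (read1, read2) pairs sharing at least min_len leading bases.

    min_len=8 is an assumed cutoff, not a value taken from the real analysis.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(shared_prefix_length(r1, r2) >= min_len for r1, r2 in pairs)
    return hits / len(pairs)
```

Two unrelated reads agree on their first 8 bases only about once in 4^8 (roughly 65,000) pairs by chance, so a large fraction of hits would point to a real effect.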

This is happening in a rather large portion of the reads from multiple samples from the same group (all run on the same flow cell and together these samples occupied the entire flow cell). This was noticed because this is a ChIPseq dataset and soft-clipping wasn't initially used in the mapping. Consequently, the paired-end alignment rate was abysmally low (30-60% and this is mouse ChIPseq...).

Once this was brought to my attention I had a look at the data and aligned it with STAR. The alignment rate was then much better (90-95%), with STAR soft-clipping the "ACTGACTGAC" (for example) in read #2. In every case that I've seen, read #1 aligns fully (no soft-clipping, mismatches, or indels) to the genome and read #2 (except for the beginning duplicated sequence that gets soft-clipped) aligns with an appropriate insert size.
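
A rough way to quantify the soft-clipping is to pull the clip length at the 5' end of each read 2 out of the SAM FLAG and CIGAR fields. The sketch below assumes standard SAM conventions (0x10 for reverse strand, 0x80 for second in pair); the function names are hypothetical and this is not necessarily how the numbers above were obtained.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_read2(flag: int) -> bool:
    """True if the SAM FLAG marks the record as the second mate (0x80)."""
    return bool(flag & 0x80)


def five_prime_softclip(flag: int, cigar: str) -> int:
    """Soft-clipped bases at the 5' end of the read as it was sequenced."""
    if cigar == "*":
        return 0
    # Hard clips are not part of the stored sequence, so skip them.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return 0
    # Reverse-strand alignments are stored reverse-complemented, so the
    # read's original 5' end is the LAST CIGAR operation when 0x10 is set.
    n, op = ops[-1] if flag & 0x10 else ops[0]
    return n if op == "S" else 0
```

Tallying `five_prime_softclip` over records where `is_read2` is true should give a clip-length distribution concentrated around 8-15 if the alignments look like the ones described here.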

I've confirmed that this isn't some weird error that happened during demultiplexing (I wrote a bcl parser this afternoon and parsed matching sequences out of the original bcl files). Further, the library prep was done by our core-facility people, who do a LOT of library prep and haven't seen this sort of thing either before this or since, so it's rather unlikely that something really crazy happened there. My only guess at this point is that something really really weird happened either during the ChIP itself or on the HiSeq. Has anyone seen anything like this before and, if so, were you able to figure out what happened?
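
For reference, decoding a single-cycle BCL file takes only a few lines. This is a minimal sketch, not the parser mentioned above: it assumes the uncompressed per-cycle layout described in Illumina's bcl2fastq documentation (a 4-byte little-endian cluster count, then one byte per cluster with the base in the low 2 bits and the quality in the upper 6 bits, and an all-zero byte meaning no call). Gzipped files would need decompressing first, and filter and position files are ignored here.

```python
import struct

BASES = "ACGT"


def parse_bcl(data: bytes):
    """Decode one cycle of base calls from the raw bytes of a .bcl file.

    Returns a list of (base, quality) tuples, one per cluster.
    """
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    calls = []
    for byte in data[4:4 + n_clusters]:
        if byte == 0:
            # An all-zero byte is reserved for a no-call.
            calls.append(("N", 0))
        else:
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls
```

Reading the same cluster index from each cycle's file, in cycle order, reconstructs that cluster's read, which can then be compared with the demultiplexed FASTQ.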
Old 01-26-2017, 04:44 AM   #2
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,213

Which library construction kit was used? Some now include methods that add a random sequence of known length to each end of an insert. Bioo, for instance, uses this to reduce ligation site bias. But that would produce a different sequence at each end.
I think there are kits that add the same tag on both ends -- which could be used to eliminate chimeric clones. (Although I wouldn't think this would be a big issue for ChIP libraries...)

--
Phillip
Old 01-26-2017, 04:46 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

Some sort of NEB kit, from what I've been told. It's the same kit that's used to construct all of the other ChIPseq libraries, none of which have produced this sort of effect (either prior to this run or since).
Old 01-26-2017, 06:12 AM   #4
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 173

Were the libraries prepared using tagmentation?
Old 01-26-2017, 06:13 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

No, this was your standard ChIPseq sort of library prep, no tagmentation.
Old 01-26-2017, 06:18 AM   #6
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 173

Just to be sure I understood: the upper-case sequence is in the genome (meaning it's really present in read 1), while the problem concerns only read 2, so it's a kind of inverted repeat, but you don't find this repeat in the genome.
Old 01-26-2017, 06:20 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

Quote:
Originally Posted by SylvainL View Post
Just to be sure I understood: the upper-case sequence is in the genome (meaning it's really present in read 1), while the problem concerns only read 2, so it's a kind of inverted repeat, but you don't find this repeat in the genome.
Yes, exactly.
Old 01-26-2017, 06:28 AM   #8
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 173


I would be interested in the explanation, then.

I thought it could be tagmentation followed by a Klenow repair, which would keep the transposase "signature", but even then you wouldn't expect to have exactly the same sequence in both reads of each pair...
Old 01-26-2017, 06:49 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,413

Quote:
Originally Posted by dpryan View Post
This is happening in a rather large portion of the reads from multiple samples from the same group (all run on the same flow cell and together these samples occupied the entire flow cell).
Don't want to be a conspiracy theorist, but perhaps there is an explanation hidden in whatever the group is doing to prep the samples. Since you are experienced on both sides of the world, perhaps talking with whoever made the preps/libraries may root out a cause.

Is this n=1 (even though for multiple samples) and/or a repeated observation across multiple runs? You could also make Illumina aware by submitting a ticket. Perhaps someone else has reported something to them before.

Last edited by GenoMax; 01-26-2017 at 06:52 AM.
Old 01-26-2017, 09:50 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

Quote:
Originally Posted by GenoMax View Post
Don't want to be a conspiracy theorist, but perhaps there is an explanation hidden in whatever the group is doing to prep the samples. Since you are experienced on both sides of the world, perhaps talking with whoever made the preps/libraries may root out a cause.

Is this n=1 (even though for multiple samples) and/or a repeated observation across multiple runs? You could also make Illumina aware by submitting a ticket. Perhaps someone else has reported something to them before.
Yeah, one of our guesses would be that something went weird when the group did its IP, but we'll have to wait until the post-doc who did that is back from vacation to ask. Having said that, I'm not even sure how one could get this to happen during an IP (granted, the post-docs do enjoy coming up with new and creative ways of causing problems...).

This was an n=1 occurrence; we've had a few other (unproblematic) projects from this particular post-doc (and many, many more from his lab).
Old 01-27-2017, 07:25 AM   #11
microgirl123
Senior Member
 
Location: New England

Join Date: Jun 2012
Posts: 188

If you Google your capitalized sequence, it comes up as a motif that matches "Pbx3(Homeobox)/GM12878-PBX3-ChIP-Seq/Homer." That means nothing to me, but maybe it does to you or someone else?
Old 01-27-2017, 02:29 PM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

The example is just a random sequence that I typed in. In the real dataset, it varies by read. It's all mouse DNA and matches wherever read 1 aligns.
Old 01-27-2017, 03:54 PM   #13
nucacidhunter
Senior Member
 
Location: Iran

Join Date: Jan 2013
Posts: 1,014

So far, the information in this thread can be summarised as follows:
1- In some pairs, the first 8-15 bases of Read2 are identical to those of Read1
2- These sequences are from the genome: Read1 maps perfectly to the reference directly, Read2 does so after soft-clipping, and the distance between them matches the library insert sizes
3- It is not the result of bcl2fastq software settings


Possible explanations:
1- Sequences are present in the library fragments (not known)
2- Sequences were added during sequencing steps (not known)
3- Sequences were generated by RTA software (not known)
4- Sequences were generated by bcl2fastq (ruled out)

I would be interested to know the run setup (read and index cycles). This seems unexplainable, and I would suggest spiking (5%) a couple of the libraries with the highest incidence of this observation into an unrelated library run to check whether the data are reproducible.
Old 01-28-2017, 06:37 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,413

I like the idea of spiking the problem libraries and re-sequencing with a random pool to verify the result.

Sanger sequencing to confirm presence of those bases?

Last edited by GenoMax; 01-28-2017 at 06:42 AM.
Old 01-28-2017, 11:26 AM   #15
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

We all agreed to do a spike-in of the worst sample on an upcoming run. I'm curious to see what happens. I'll post back when I get some results.
Old 01-28-2017, 04:24 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,641

This sequence is a simple repeat identical to its own complement (not reverse-complement) when shifted by two positions, which makes me suspicious. Just a general suspicion, though; I can't identify any culprits. If not for dpryan's extraordinary diligence in analyzing the raw data I would have thought it was a software problem. But since that seems to have been ruled out, I wonder if the self-complementary nature might be important. Have you tried aligning read1 and read2 independently to the reference, to see if in that context, read2 might map without soft-clipping?
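
The shifted-complement property is easy to verify for the example sequence. A small Python sketch follows; the function names are made up for illustration, and it handles plain A/C/G/T only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-wise complement, without reversing the sequence."""
    return seq.upper().translate(COMPLEMENT)


def matches_shifted_complement(seq: str, shift: int) -> bool:
    """True if seq, offset by `shift` bases, equals its own complement
    over the region where the two overlap."""
    seq = seq.upper()
    if not 0 < shift < len(seq):
        return False
    return seq[shift:] == complement(seq)[: len(seq) - shift]
```

For "ACTGACTGAC" the complement is "TGACTGACTG", and dropping the first two bases of the original gives "TGACTGAC", which matches the start of the complement.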
Old 01-28-2017, 08:16 PM   #17
nucacidhunter
Senior Member
 
Location: Iran

Join Date: Jan 2013
Posts: 1,014

It could still be the base-calling software (RTA) that creates the BCL files. This is more likely if the run was paused and restarted, or if the available disk space was limited at some point during the run.

PS. Since only a subset of reads have been affected it might be useful to check if they are from the same tile or random.
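
The lane and tile of each affected read can be read straight from its name. The sketch below assumes the colon-separated Illumina read identifier layout (instrument:run:flowcell:lane:tile:x:y, optionally followed by a space and the read/filter/index fields); the function names are hypothetical.

```python
from collections import Counter


def lane_and_tile(read_name: str):
    """Extract (lane, tile) from an Illumina-style read identifier."""
    fields = read_name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {read_name!r}")
    return int(fields[3]), int(fields[4])


def count_by_tile(read_names):
    """Count reads per (lane, tile) for an iterable of read names."""
    return Counter(lane_and_tile(name) for name in read_names)
```

Comparing these counts for affected reads against the counts for all reads would show whether the problem clusters on particular tiles or is spread evenly.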

Last edited by nucacidhunter; 01-28-2017 at 08:26 PM.
Old 01-30-2017, 04:36 AM   #18
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,470

@nucacidhunter: They're even from multiple lanes. That's what makes it all so weird.

@Brian: Read 1 will align very well, read 2 will only align with soft-clipping.