SEQanswers

Old 03-01-2017, 11:21 PM   #41
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Here is a zoomed-in image of HiSeq 2500 duplicates for the same genome (it's an immortal human cell line that does not need amplification, or so I'm told).



This is not the same as the other image, as the x-axis is logarithmic rather than linear. But the important point in my opinion is that there is a rapid increase in duplicates detected up to a point (~45) and subsequently it is completely flat for a long time. That is what I expect from a platform that occasionally identifies oddly-shaped clusters as two clusters, or in which a well occasionally migrates to an adjacent well.

At ~1000, it starts going up again. I'm not sure about that - I would expect it to be sub-linear on the log scale, but then, I'm not sure what's happening in that region. The salient point is that there is a sharp increase over roughly the width of a cluster, and then a plateau, and finally another increase due to the increasing range. After dist=1000, I can't explain the slope. But, the graph only shows duplicates of less than 0.02% of reads, so it's not very important in practice. Still, it would be great if there was one less unsolved mystery.
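For anyone who wants to reproduce this kind of curve, here is a minimal sketch of counting duplicate pairs at log-spaced distance cutoffs. It is not how Clumpify produces its numbers; the function name and the example distances are made up for illustration, and the distances are assumed to be in the same coordinate units as the read headers.

```python
import bisect

def cumulative_duplicates(distances, thresholds):
    """For each threshold, count duplicate pairs whose separation is <= that threshold."""
    ordered = sorted(distances)
    return [bisect.bisect_right(ordered, t) for t in thresholds]

# Hypothetical separations between the two reads of each duplicate pair.
example_distances = [3, 12, 30, 44, 45, 800, 1500, 9000]
# Log-spaced cutoffs, matching a logarithmic x-axis.
thresholds = [10, 100, 1000, 10000]

counts = cumulative_duplicates(example_distances, thresholds)
for t, c in zip(thresholds, counts):
    print(t, c)
```

A plateau between two cutoffs shows up as the same count at both; a rise at large cutoffs means more pairs are only being captured because the search radius grew.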
Attached Images
File Type: png HighSeq_Duplicates.png (32.8 KB, 235 views)

Last edited by Brian Bushnell; 03-01-2017 at 11:32 PM.
Old 03-02-2017, 03:10 AM   #42
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,235
Default

Hi Brian,
Are you scoring the same number of reads with HiSeq/NovaSeq? If the number of reads for the NovaSeq were an order of magnitude higher, then for repetitive or mitochondrial DNA you might be able to use up all of the possible start sites.

Are you scoring clusters as a duplicate only if both forward and reverse reads are the same? Or are you only checking one side?

BTW, yes, a typical DNA prep from cell culture would yield enough DNA to make it unnecessary to amplify the library.

--
Phillip
Old 03-02-2017, 03:35 AM   #43
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,491
Default

Quote:
Originally Posted by pmiguel View Post
Hi Brian,
Are you scoring the same number of reads with HiSeq/NovaSeq? If the number of reads for the NovaSeq were an order of magnitude higher, then for repetitive or mitochondrial DNA you might be able to use up all of the possible start sites.
Probably not, since the best NovaSeq sample posted on BaseSpace has 1.6 billion reads. (The individual R1 and R2 files, if uncompressed, are 300G each! We have the possibility of uncompressed read files of 1TB each when S4 flow cells roll around later this year.)
Quote:
Are you scoring clusters as a duplicate only if both forward and reverse reads are the same? Or are you only checking one side?
That should be a yes since @Brian is probably using clumpify which takes both reads into account.

I am wondering if we are sampling the libraries so thoroughly on a NovaSeq that we have duplicates showing up due to oversampling.

Last edited by GenoMax; 03-02-2017 at 07:14 AM.
Old 03-02-2017, 07:12 AM   #44
misterc
Member
 
Location: Livermore, CA

Join Date: Jan 2016
Posts: 20
Default

Brian, your hypothesis is reasonable as there is no other possibility to explain the duplicate rate. Not surprisingly, we see similar duplicates on HiSeq 4000, as this 'characteristic' of ExAmp isn't limited to NovaSeq.
Old 03-02-2017, 08:12 AM   #45
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by pmiguel View Post
Hi Brian,
Are you scoring the same number of reads with HiSeq/NovaSeq? If the number of reads for the NovaSeq were an order of magnitude higher, then for repetitive or mitochondrial DNA you might be able to use up all of the possible start sites.
I might try running again after removing the mito, but it's not like mito accounts for >12% of the reads anyway. The number of reads was different, but this NovaSeq library only has twice the reads of the HiSeq library, so that doesn't explain the result.

Quote:
Are you scoring clusters as a duplicate only if both forward and reverse reads are the same? Or are you only checking one side?
As GenoMax indicated, yes, with this methodology both reads in a pair are required to match for the pair to be considered a duplicate. Due to the large insert size and variance, this is unlikely to occur by chance.
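To make the "both reads must match" rule concrete, here is a simplified sketch that only counts a pair as a duplicate when R1 and R2 are both identical. This is an exact-match toy, not Clumpify's actual algorithm (which is more involved), and the function name and example reads are hypothetical.

```python
from collections import defaultdict

def find_pair_duplicates(pairs):
    """Group read pairs by (R1, R2) sequence; return groups with more than one member.

    `pairs` is an iterable of (name, r1_sequence, r2_sequence) tuples.
    """
    groups = defaultdict(list)
    for name, r1, r2 in pairs:
        # The key uses both mates, so matching on R1 alone is not enough.
        groups[(r1, r2)].append(name)
    return [names for names in groups.values() if len(names) > 1]

pairs = [
    ("readA", "ACGT", "TTGA"),
    ("readB", "ACGT", "TTGA"),  # same R1 and R2 as readA: a duplicate pair
    ("readC", "ACGT", "GGCC"),  # same R1 only: not counted
]
dupes = find_pair_duplicates(pairs)
print(dupes)
```

Keying on both mates is what makes chance collisions unlikely: two independent fragments would need the same start and the same insert length.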

Quote:
Originally Posted by misterc
Brian, your hypothesis is reasonable as there is no other possibility to explain the duplicate rate. Not surprisingly, we see similar duplicates on HiSeq 4000, as this 'characteristic' of ExAmp isn't limited to NovaSeq.
I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...

But, it makes me wonder what the duplicate rates of the high-throughput flowcells will look like.
Old 03-03-2017, 12:04 PM   #46
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 416
Default

Brian, we were talking about this and wondered if you could test the breakage model by looking at the location of duplicates. This would be happening during flow, right? So, if it is breakage then the duplicates should all happen in the direction of flow, with little orthogonal movement. Could you either pull up sets of duplicates and look at the coordinates, or add separate dimension distances for checking for dups and then allow a short pixel distance in one dimension and long in the other and vice versa and see how that affects the results?
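A rough sketch of the second idea, with separate distance limits per axis: count how many same-tile duplicate pairs fall inside a window that is long in X and short in Y, and vice versa. The function names, thresholds, and coordinates are all hypothetical, and which axis corresponds to the direction of flow is not assumed here.

```python
def within_box(a, b, max_dx, max_dy):
    """True if reads a and b, given as (x, y), are within max_dx and max_dy of each other."""
    return abs(a[0] - b[0]) <= max_dx and abs(a[1] - b[1]) <= max_dy

def axis_bias(pairs, short, long):
    """Count duplicate pairs captured by an X-elongated window and by a Y-elongated window."""
    along_x = sum(within_box(a, b, long, short) for a, b in pairs)
    along_y = sum(within_box(a, b, short, long) for a, b in pairs)
    return along_x, along_y

# Hypothetical (x, y) positions of the two reads in each duplicate pair.
pairs = [
    ((100, 100), (105, 900)),   # displaced mostly in Y
    ((100, 100), (110, 2000)),  # displaced mostly in Y
    ((100, 100), (1500, 102)),  # displaced mostly in X
    ((0, 0), (3, 4)),           # close in both axes, counted by both windows
]
result = axis_bias(pairs, short=50, long=2500)
print(result)
```

If duplicates mostly travel with the flow, one of the two counts should be much larger than the other once the pairs that are close in both axes are set aside.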
__________________
Providing nextRAD genotyping services. http://snpsaurus.com
Old 03-03-2017, 02:11 PM   #47
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Oh, that's an interesting suggestion... theoretically, they shouldn't go upstream at all, or sideways very far. It might not be possible to differentiate between upstream and downstream duplicates (though theoretically, the downstream ones should have a weaker signal), but I can certainly add the ability to differentiate between the X and Y axis. I'll do that and post here when it's done.

I'd imagine that they should make a kind of cone-shaped pattern like the debris field from an airplane crash or tornado, but plotting that kind of thing is tricky since it's an all-or-nothing proposition that doesn't let you see the diminishing probability over the region.

Last edited by Brian Bushnell; 03-03-2017 at 02:15 PM.
Old 03-03-2017, 05:43 PM   #48
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,491
Default

Hasn't the plotting kind of been done already in this blog post: https://sequencing.qcfail.com/articl...ted-sequences/ ? I had posted this over in the clumpify thread.

I am wondering if the odd FC-wide duplicates are showing up due to oversampling of libraries (especially for NovaSeq data). Am I completely off-target in suggesting that as a possible cause?
Old 03-03-2017, 06:34 PM   #49
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.
Old 03-03-2017, 10:37 PM   #50
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 416
Default

GenoMax, I was looking at that blog post again and I thought, but couldn't be sure, that the HiSeq 4000 optical duplicates had a bias along the Y axis. I was hoping Brian could confirm or rule that out. The blog post also noted, "Significantly, 99% of HiSeq 4000 duplicates comprised di-tags originating from the same tile", which seems to be in contrast to the NovaSeq plot Brian produced, with its steady increase in duplicates over long distances. Maybe the seeding of fragments onto a NovaSeq's flow cell is different and the problem is greater? But your link is clearly relevant!
__________________
Providing nextRAD genotyping services. http://snpsaurus.com
Old 03-04-2017, 02:46 AM   #51
nucacidhunter
Senior Member
 
Location: Iran

Join Date: Jan 2013
Posts: 1,059
Default

Quote:
Originally Posted by Brian Bushnell View Post
I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...
I think this is a limitation of ExAmp cluster amplification rather than of the patterned flow cell. With ExAmp, reducing the loading concentration increases the duplication rate, because a fragment seeding one nanowell has more chance to seed other wells as well. Once there are more data from NovaSeq this can be investigated further.
Old 03-04-2017, 04:14 AM   #52
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,491
Default

Quote:
Originally Posted by Brian Bushnell View Post
I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.
I was thinking (perhaps simplistically) that the test NovaSeq data on BaseSpace is probably the same library loaded on multiple lanes. If we were to run clumpify across more than 2 (or even all) lanes then that would basically give us a collection of fragments that all have the same sequence.

As we add more lanes there is diminishing return of new fragments. If that happens then we are basically capturing all sequenceable fragments that are in this library?
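One way to check for that kind of saturation would be to count how many previously unseen fragments each additional lane contributes. The sketch below assumes each fragment has already been reduced to some hashable key (for example the R1 and R2 sequences); the function name and the toy lane contents are hypothetical.

```python
def new_fragments_per_lane(lanes):
    """For each lane in order, count fragments not seen in any earlier lane."""
    seen = set()
    new_counts = []
    for fragments in lanes:
        unique_here = set(fragments)
        # Only fragments absent from all earlier lanes count as new.
        new_counts.append(len(unique_here - seen))
        seen |= unique_here
    return new_counts

# Hypothetical fragment keys observed in three lanes of the same library.
lanes = [
    ["f1", "f2", "f3", "f4"],
    ["f2", "f3", "f5", "f6"],
    ["f1", "f5", "f6", "f7"],
]
new_counts = new_fragments_per_lane(lanes)
print(new_counts)
```

A steadily shrinking count per lane would be consistent with the library being sampled close to exhaustion; a roughly constant count would argue against oversampling.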
Old 03-16-2017, 06:06 AM   #53
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383
Default

Can anyone share exactly how they're getting X/Y coordinates from the patterned flowcell fastq? I'm only seeing a single number, which I am guessing corresponds to a well ID.
Old 03-16-2017, 09:07 AM   #54
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The NovaSeq read headers look like this:

Code:
@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA
In this case, 2:1105:16224:3004 is the positional information, in the format "lane:tile:X:Y". I got this data from BaseSpace; it's possible that SRA data has the read headers changed.
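As an illustration, the positional fields can be pulled out of such a header with a few lines of Python. This is only a sketch: the `ReadPosition` and `parse_position` names are made up, and it assumes the header has at least seven colon-separated fields with lane, tile, X and Y as the last four, as in the example above.

```python
from typing import NamedTuple

class ReadPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int

def parse_position(header: str) -> ReadPosition:
    """Extract lane, tile, X and Y from an Illumina-style FASTQ header line."""
    # Drop the leading '@' and the comment that follows the first space.
    name = header.lstrip("@").split(" ", 1)[0]
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected header format: {header!r}")
    # The last four colon-separated fields are lane, tile, X, Y.
    lane, tile, x, y = (int(f) for f in fields[-4:])
    return ReadPosition(lane, tile, x, y)

pos = parse_position("@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA")
print(pos)  # ReadPosition(lane=2, tile=1105, x=16224, y=3004)
```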
Old 03-16-2017, 10:58 AM   #55
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,491
Default

@GW_OK: Where did you get your data from? I got mine from BaseSpace and it looks like normal Illumina fastq data.
Old 03-16-2017, 04:46 PM   #56
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383
Default

Well, since I have a 3000 (but no extant PCR-free data) I wanted to look at the public 4000 data from BaseSpace, specifically:
NA12878-PCRfree450_S3_L003_R1_001.fastq.gz
NA12878-PCRfree450_S3_L003_R2_001.fastq.gz
from the
HiSeq4000: TruSeq PCRfree and Nano (350bp to 550bp insert size)
data set.

From what I can see in my survey of the fastq headers all of the X coordinates are set at '0', hence my confusion.

Edit:
Focusing on tile 1101, the headers go from
@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3
to
@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3

Last edited by GW_OK; 03-17-2017 at 05:34 AM.
Old 03-18-2017, 06:20 AM   #57
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383
Default

deleted due to duplication (hah)

Last edited by GW_OK; 03-20-2017 at 05:46 AM.
Old 03-20-2017, 05:12 AM   #58
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383
Default

The forums aren't letting me post a big post so I'm going to break this into three posts.

I've been intrigued with the question of duplicate-well directionality. Does it follow the direction of reagent flow? Setting aside the 4000 data set for a bit I moved over to the NovaSeq data, specifically NA12878-rep1. I pulled down the fastq files from BaseSpace and decided to initially plot (using ggplot) the actual XY coordinates for each read just to see what it looked like. To make visualization easier I focused solely on tile 1105. I still had to use a 10000x10000 png to get the wells spaced out enough.

It's pretty cool to look at. You can make out the ring fiducials quite clearly.

Link to bigger

There's no way to make out the ordered array, since not every well had a read, though there are what look like tracks of reads.
Old 03-20-2017, 05:14 AM   #59
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,491
Default

@GW_OK: You may have missed this post from QC Fail. They did something similar.

I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.

There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.

Last edited by GenoMax; 03-20-2017 at 05:20 AM.
Old 03-20-2017, 05:17 AM   #60
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383
Default

I then ran clumpify
Code:
markduplicates=t dupedist=2500 spantiles=t
to coalesce the duplicates. I then used a simple Perl script to parse the fastq headers into a "tile1 x1 y1 tile2 x2 y2" text file that I could use in ggplot to draw lines between duplicate wells. The first coordinates are the "initial" reads as given by clumpify and the second coordinates are the "duplicate" reads as labeled by clumpify. I pulled out all of the duplicate sets where both wells were within tile 1105. I was quite struck by how many wells duplicated over to the 2xxx tileset, which is the bottom surface, while 1105 is on the top.
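For those who prefer Python to Perl, the header-to-row step might look roughly like the sketch below. It is not the script used here: the function names are made up, the pairing of an "initial" header with its "duplicate" header is taken as given, the second header in the example is invented, and the surface is read off the first digit of a four-digit tile number.

```python
def position_fields(header):
    """Return (tile, x, y) as integers from an Illumina-style FASTQ header."""
    name = header.lstrip("@").split(" ", 1)[0]
    tile, x, y = (int(f) for f in name.split(":")[-3:])
    return tile, x, y

def duplicate_row(initial_header, duplicate_header):
    """Format one 'tile1 x1 y1 tile2 x2 y2' line for an initial/duplicate pair."""
    t1, x1, y1 = position_fields(initial_header)
    t2, x2, y2 = position_fields(duplicate_header)
    return f"{t1} {x1} {y1} {t2} {x2} {y2}"

def surface(tile):
    """First digit of a four-digit tile number: 1 for 1xxx tiles, 2 for 2xxx tiles."""
    return tile // 1000

initial = "@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA"
# Hypothetical duplicate header on a 2xxx tile, made up for this example.
duplicate = "@VP2-06:112:H7LNDMCVY:2:2105:16300:3100 1:N:0:TCCGGAGA+GGGTCTGA"

row = duplicate_row(initial, duplicate)
print(row)
# Pairs whose tiles differ in the first digit span the two surfaces.
crosses_surface = surface(position_fields(initial)[0]) != surface(position_fields(duplicate)[0])
print(crosses_surface)
```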


Link to bigger

What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
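Pulling out the triplicate (or higher) sets from a "tile1 x1 y1 tile2 x2 y2" file could be sketched as below. This assumes every duplicate row of a set names the same "initial" well, which may not match how clumpify actually reports them; the function name and the example rows are hypothetical.

```python
from collections import defaultdict

def multi_duplicate_sets(rows, min_size=3):
    """Group 'tile1 x1 y1 tile2 x2 y2' rows by initial well; keep sets of min_size or more wells."""
    groups = defaultdict(list)
    for row in rows:
        t1, x1, y1, t2, x2, y2 = (int(v) for v in row.split())
        groups[(t1, x1, y1)].append((t2, x2, y2))
    # A set's size is the initial well plus all of its duplicates.
    return {k: v for k, v in groups.items() if len(v) + 1 >= min_size}

# Hypothetical rows: one well with two duplicates, one well with a single duplicate.
rows = [
    "1105 100 100 1105 120 400",
    "1105 100 100 1105 90 700",
    "1105 500 500 1105 510 520",
]
sets = multi_duplicate_sets(rows)
print(sets)
```

With the sets in hand, the offsets of each duplicate from its initial well can be compared to see whether they point in a consistent direction.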

Last edited by GW_OK; 03-20-2017 at 09:16 AM.
Tags
illumina, novaseq
