SEQanswers

Old 03-20-2017, 06:25 AM   #61
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

Finally, I graphed duplicate read coordinates relative to the initial read coordinates (x1-x2/y1-y2 from the file I made above). I've attached that file below since it's not tremendously large. Most clump fairly close together, as others have shown, but there does seem to be a Y-bias to my eyes. Perhaps this means that there is some merit to the direction of flow duplicate theory.
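
For anyone wanting to reproduce this kind of plot, here is a minimal sketch of how the offsets could be computed. It assumes the standard Illumina FASTQ read-name layout (instrument:run:flowcell:lane:tile:x:y); the function names and the example read names are made up for illustration and are not taken from the file described above.

```python
from typing import NamedTuple, Tuple


class ReadPos(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ReadPos:
    """Extract lane, tile, x, y from an Illumina-style read name."""
    # Drop the leading '@' and the comment after the first space.
    fields = name.lstrip("@").split(" ")[0].split(":")
    lane, tile, x, y = (int(v) for v in fields[3:7])
    return ReadPos(lane, tile, x, y)


def offset(original: str, duplicate: str) -> Tuple[int, int]:
    """Return (x1 - x2, y1 - y2) for two reads on the same tile."""
    a, b = parse_read_name(original), parse_read_name(duplicate)
    if (a.lane, a.tile) != (b.lane, b.tile):
        raise ValueError("reads are on different tiles; offsets are not comparable")
    return a.x - b.x, a.y - b.y


# Hypothetical read names, for illustration only.
first = "@A00001:1:HXXXXXXXX:1:1105:1000:2000 1:N:0:ATCACG"
dup = "@A00001:1:HXXXXXXXX:1:1105:1010:2350 1:N:0:ATCACG"
print(offset(first, dup))
```

Since there is no way to tell which read of a pair came first, the sign of each offset is arbitrary; only the spread along each axis is meaningful.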
Attached Images
File Type: png plot3.png (13.8 KB, 49 views)
Old 03-20-2017, 06:36 AM   #62
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

Quote:
Originally Posted by GenoMax View Post
@GW_OK: You may have missed this post from QC Fail. They did something similar.

I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.

There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.

So, yeah. I didn't miss that qcfail blog post. I've read through it several times. I wanted to recapitulate their analysis on a data set that:
(A) had not undergone PCR amplification,
(B) was across an entire tile, not just a small region of a tile, and
(C) was performed by Illumina and/or someone with a vested interest in having their data set show the theoretical "best" of what the machine can do. It's all well and good to throw a library on two machines but I don't know what that library looked like prior to loading.

I did want to use spantiles to demonstrate the 'mode' of duplication. Are the duplicates moving from well to well, or across the whole tile, or from tile to tile and surface to surface? Based off what I've seen here they're not just moving across interconnected wells.

I picked dupedist 2500 based solely on what people have used for the 4000, as given in the clumpify thread in the bioinformatics subforum.
Old 03-20-2017, 06:49 AM   #63
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,548

Can you confirm that the three posts from today use data from NovaSeq? (Is that data PCR-free? I don't recollect.) spantiles=t actually spans across all tiles. This was essential to capture the edge-duplicate effect that appears to be specific to NextSeq flowcells.

Since the NovaSeq flowcells should have more nanowells (3-4x?), using a distance of 2500 is probably not optimal (though as I said before we have not been able to pin a distance down based on the data available).

I think Illumina is sampling this library so deeply that we are starting to see duplicates across the FC/tiles just because there are only so many sequenceable fragments in the library. I have tried to test this by pooling two lanes of NovaSeq data together to see whether the number of clumps goes up appreciably. Unfortunately I have not been able to get clumpify to work with this pooled data (and @Brian has not had a chance to look at why that is happening).
Old 03-20-2017, 06:52 AM   #64
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

It's from the BaseSpace project
Code:
NovaSeq: WGS TruSeq PCR-Free 450 (6plex)
So I reckon it must be PCR-free.

The sample is NA12878-rep1, and I'm only looking at data from lane 1.
Old 03-20-2017, 07:41 AM   #65
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

Regarding sampling depth, I am dubious.

The TruSeq PCR-free protocol (which I have to assume they're using) has you start with 1 ug for 350 bp inserts and 2 ug for 550 bp inserts. Since they say they're using 450 bp inserts I'll split the difference and say they started with 1.5 ug of DNA. The entire human genome weighs 3.6 pg (as per IDT) so that 1.5 ug is ~417k human genome equivalents. If they shear it on a Covaris (which is fairly random), a duplicate fragment would require two copies of the genome shearing at the exact same base pair on both ends.

Then, from the fairly random fragment assortment of ~417k genomes you then take ~1.8E10 molecules (assuming they loaded at 200pM) and from that sample 599M molecules.

Someone with more statistical chops than me on a Monday morning can do the actual math but I have a feeling we're not close to oversampling these libraries. I could be wrong, though, so don't hold me to it.
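
Not the full statistics, but here is a minimal sketch that at least reproduces the back-of-the-envelope numbers above. The 150 uL loading volume is an assumption (the post does not state it); it is the volume that gives roughly 1.8E10 molecules at 200 pM.

```python
AVOGADRO = 6.022e23  # molecules per mole

# Figures quoted in the post above.
input_dna_g = 1.5e-6          # 1.5 ug of input DNA
genome_mass_g = 3.6e-12       # 3.6 pg per human genome (the IDT figure cited)
loading_conc_molar = 200e-12  # 200 pM loading concentration
clusters_sampled = 599e6      # molecules actually sequenced

# ASSUMPTION: 150 uL loading volume, not stated in the post.
loading_volume_l = 150e-6

genome_equivalents = input_dna_g / genome_mass_g
molecules_loaded = loading_conc_molar * loading_volume_l * AVOGADRO
fraction_sampled = clusters_sampled / molecules_loaded

print(f"genome equivalents: {genome_equivalents:,.0f}")
print(f"molecules loaded:   {molecules_loaded:.2e}")
print(f"fraction sampled:   {fraction_sampled:.1%}")
```

This only covers the loading arithmetic. Whether two independently sheared fragments end up identical depends on the fragment-size distribution, which is not modelled here.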

Last edited by GW_OK; 03-20-2017 at 07:44 AM.
Old 03-20-2017, 08:12 AM   #66
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,548

I was thinking about "cluster-able/sequence-able" fragments present in the library. In any case this is just a hypothesis and your calculations may be spot-on.

We are approaching data volumes never seen before (e.g. there are 1.6 B reads in the largest example data lane and this could potentially go up 4x with S4 cells).

Last edited by GenoMax; 03-20-2017 at 08:55 AM.
Old 03-20-2017, 08:39 AM   #67
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

Maybe we need to start talking in terms of "library coverage" instead of "genome coverage"...

Also, don't confuse read numbers with cluster numbers. There'll be 2 reads to every cluster. And on this NovaSeq data you'll have to split their cluster counts across both lanes.
Old 03-20-2017, 09:55 AM   #68
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 421

Thanks for the plots, GW_OK!
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 03-20-2017, 01:39 PM   #69
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

A few more interesting points.

The tile map for the S2 flowcell:
-The first digit is the lane number: 1 or 2.
-The second digit represents the surface: 1 for top or 2 for bottom.
-The third digit represents the swath number: 1, 2, 3, or 4.
-The last 2 digits represent the tile number, 01 through 88. Tile numbering starts with 01 at the outlet end of the flow cell through 88 at the inlet end.

The stuff on BaseSpace is from a pre-release flowcell but I think we can assume the tiling map holds true except their tiles range from 05 to 90.
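
The tile map above can be sketched as a small parser. This is an illustration only: `Tile` and `parse_tile` are hypothetical names, and treating a four-digit ID (such as the 1105 and 1240 used in this thread) as the same scheme without the leading lane digit is an assumption on my part.

```python
from typing import NamedTuple, Optional


class Tile(NamedTuple):
    lane: Optional[int]  # None when the ID carries no lane digit
    surface: int         # 1 = top, 2 = bottom
    swath: int           # 1-4
    number: int          # tile position along the lane


def parse_tile(tile_id: str) -> Tile:
    """Split a tile ID into lane / surface / swath / tile number."""
    if not tile_id.isdigit() or len(tile_id) not in (4, 5):
        raise ValueError(f"unexpected tile ID: {tile_id!r}")
    # ASSUMPTION: a 4-digit ID is the same scheme without the lane digit.
    lane = int(tile_id[0]) if len(tile_id) == 5 else None
    surface, swath, number = int(tile_id[-4]), int(tile_id[-3]), int(tile_id[-2:])
    if surface not in (1, 2) or swath not in (1, 2, 3, 4):
        raise ValueError(f"surface/swath out of range in {tile_id!r}")
    return Tile(lane, surface, swath, number)


print(parse_tile("11105"))
print(parse_tile("1240"))
```

The tile number is deliberately not range-checked, since the pre-release flowcell runs from 05 to 90 rather than 01 to 88.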

Mapping the number of intra-tile, inter-tile, and inter-surface-tile duplicates shows that libraries will jump tiles in the direction of flow but less so horizontally or diagonally. There's also a large amount of inter-surface jumping from one surface to the direct opposite. My previous analysis was done on tile 1105, which you can see in the table below in the upper left corner. I also mapped the duplicates in a more centrally located tile (1240) where the duplications are more pronounced surrounding the tile in question.

This seems to reinforce the previous observations that most duplicates stay (relatively) close to their origin with some bias in the Y direction. However, there appears to be a Z-axis bias as well.
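
A rough sketch of how duplicate pairs might be binned into these categories. It assumes four-digit tile IDs (surface, swath, two-digit tile) within a single lane; the pair list is invented for illustration and `classify_pair` is a hypothetical helper, not the method actually used for the tables.

```python
from collections import Counter


def classify_pair(tile_a: str, tile_b: str) -> str:
    """Label a duplicate pair by how its two tiles relate (4-digit IDs)."""
    surf_a, rest_a = tile_a[0], tile_a[1:]
    surf_b, rest_b = tile_b[0], tile_b[1:]
    if tile_a == tile_b:
        return "intra-tile"
    if surf_a == surf_b:
        return "inter-tile, same surface"
    if rest_a == rest_b:
        # Same swath and tile number on the other surface.
        return "inter-surface, directly opposite"
    return "inter-surface, other"


# Made-up pairs: (tile of first read, tile of its duplicate).
pairs = [("1105", "1105"), ("1105", "1106"), ("1105", "2105"), ("1240", "2341")]
counts = Counter(classify_pair(a, b) for a, b in pairs)
for label, n in counts.items():
    print(label, n)
```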
Attached Images
File Type: png Screen Shot 2017-03-20 at 3.15.44 PM.png (105.7 KB, 40 views)
File Type: png Screen Shot 2017-03-20 at 3.15.27 PM.png (81.0 KB, 28 views)

Last edited by GW_OK; 03-20-2017 at 01:49 PM.
Old 03-21-2017, 11:09 AM   #70
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by GW_OK View Post
What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
That's downright strange...
Old 03-22-2017, 12:55 PM   #71
misterc
Member
 
Location: Livermore, CA

Join Date: Jan 2016
Posts: 21

Does anyone have even a lane's worth of these new .cbcl files from a NovaSeq? I'd like to test our bioinformatics pipeline with the new bcl2fastq converter v.2.19 that supports NovaSeq.
Old 03-22-2017, 12:59 PM   #72
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,548

Quote:
Originally Posted by misterc View Post
Does anyone have even a lane's worth of these new .cbcl files from a NovaSeq? I'd like to test our bioinformatics pipeline with the new bcl2fastq converter v.2.19 that supports NovaSeq.
Illumina does not appear to have made the input files for NovaSeq data available on BaseSpace. Just the outputs.
Old 03-22-2017, 05:11 PM   #73
austinso
Member
 
Location: Bay area

Join Date: Jun 2012
Posts: 71

On another note:

150 uL of a 1 nM library (~90 billion molecules) minimum for loading is a lot of library when you consider you can get by with 1.4 billion for the NextSeq and 7 billion for the HiSeq.

FWIW...
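
For reference, the ~90 billion figure follows from volume times concentration times Avogadro's number. A minimal sketch (the `molecules` helper is a hypothetical name; the NextSeq and HiSeq inputs are the figures quoted above):

```python
AVOGADRO = 6.022e23  # molecules per mole


def molecules(volume_ul: float, conc_nm: float) -> float:
    """Number of molecules in volume_ul microlitres at conc_nm nanomolar."""
    moles = (volume_ul * 1e-6) * (conc_nm * 1e-9)
    return moles * AVOGADRO


novaseq_min = molecules(150, 1.0)
print(f"NovaSeq minimum load: {novaseq_min:.2e} molecules")
print(f"relative to NextSeq (1.4 billion): {novaseq_min / 1.4e9:.1f}x")
print(f"relative to HiSeq (7 billion):     {novaseq_min / 7e9:.1f}x")
```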
Old 03-27-2017, 11:25 PM   #74
misterc
Member
 
Location: Livermore, CA

Join Date: Jan 2016
Posts: 21

Is 150ul of a 1nM library what Illumina recommends for a single S2 flow cell?!?
Old 03-28-2017, 05:43 AM   #75
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,237

Quote:
Originally Posted by austinso View Post
On another note:

150 uL of a 1 nM library (~90 billion molecules) minimum for loading is a lot of library when you consider you can get by with 1.4 billion for the NextSeq and 7 billion for the HiSeq.

FWIW...
Okay I take your point, but an S2 should produce 3 billion clusters per flowcell, whereas a HiSeq 2500 produces about 1.6 billion with v3 chemistry. So the NovaSeq is about 7x less efficient than the HiSeq 2500 in this regard.

A NextSeq produces about 0.4 billion clusters per flowcell. So, the relative efficiencies would be:

(I'm using PF clusters per flowcell / ~number of input amplicon molecules)
HiSeq2500v3 = 1.6/7 = 23%
NextSeq = 0.4/1.4 = 29%
NovaSeqS2 = 3/90 = 3.3%
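
The same arithmetic as a small sketch, so other instruments can be dropped in. The numbers are exactly the ones quoted in this thread (in billions); nothing here is measured.

```python
# (PF clusters per flowcell, input molecules), both in billions, as quoted above.
platforms = {
    "HiSeq2500v3": (1.6, 7.0),
    "NextSeq": (0.4, 1.4),
    "NovaSeqS2": (3.0, 90.0),
}

# Efficiency = PF clusters obtained per molecule loaded.
efficiency = {
    name: clusters / loaded for name, (clusters, loaded) in platforms.items()
}

for name, eff in efficiency.items():
    print(f"{name:<12} {eff:.1%}")
```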

So, it absolutely looks like a much lower efficiency of clustering on the NovaSeq. (Anyone know if this is also the case for the HiSeq3000/4000?)

That said, how much difference will this make for most runs? If you use the standard HiSeq2500 method, you start with 10ul of a 2nM library pool for denaturation. Since it gets diluted down to 20 pM (at least) you end up with 1 ml for each denaturation you do. One denaturation could be used to cluster all 8 lanes of the flowcell. But how often does that happen?
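
The 1 ml figure is just C1 * V1 = C2 * V2. A tiny sketch, with a hypothetical helper name:

```python
def diluted_volume_ul(stock_volume_ul: float, stock_conc_pm: float,
                      final_conc_pm: float) -> float:
    """Final volume after dilution, from C1 * V1 = C2 * V2."""
    return stock_volume_ul * stock_conc_pm / final_conc_pm


# 10 ul of a 2 nM (2000 pM) pool diluted to 20 pM.
final_ul = diluted_volume_ul(10, 2000, 20)
print(final_ul, "ul")
```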

For us, I can't think of a single case where we have clustered more than 2-3 lanes per denatured sample pool. Usually it is 8 sample pools for 8 lanes.

There are cases where the amount of library produced is limiting. And the NovaSeq would not be a good choice where this is your critical parameter.

So in most cases I would say the major limitations of the NovaSeq are being forced from 8 lanes to 1 lane and losing the flexibility to run a much smaller flowcell (as with the rapid-chemistry 2-lane flow cells).

Illumina expects you to just buy a NextSeq to deal with the 2nd issue above. That would be okay (for some definitions of "okay") if they hadn't just decided all the NextSeqs should now have the ability to scan their microarrays. But the option is there.

Then there are the data issues considered in this thread. But I'm pretty sure that is something Illumina can fix (as they had for a period of time with the NextSeq, just after they introduced the v2 version of its chemistry/software) if they focus their attention on it.

--
Phillip
Old 03-28-2017, 06:55 AM   #76
GW_OK
Senior Member
 
Location: Oklahoma

Join Date: Sep 2009
Posts: 383

I don't know if you can truly compare efficiencies of the ExAmp chemistry with the other instruments.

On the HiSeq and NextSeq instruments you are randomly clustering across the flowcell with a good correlation between how much DNA you load and how many clusters are produced.

On the ExAmp instruments there are only a fixed number of wells in which clusters can be formed. Additionally, you have to deal with the duplicates coming out of those wells and those duplicates that are formed in solution prior to the library going onto the flowcell.

I think what Illumina is trying to do in ExAmp is saturate the array as practically as possible.

No argument, though, about the loss of flexibility with the NovaSeq. In its current iteration it's not something useful for an all-comers core lab.
Old 04-01-2017, 12:13 PM   #77
austinso
Member
 
Location: Bay area

Join Date: Jun 2012
Posts: 71

Quote:
Originally Posted by misterc View Post
Is 150ul of a 1nM library what Illumina recommends for a single S2 flow cell?!?
Apparently for all of them. And that is the lower end (see pg. 16 of the attached guide).
Attached Files
File Type: pdf novaseq-6000-system-guide-1000000019358-01.pdf (1.02 MB, 14 views)
Old 04-01-2017, 12:49 PM   #78
austinso
Member
 
Location: Bay area

Join Date: Jun 2012
Posts: 71

Quote:
Originally Posted by pmiguel View Post
Okay I take your point, but an S2 should produce 3 billion clusters per flowcell, whereas a HiSeq 2500 produces about 1.6 billion with v3 chemistry. So the NovaSeq is about 7x less efficient than the HiSeq 2500 in this regard.

A NextSeq produces about 0.4 billion clusters per flowcell. So, the relative efficiencies would be:

(I'm using PF clusters per flowcell / ~number of input amplicon molecules)
HiSeq2500v3 = 1.6/7 = 23%
NextSeq = 0.4/1.4 = 29%
NovaSeqS2 = 3/90 = 3.3%

So, it absolutely looks like a much lower efficiency of clustering on the NovaSeq. (Anyone know if this is also the case for the HiSeq3000/4000?)
Re: 3000/4000
From what I could glean, based on the published specs (which are really vague, perhaps on purpose), the amount of library loaded ranges between 3-9 billion.

The yield is 0.75 billion to ??? billion (I think those that use these should chime in; it is not clear whether the total yields stated are per flow cell or for both flow cells).

Mind you, the % efficiencies (as you've defined them) are way better than the MiSeq (0.3-0.4%) and the MiniSeq (1-5%).

Quote:
That said, how much difference will this make for most runs? If you use the standard HiSeq2500 method, you start with 10ul of a 2nM library pool for denaturation. Since it gets diluted down to 20 pM (at least) you end up with 1 ml for each denaturation you do. One denaturation could be used to cluster all 8 lanes of the flowcell. But how often does that happen?

For us, I can't think of a single case where we have clustered more than 2-3 lanes per denatured sample pool. Usually it is 8 sample pools for 8 lanes.

There are cases where the amount of library produced is limiting. And the NovaSeq would not be a good choice where this is your critical parameter.

So in most cases I would say the major limitations of the NovaSeq are being forced from 8 lanes to 1 lane and losing the flexibility to run a much smaller flowcell (as with the rapid-chemistry 2-lane flow cells).

Illumina expects you to just buy a NextSeq to deal with the 2nd issue above. That would be okay (for some definitions of "okay") if they hadn't just decided all the NextSeqs should now have the ability to scan their microarrays. But the option is there.

Then there are the data issues considered in this thread. But I'm pretty sure that is something Illumina can fix (as they had for a period of time with the NextSeq, just after they introduced the v2 version of its chemistry/software) if they focus their attention on it.
I'm not sure that they can improve the % efficiency...it seems like ~30% is about the best you can recover in reads. This would explain why you need more library to get more reads in the NovaSeq.

Mind you 30% is not bad...it is an interesting threshold when you think about occupancy in space.

Cheers, A.
Old 07-14-2017, 05:01 AM   #79
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187

Forgive this really basic question, but what is the cause of the duplicates on patterned flow cells as opposed to the older HiSeq2500 approach? Is this due to the density of the clusters and the likelihood of a library molecule detaching and then re-attaching a short distance away? Also, how is this different from a PCR duplicate? Is there any way to tell other than spatial relatedness (prediction based on XY locale)?
Old 07-14-2017, 05:30 AM   #80
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,548

@cement_head: See if this blog post helps.
Tags
illumina, novaseq
