Brian, we were talking about this and wondered if you could test the breakage model by looking at the location of duplicates. This would be happening during flow, right? So, if it is breakage then the duplicates should all happen in the direction of flow, with little orthogonal movement. Could you either pull up sets of duplicates and look at the coordinates, or add separate dimension distances for checking for dups and then allow a short pixel distance in one dimension and long in the other and vice versa and see how that affects the results?
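A minimal sketch of the second idea, assuming the duplicate pairs have already been pulled out with their X/Y coordinates. The pair list, the function name `count_pairs_within`, and the thresholds are all made up for illustration; this is not how Clumpify does it internally.

```python
def count_pairs_within(pairs, max_dx, max_dy):
    """Count duplicate pairs whose offsets fit inside a max_dx by max_dy box."""
    return sum(
        1
        for (x1, y1), (x2, y2) in pairs
        if abs(x1 - x2) <= max_dx and abs(y1 - y2) <= max_dy
    )

# Hypothetical coordinates of duplicate pairs on one tile, in pixel units.
pairs = [
    ((1000, 2000), (1010, 3500)),  # displaced mostly along Y
    ((5000, 5000), (5020, 6200)),  # displaced mostly along Y
    ((8000, 1000), (9500, 1015)),  # displaced mostly along X
    ((3000, 3000), (3005, 3008)),  # close neighbours in both axes
]

short_dist, long_dist = 50, 2500  # illustrative thresholds only

# Short in X, long in Y: keeps pairs that drifted along Y.
narrow_x = count_pairs_within(pairs, max_dx=short_dist, max_dy=long_dist)
# Long in X, short in Y: keeps pairs that drifted along X.
narrow_y = count_pairs_within(pairs, max_dx=long_dist, max_dy=short_dist)

print(narrow_x, narrow_y)
```

If duplicates really follow the flow, one of the two counts should be much larger than the other on real data; similar counts would argue against a directional effect.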
-
Oh, that's an interesting suggestion... theoretically, they shouldn't go upstream at all, or sideways very far. It might not be possible to differentiate between upstream and downstream duplicates (though theoretically, the downstream ones should have a weaker signal), but I can certainly add the ability to differentiate between the X and Y axis. I'll do that and post here when it's done.
I'd imagine that they should make a kind of cone-shaped pattern like the debris field from an airplane crash or tornado, but plotting that kind of thing is tricky since it's an all-or-nothing proposition that doesn't let you see the diminishing probability over the region.
Last edited by Brian Bushnell; 03-03-2017, 03:15 PM.
Comment
-
Hasn't the plotting more or less been done in this blog post? https://sequencing.qcfail.com/articl...ted-sequences/ I had posted this over in the clumpify thread.
I am wondering if the odd FC-wide duplicates are showing up due to oversampling of libraries (especially for NovaSeq data). Am I completely off-target in suggesting that as a possible cause?
Comment
-
GenoMax, I was looking at that blog post again and I thought, but couldn't be sure, that the HS4000 optical duplicates had a bias along the Y axis. I was hoping Brian could replicate that or not. The blog post also noted, "Significantly, 99% of HiSeq 4000 duplicates comprised di-tags originating from the same tile" which seems to be in contrast to the NovaSeq plot Brian produced with its steady increase in duplicates over long distances. Maybe the seeding of fragments onto a NovaSeq's flow cell is different and the problem is greater? But your link is clearly relevant!
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Comment
-
Originally posted by Brian Bushnell:
I wonder if this is a fundamental limitation of patterned flowcells, and made more pronounced as the dots shrink. When the colony is growing, once a dot is filled, the amplification continues but there is nowhere for the clones on the edges to attach, so some of them break off and drift around. In that case, presumably increasing the loading concentration would reduce the duplicate rate...
Comment
-
Originally posted by Brian Bushnell:
I'm not really sure about the causes, but they do not seem to correspond with the artifacts associated with oversampling.
As we add more lanes there is diminishing return of new fragments. If that happens then we are basically capturing all sequenceable fragments that are in this library?
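One rough way to look at that diminishing return is to count how many previously unseen fragments each additional lane contributes. The sketch below assumes each fragment has already been reduced to some identity key (for example a sequence or alignment position); the letters standing in for fragments and the name `new_fragments_per_lane` are hypothetical.

```python
def new_fragments_per_lane(lanes):
    """For each lane, in order, count the fragments not seen in any earlier lane."""
    seen = set()
    gains = []
    for fragments in lanes:
        unique_here = set(fragments)
        gains.append(len(unique_here - seen))
        seen |= unique_here
    return gains

# Hypothetical lanes; each letter stands for one distinct library fragment.
lanes = [
    ["A", "B", "C", "D"],
    ["A", "B", "E", "F"],
    ["A", "C", "E", "G"],
    ["B", "D", "F", "G"],
]

gains = new_fragments_per_lane(lanes)
print(gains)  # [4, 2, 1, 0]
```

A gain that falls towards zero would be consistent with the library being close to exhausted, though it does not by itself separate oversampling from flowcell-related duplication.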
Comment
-
The NovaSeq read headers look like this:
Code:
@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA
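For anyone who wants to pull the coordinates out themselves, here is a small sketch that splits a header of this form into its colon-separated fields (instrument, run, flowcell, lane, tile, X, Y). The `ReadPosition` type and `parse_header` function are names invented for this example.

```python
from typing import NamedTuple

class ReadPosition(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int

def parse_header(header: str) -> ReadPosition:
    """Extract flowcell, lane, tile, and X/Y from an Illumina-style read header."""
    # Drop the part after the space (read number, filter flag, index).
    name = header.split()[0].lstrip("@")
    _instrument, _run, flowcell, lane, tile, x, y = name.split(":")
    return ReadPosition(flowcell, int(lane), int(tile), int(x), int(y))

pos = parse_header("@VP2-06:112:H7LNDMCVY:2:1105:16224:3004 1:N:0:TCCGGAGA+GGGTCTGA")
print(pos)
```

For the header above this gives lane 2, tile 1105, X 16224 and Y 3004.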
Comment
-
Well, since I have a 3000 (but no extant PCR-free data) I wanted to look at the public 4000 data from Basespace, specifically:
NA12878-PCRfree450_S3_L003_R1_001.fastq.gz
NA12878-PCRfree450_S3_L003_R2_001.fastq.gz
from the
HiSeq4000: TruSeq PCRfree and Nano (350bp to 550bp insert size)
data set.
From what I can see in my survey of the fastq headers all of the X coordinates are set at '0', hence my confusion.
Edit:
Focusing on tile 1101, the headers go from
@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3
to
@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3
Last edited by GW_OK; 03-17-2017, 05:34 AM.
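A quick way to run this kind of survey over many headers is to track the minimum and maximum X and Y seen per tile. This is only a sketch: the function name `coordinate_ranges` is made up, and the two headers are the ones quoted above rather than a full file.

```python
def coordinate_ranges(headers):
    """Return {tile: (min_x, max_x, min_y, max_y)} for Illumina-style headers."""
    ranges = {}
    for header in headers:
        fields = header.split()[0].split(":")
        tile, x, y = fields[4], int(fields[5]), int(fields[6])
        if tile not in ranges:
            ranges[tile] = (x, x, y, y)
        else:
            min_x, max_x, min_y, max_y = ranges[tile]
            ranges[tile] = (min(min_x, x), max(max_x, x), min(min_y, y), max(max_y, y))
    return ranges

headers = [
    "@196:2371:H7MF5BBXX:3:1101:0:15712 1:N:0:3",
    "@196:2371:H7MF5BBXX:3:1101:0:4312392 1:N:0:3",
]

ranges = coordinate_ranges(headers)
print(ranges)
```

A tile whose minimum and maximum X are both 0, as here, is the situation described in the post: the X field carries no positional information in that file.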
Comment
-
The forums aren't letting me post a big post so I'm going to break this into three posts.
I've been intrigued with the question of duplicate-well directionality. Does it follow the direction of reagent flow? Setting aside the 4000 data set for a bit I moved over to the NovaSeq data, specifically NA12878-rep1. I pulled down the fastq files from BaseSpace and decided to initially plot (using ggplot) the actual XY coordinates for each read just to see what it looked like. To make visualization easier I focused solely on tile 1105. I still had to use a 10000x10000 png to get the wells spaced out enough.
It's pretty cool to look at. You can make out the ring fiducials quite clearly.
There's no way to make out the ordered array, since not every well had a read, though there are what look like tracks of reads.
Comment
-
@GW_OK: You may have missed this post from QC Fail. They did something similar.
I don't think you need to include "spantiles=t" for NovaSeq (or 4000 data). We have been keeping that off. That is a specific issue with NextSeq and the large clusters it has.
There is some oddity about the "dupedist=" setting as well. We have not been able to nail that one down for NovaSeq.
Last edited by GenoMax; 03-20-2017, 05:20 AM.
Comment
-
I then ran clumpify
Code:
markduplicates=t dupedist=2500 spantiles=t
What a giant hairball! You can clearly see that there are libraries duplicating in both the horizontal and vertical direction. Another striking thing is just how long some of the lines are. Since I don't think there's an a priori way of telling which well came first I refrained from assuming directionality. That being said, I'm looking at triplicate (or higher) wells to see if there is a "shotgun" pattern that could indicate a directional "spray".
Last edited by GW_OK; 03-20-2017, 09:16 AM.
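One possible way to score triplicate (or larger) groups without assuming which well came first is to fit a principal axis through each group and ask how stretched the group is along it. The sketch below does this with a plain 2x2 covariance calculation; the function name `principal_axis` and the example coordinates are hypothetical, and the angle is an orientation only (0 degrees along X, plus or minus 90 along Y), not an upstream/downstream direction.

```python
import math

def principal_axis(points):
    """Return (angle_degrees, elongation) for a group of (x, y) well coordinates."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    major, minor = half_trace + spread, half_trace - spread
    # Orientation of the major axis relative to the X axis.
    angle = 0.5 * math.degrees(math.atan2(2 * sxy, sxx - syy))
    elongation = major / minor if minor > 1e-12 else math.inf
    return angle, elongation

# Hypothetical triplicate strung out along Y with a little sideways jitter.
group = [(100, 100), (102, 600), (98, 1100)]
angle, elongation = principal_axis(group)
print(round(angle, 1), round(elongation))
```

A high elongation with angles clustering around one orientation across many groups would look like a directional spray; low elongation or scattered angles would look more like the hairball.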
Comment