SEQanswers

Old 04-01-2016, 05:49 AM   #1
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Question Optical duplicates Hiseq4000

Dear all,

I am working with RNA data sequenced on the HiSeq 4000 sequencer. I am trying to quantify the number of "optical duplicates" or "clustering duplicates". These duplicates appear when reads in nearby wells result from secondary ExAmp seeding from a primary well when concentrations are sub-optimal.

I used MarkDuplicates (Picard 2.1.1) and followed this procedure: http://gatkforums.broadinstitute.org...swithmatecigar

But each time, MarkDuplicates finds "0 optical duplicate clusters"...

I tested two alignment tools, TopHat and BWA, but each time MarkDuplicates finds no optical duplicates.

I tried this on 96 samples.

Do you have any idea why I cannot find any optical duplicates?

Thank you very much
Old 04-01-2016, 06:24 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Can you provide some additional information? Is this a PE dataset? What was the PF% for the lanes (I assume these 96 samples came from one flowcell)? What is the alignment % for each of the aligners you have used?
Old 04-04-2016, 12:02 AM   #3
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Default

Thank you GenoMax for your answer.

- It is a 50bp single-end dataset
- Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples
- Using TopHat, the percentage of mapped reads ranges from 73.3% to 96.4%, with a median of 93.5%.
- I used BWA only on one sample: I found that 93.3% of reads mapped to the reference genome

Thank you in advance for your help
Old 04-04-2016, 05:01 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

That seems a bit odd. Based on the training for HiSeq 4000, we were told that the sweet spot for PF is around 70%. Anything higher (once you get closer to 75%) would indicate that there will be a lot of duplicates.

When running Picard MarkDuplicates did you adjust the OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 as recommended in the link you had posted above?

Perhaps you got lucky (and/or you have a library of excellent quality) and there are no duplicates. Though that seems a bit too good to be true.
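
To make the pixel-distance setting concrete, here is a minimal sketch of how two reads that are already marked as duplicates might be classified as "optical". It assumes Illumina-style read names whose last four colon-separated fields are lane, tile, x and y, and it assumes a per-axis comparison against the distance threshold. The read names are hypothetical, and this is an illustration of the idea rather than Picard's actual implementation (which also takes the read group into account).

```python
from typing import NamedTuple


class ClusterPos(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_cluster_pos(read_name: str) -> ClusterPos:
    """Read lane, tile, x and y from the last four colon-separated fields.

    Assumes names shaped like instrument:run:flowcell:lane:tile:x:y.
    """
    lane, tile, x, y = read_name.split(":")[-4:]
    return ClusterPos(int(lane), int(tile), int(x), int(y))


def is_optical_pair(name_a: str, name_b: str, pixel_distance: int = 2500) -> bool:
    """True if both clusters sit on the same lane and tile and are
    within pixel_distance of each other on both axes."""
    a, b = parse_cluster_pos(name_a), parse_cluster_pos(name_b)
    return (
        a.lane == b.lane
        and a.tile == b.tile
        and abs(a.x - b.x) <= pixel_distance
        and abs(a.y - b.y) <= pixel_distance
    )


# Hypothetical read names: same tile, 500 px apart in x and 100 px in y.
near_a = "M1:7:FC1:1:1101:1000:2000"
near_b = "M1:7:FC1:1:1101:1500:2100"
other_tile = "M1:7:FC1:1:1102:1000:2000"

print(is_optical_pair(near_a, near_b))                      # within 2500 px
print(is_optical_pair(near_a, near_b, pixel_distance=100))  # too far for 100 px
print(is_optical_pair(near_a, other_tile))                  # different tile
```

With a threshold of 100 (which, as far as I know, is the tool's default) the first pair would not count as optical, which is why the larger value is suggested for patterned flowcells.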
Old 04-04-2016, 05:03 AM   #5
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Default

Thank you for your answer.

Yes, I set it to 2500, as indicated in the link.

As you say, I find it's a little too good to be true...
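
For anyone following along, the setting discussed above would be passed roughly as in the sketch below. It only builds the command line and does not run it. The file names and the `picard.jar` path are hypothetical placeholders; the `NAME=value` argument style is the one used by Picard 2.x releases of that era.

```python
def mark_duplicates_command(
    input_bam: str,
    output_bam: str,
    metrics_file: str,
    pixel_distance: int = 2500,
    picard_jar: str = "picard.jar",  # hypothetical path to the Picard jar
) -> list:
    """Build (but do not execute) a MarkDuplicates command line."""
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        f"OPTICAL_DUPLICATE_PIXEL_DISTANCE={pixel_distance}",
    ]


# Hypothetical file names, for illustration only.
cmd = mark_duplicates_command(
    "sample.sorted.bam", "sample.dedup.bam", "sample.dup_metrics.txt"
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.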
Old 04-04-2016, 05:10 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Have you contacted tech support? It may be worth getting their take on this.

I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?
Old 04-04-2016, 06:07 AM   #7
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by Nebetbastet
- Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples

Quote:
Originally Posted by GenoMax
Have you contacted tech support? It may be worth getting their take on this.

I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?

This is just a reporting quirk when you run Bcl2fastq without using the "--with-failed-reads" option. Since it is only converting and demultiplexing PF reads, it reports them as 100% PF.

NOTE: This is true for Bcl2fastq v1.8.4. I have never tested the newer, 2.x versions of Bcl2fastq.
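
If that explanation holds, the arithmetic is simple: the percentage is computed against whatever clusters the converter actually saw. A minimal sketch, with made-up cluster counts:

```python
def percent_pf(pf_clusters: int, clusters_seen: int) -> float:
    """Percentage of the clusters seen that passed the chastity filter."""
    return 100.0 * pf_clusters / clusters_seen


# Hypothetical counts, for illustration only.
raw_clusters = 1_000_000
pf_clusters = 710_000

# Measured against all raw clusters on the flowcell.
true_pf = percent_pf(pf_clusters, raw_clusters)

# If failed reads are never converted, the converter only ever sees PF
# clusters, so numerator and denominator are the same number.
reported_pf = percent_pf(pf_clusters, pf_clusters)

print(true_pf, reported_pf)
```

So a 100% figure from the conversion step says nothing about the real pass-filter rate of the run.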
Old 04-04-2016, 06:16 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

It would be odd if bcl2fastq v.2 was run with the "--with-failed-reads" option, but that may be a logical explanation for the 100% PF observation.
Old 04-19-2016, 04:13 AM   #9
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Default

Hi,

Sorry for my slow reply. I was looking into the 100% PF figure... It turns out this number was wrong. The %PF is 71%.
Old 04-19-2016, 04:24 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

That sounds more logical. Any update on optical duplicates? I have not been able to replicate the settings recommended on the GATK site for a small number of samples I have tried.

See this for an update on how samtools/GATK may handle this in future.
Old 04-19-2016, 04:30 AM   #11
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Default

No, no update.
Thank you for the link to this discussion!
Old 05-18-2016, 02:38 AM   #12
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Default

Hi,
I figured out what my problem was. It is quite trivial, but I am letting you know in case someone runs into the same problem...

I used single-end data (most of the projects in my team are single-end). I just noticed that MarkDuplicates needs paired-end data. I read the documentation too quickly and simply assumed that MarkDuplicates could detect optical duplicates using both single-end and paired-end data.

I just ran it on paired-end data and I could detect "optical" duplicates!
Old 05-18-2016, 04:16 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Where does it say that paired-end reads are required for this procedure (unless I am missing something)?

The tutorial you originally linked does say the following:

Quote:
For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate.
Old 05-18-2016, 04:29 AM   #14
Nebetbastet
Junior Member
 
Location: France

Join Date: Apr 2016
Posts: 7
Default

In the command line overview, I can read:

Quote:
Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5' positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see BARCODE_TAG option).

When I read that, I thought "OK, it is not stated clearly, but it seems to need paired-end data, as there is no mention of single-end reads". And when I used paired-end reads, it worked (i.e., I found optical duplicates).

But indeed, the tutorial says that single-end reads can be used... In fact, when I used single-end reads, duplicates were found (which means MarkDuplicates can use single-end reads to detect duplicates...), but MarkDuplicates was unable to find "optical duplicates" (in all the samples of all the single-end datasets I used). It's quite confusing :s .

I left comments on the tutorial, so maybe I will get some answers.
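
One way to check this on your own data is to look at the metrics file that MarkDuplicates writes. Below is a minimal sketch of a parser, assuming the usual layout (a "## METRICS CLASS" line, then a tab-separated header row, then one row per library, then a blank line). The example content is invented and only shows a subset of columns; note that the optical count column, as far as I know, is named READ_PAIR_OPTICAL_DUPLICATES, i.e. it is defined in terms of read pairs.

```python
def read_duplication_metrics(text: str) -> list:
    """Return one dict per library row from a Picard-style metrics file."""
    lines = text.splitlines()
    start = next(
        i for i, line in enumerate(lines) if line.startswith("## METRICS CLASS")
    )
    header = lines[start + 1].split("\t")
    rows = []
    for line in lines[start + 2:]:
        if not line.strip():  # a blank line ends the metrics table
            break
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


# Invented single-end example: duplicates are found, optical count is 0.
EXAMPLE = "\n".join([
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "\t".join([
        "LIBRARY", "UNPAIRED_READS_EXAMINED", "READ_PAIRS_EXAMINED",
        "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES",
        "READ_PAIR_OPTICAL_DUPLICATES",
    ]),
    "\t".join(["lib1", "1000", "0", "250", "0", "0"]),
    "",
    "## HISTOGRAM\tjava.lang.Double",
])

rows = read_duplication_metrics(EXAMPLE)
for row in rows:
    print(row["LIBRARY"], row["UNPAIRED_READ_DUPLICATES"],
          row["READ_PAIR_OPTICAL_DUPLICATES"])
```

In a single-end run all reads land in the "unpaired" columns, which would be consistent with seeing duplicates but an optical count of zero.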
Old 05-18-2016, 05:16 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Both reads would need to start at identical 5' co-ordinates to be certain that they represent an identical fragment so that makes sense as far as optical duplicates go.
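
A small sketch of why a single 5' end is weaker evidence than two. Fragments are grouped by the 5' position and strand of each sequenced end; the coordinates and fragment names are made up, and this is a toy illustration of the grouping idea, not the tool's code.

```python
from collections import defaultdict


def group_duplicates(fragments):
    """Group fragment names whose sequenced ends share the same
    (chrom, 5' position, strand) for every end that was sequenced."""
    groups = defaultdict(list)
    for name, ends in fragments:
        groups[tuple(ends)].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical paired-end fragments: all three share the read 1 start,
# but frag3 has a different mate start, so it is a different fragment.
paired = [
    ("frag1", (("chr1", 100, "+"), ("chr1", 350, "-"))),
    ("frag2", (("chr1", 100, "+"), ("chr1", 350, "-"))),
    ("frag3", (("chr1", 100, "+"), ("chr1", 420, "-"))),
]

# The same fragments as seen in a single-end run: only read 1 is known.
single = [(name, ends[:1]) for name, ends in paired]

print(group_duplicates(paired))  # frag1 and frag2 only
print(group_duplicates(single))  # all three collapse together
```

With only one end available, frag3 cannot be told apart from the other two, which matches the tutorial's remark that single-end reads are more likely to be called duplicates.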
Tags
clustering duplicate, hiseq4000, markduplicates, optical duplicate

All times are GMT -8.