  • Optical duplicates Hiseq4000

    Dear all,

    I am working with RNA-seq data generated on a HiSeq 4000 sequencer. I am trying to quantify the number of "optical duplicates" or "clustering duplicates". These duplicates arise when a fragment in a primary well seeds nearby wells during ExAmp amplification, which happens when the loading concentration is sub-optimal.

    I used MarkDuplicates (Picard 2.1.1) and followed this procedure: http://gatkforums.broadinstitute.org...swithmatecigar

    But each time, MarkDuplicates finds "0 optical duplicate clusters"...

    I tested two alignment tools, TopHat and BWA, but each time MarkDuplicates finds no optical duplicates.

    I tried this on 96 samples.

    Do you have any idea why I cannot find any optical duplicates?

    Thank you very much

  • #2
    Can you provide some additional information? Is this a PE dataset? What was the PF% for the lanes (I assume these 96 samples came from one flowcell)? What is the alignment % for each of the aligners you have used?


    • #3
      Thank you GenoMax for your answer.

      - It is a 50 bp single-end dataset
      - bcl2fastq reports that the "%PF Clusters" is 100% for all the samples
      - Using TopHat, the percentage of mapped reads ranges from 73.3% to 96.4%, with a median of 93.5%.
      - I used BWA on only one sample: 93.3% of its reads mapped to the reference genome

      Thank you in advance for your help


      • #4
        That seems a bit odd. Based on the training for HiSeq 4000, we were told that the sweet spot for PF is around 70%. Anything higher (once you get closer to 75%) would indicate that there will be a lot of duplicates.

        When running Picard MarkDuplicates, did you set OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 as recommended in the link you posted above?

        Perhaps you got lucky (and/or you have a library of excellent quality) and there are no duplicates. Though that seems a bit too good to be true.
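
For readers following along, here is a minimal sketch of how that setting might be passed to Picard, written as a small Python helper that only builds the argument list. The helper name, the file names, and the `picard.jar` path are hypothetical; `INPUT`, `OUTPUT`, `METRICS_FILE` and `OPTICAL_DUPLICATE_PIXEL_DISTANCE` are MarkDuplicates options, written here in the `KEY=VALUE` style used by Picard 2.x.

```python
from typing import List


def mark_duplicates_command(
    input_bam: str,
    output_bam: str,
    metrics_file: str,
    pixel_distance: int = 2500,      # value discussed in this thread for patterned flowcells
    picard_jar: str = "picard.jar",  # hypothetical path to the Picard jar
) -> List[str]:
    """Build (but do not run) a Picard MarkDuplicates command line."""
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        f"OPTICAL_DUPLICATE_PIXEL_DISTANCE={pixel_distance}",
    ]


# Hypothetical file names, for illustration only.
cmd = mark_duplicates_command("sample.bam", "sample.dedup.bam", "sample.metrics.txt")
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.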


        • #5
          Thank you for your answer.

          Yes, I set it to 2500, as indicated in the link.

          As you say, I find it's a little too good to be true...


          • #6
            Have you contacted tech support? It may be worth getting their take on this.

            I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?


            • #7
              Originally posted by Nebetbastet:
              > - Bcl2fastq tells me that the "%PF Clusters" is 100% for all the samples

              Originally posted by GenoMax:
              > Have you contacted tech support? It may be worth getting their take on this.
              >
              > I am finding 100% PF hard to believe. Are there really 3.2B reads in your dataset? Does the quality look fine?

              This is just a reporting quirk when you run Bcl2fastq without using the "--with-failed-reads" option. Since it is only converting and demultiplexing PF reads, it reports them as 100% PF.

              NOTE: This is true for Bcl2fastq v1.8.4. I have never tested the newer, 2.x versions of Bcl2fastq.
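
The arithmetic behind this quirk can be illustrated with a short sketch. The function name and the cluster counts are hypothetical; the only assumption is that %PF is the share of counted clusters that pass filter, so if failed clusters never enter the count, the ratio is 100% by construction.

```python
def percent_pf(pf_clusters: int, counted_clusters: int) -> float:
    """Percentage of counted clusters that passed the chastity filter."""
    if counted_clusters <= 0:
        raise ValueError("counted_clusters must be positive")
    return 100.0 * pf_clusters / counted_clusters


# Hypothetical lane: 100 raw clusters, 71 of which pass filter.
raw, passed = 100, 71

# Failed clusters included in the denominator: the real pass rate.
print(percent_pf(passed, raw))     # 71.0

# Failed clusters dropped before counting: denominator == numerator.
print(percent_pf(passed, passed))  # 100.0
```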


              • #8
                It would be odd if bcl2fastq v2 was run with the "--with-failed-reads" option, but that may be a logical explanation for the 100% PF observation.


                • #9
                  Hi,

                  Sorry for my slow reply. I was looking into the 100% PF figure. It turns out that number is wrong: the %PF is 71%.


                  • #10
                    That sounds more logical. Any update on optical duplicates? I have not been able to replicate the settings recommended on the GATK site with the small number of samples I have tried.

                    See this for an update on how samtools/GATK may handle this in future.


                    • #11
                      No, no update.
                      Thank you for the link to this discussion!


                      • #12
                        Hi,
                        I figured out what my problem was. It's quite trivial, but I am letting you know in case someone else runs into the same problem...

                        I used single-end data (most of the projects in my team are single-end). I just noticed that MarkDuplicates needs paired-end data. I read the documentation too quickly and simply assumed that MarkDuplicates could detect optical duplicates in both single-end and paired-end data.

                        I just ran it on paired-end data and I could detect "optical" duplicates!


                        • #13
                          Where does it say that paired-end reads are required for this procedure (unless I am missing something)?

                          The tutorial you originally linked does say the following:

                          > For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate.


                          • #14
                            The command-line overview says:

                            > Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5' positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see BARCODE_TAG option).

                            When I read that, I thought "OK, it is not stated clearly, but it seems to need paired-end data, as there is no mention of single-end reads". And when I used paired-end reads, it worked (i.e., I found optical duplicates).

                            But indeed, the tutorial says single-end reads can be used... Actually, when I used single-end reads, duplicates were found (which means MarkDuplicates can use single-end reads to detect duplicates), but MarkDuplicates was unable to find "optical duplicates" (in any sample of any single-end dataset I used). It's quite confusing :s .

                            I left comments on the tutorial, so maybe I will get some answers.


                            • #15
                              Both reads would need to start at identical 5' co-ordinates to be certain that they represent an identical fragment so that makes sense as far as optical duplicates go.
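
To make the two criteria in this discussion concrete, here is a simplified sketch, not Picard's actual implementation. It assumes Illumina-style read names whose last four colon-separated fields are lane, tile, x and y; the `Read` type, the function names, and the example reads are hypothetical. Reads are first grouped by identical 5' position and strand (duplicate candidates), and a pair within a group is called "optical" only if both reads sit on the same lane and tile and their x and y coordinates each differ by at most the pixel distance.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str        # e.g. instrument:run:flowcell:lane:tile:x:y
    chrom: str
    five_prime: int  # 5' alignment coordinate
    strand: str      # "+" or "-"


def tile_xy(name: str) -> Tuple[Tuple[str, str], int, int]:
    """Extract ((lane, tile), x, y) from the last four name fields."""
    lane, tile, x, y = name.split(":")[-4:]
    return (lane, tile), int(x), int(y)


def optical_pairs(reads: List[Read], pixel_distance: int = 2500) -> List[Tuple[str, str]]:
    # Step 1: duplicate candidates share chromosome, 5' position and strand.
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime, read.strand)].append(read)

    # Step 2: within a group, keep pairs that are physically close on one tile.
    pairs = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            tile_a, xa, ya = tile_xy(a.name)
            tile_b, xb, yb = tile_xy(b.name)
            close = abs(xa - xb) <= pixel_distance and abs(ya - yb) <= pixel_distance
            if tile_a == tile_b and close:
                pairs.append((a.name, b.name))
    return pairs


reads = [
    Read("M1:1:FC:1:1101:1000:2000", "chr1", 500, "+"),
    Read("M1:1:FC:1:1101:1500:2100", "chr1", 500, "+"),  # same tile, nearby
    Read("M1:1:FC:1:2205:1000:2000", "chr1", 500, "+"),  # same position, other tile
    Read("M1:1:FC:1:1101:1200:2050", "chr1", 900, "+"),  # nearby, other position
]
print(optical_pairs(reads))
```

In this toy input only the first two reads form an optical pair: the third is a duplicate by position but lies on a different tile, and the fourth is physically close but maps elsewhere. For paired-end data the grouping key would also need the mate's 5' position and strand.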
