
  • #46
    I'm attaching two plots from a sample with ~152 million reads, one truncated at a distance of 10000 and another going all the way to 50000. For what it's worth, I noticed that the Broad is using a distance of 2500 for patterned flow cells, which seems pretty reasonable. If one enables tile spanning, then you don't see saturation until ~20000, which seems a bit over the top.

    Interestingly, though, a distance of 1 is sufficient to find duplicates with tile spanning enabled. I'll post the density of duplicates according to X/Y coordinate to ensure there's no NextSeq-like tile edge effect.

    Update: Yup, no tile-edge effect, so not spanning tiles makes sense.

    Last edited by dpryan; 01-24-2017, 02:40 AM.



    • #47
      Interesting. I was only looking at distances of up to 100. Does a plot of 10-100 show a consistent increase? It should. It does look like 2500 is a reasonable setting.



      • #48
        Keep in mind that 10-100 on the HiSeq 2500 is equivalent to 100-1000 on the HiSeq 3000/4000/X; apparently Illumina multiplied everything by 10 (because hey, why not?).

        Anyway, yes, it's essentially a linear increase starting at 20 (or 0 if spantiles is enabled).

        Edit: Originally wrote 8 instead of 20. I had something else in mind apparently...



        • #49
          Will try dupedist=2500. Thanks for the heads-up.

          Not many people have been systematically looking at duplicates (optical or otherwise), since Picard requires alignments to mark duplicates. We may find more duplicates than we expect once we start looking routinely.
          Last edited by GenoMax; 01-26-2017, 11:03 AM.



          • #50
            Originally posted by dpryan
            Keep in mind that 10-100 on the HiSeq 2500 is equivalent to 100-1000 on the HiSeq 3000/4000/X; apparently Illumina multiplied everything by 10 (because hey, why not?).
            Oh, thanks for the heads-up! Now Genomax's earlier observations make a lot more sense; on earlier platforms you see the asymptote well before 100.

            Also, it looks like spantiles should probably default to false, since it is really only applicable to the NextSeq. The reason it shows an immediate increase at 1 is that cross-tile coordinate-based duplicate detection does not use Euclidean distance the way intra-tile detection does. Instead, since only one coordinate is expected to be comparable (either X or Y), it uses the minimum of the X difference and the Y difference. So, at dist=1, you are basically comparing a cluster to every other cluster in the same row or column. This does not have much effect on the HiSeq 2500 or NextSeq (other than for tile-edge duplicates), but it looks like it has a huge impact on patterned flowcells because everything is already aligned in rows and columns.
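To make the difference between the two distance rules concrete, here is a minimal Python sketch. It is not Clumpify's code; the function names and example coordinates are made up for illustration, and only the two rules themselves (Euclidean within a tile, minimum of the X and Y differences across tiles) come from the explanation above.

```python
import math

def within_tile_distance(a, b):
    """Euclidean distance between two (x, y) cluster coordinates."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def cross_tile_distance(a, b):
    """Minimum of the X difference and the Y difference (cross-tile rule)."""
    return min(abs(a[0] - b[0]), abs(a[1] - b[1]))

# Hypothetical clusters in the same column (same X) but far apart in Y,
# as would be common on a patterned flowcell.
c1 = (1000, 2000)
c2 = (1000, 30000)

dist = 1
print(within_tile_distance(c1, c2) <= dist)  # Euclidean rule: not within dist
print(cross_tile_distance(c1, c2) <= dist)   # min(dx, dy) rule: within dist
```

Under the min(dx, dy) rule, any two clusters sharing a row or column are at distance 0, which is why even dist=1 matches so much when clusters sit on a grid.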



            • #51
              I looked at one sample from the NovaSeq data that Illumina made available recently. Here is a scan of dupedist= values (with spantiles=f dupesubs=0), similar to the one @Devon posted above.

              Temporarily removing the image until @Brian reconfirms things.
              Last edited by GenoMax; 01-31-2017, 06:35 PM.



              • #52
                Interesting, I guess a good cut-off there is 20000. Gotta love needing different settings per machine.

                Edit: Out of curiosity, what sort of percentage would 100 million reads represent? I also wonder if the setting will depend on which of the S1/S2/S3/S4 flow cells is being used.
                Last edited by dpryan; 01-31-2017, 12:54 PM.



                • #53
                  This sample had
                  Code:
                  Reads In:         1606392082
                  Clumps Formed:     356710875
                  Based on the name of the file, it must have come from an S1 flowcell.
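For the percentage question in #52, the arithmetic can be sketched in Python as follows. The two totals are the ones reported above; the 100 million figure is simply the number from the question, not a measured duplicate count, and the helper name is made up.

```python
reads_in = 1_606_392_082       # "Reads In" reported above
clumps_formed = 356_710_875    # "Clumps Formed" reported above

def percent_of_reads(n, total=reads_in):
    """Express a read count as a percentage of the total input reads."""
    return 100.0 * n / total

print(f"{percent_of_reads(100_000_000):.2f}% of input reads")
print(f"{reads_in / clumps_formed:.2f} reads per clump on average")
```

So 100 million reads would be a little over 6% of this sample, and the clumps average roughly 4.5 reads each.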
                  Last edited by GenoMax; 01-31-2017, 01:02 PM.



                  • #54
                    Originally posted by dpryan
                    Interesting, I guess a good cut-off there is 20000.
                    I'm not really sure. The graph is strange. Genomax estimated that the tiles appear to be roughly 33k by 37k, so dist=20000 would span a third of the tile. I'm guessing that maybe this was a highly amplified library with lots of PCR duplicates, and that the slope decreases and eventually levels off because of a high error rate when using "subs=0". I will investigate further. But thanks @Genomax for posting it!

                    Gotta love needing different settings per machine.
                    Yes, it makes deciding on default parameters very difficult, especially when the company never really publicizes any of this so you need to discover it empirically...



                    • #55
                      Clumpify - dedup fasta/fastq

                      Hey Brian, first off, thanks for designing these great tools and providing thorough explanations.

                      I'm trying to de-duplicate small RNA-seq reads from a 2x150 bp MiSeq run. The ultra-low-input library prep has made the adapter sequences less than straightforward, so I've removed those prior to attempting to merge and remove duplicates.

                      My question is whether clumpify respects the read names and will only remove paired duplicates, or if it can remove all duplicates in the data. My reads have 3' and 5' 4N sequences (to reduce bias in adapter ligation) that can be used to identify PCR duplicates if I can somehow collapse to unique reads before trimming them.

                      Thanks in advance for your suggestions,

                      Stewart
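The idea in this question, collapsing to unique reads while the 4N tags are still attached and trimming them only afterwards, can be sketched in Python. This is an illustration only and not what Clumpify does internally; the function name, the example sequences, and the exact-match rule are assumptions.

```python
UMI_LEN = 4  # assumed: 4 random bases (4N) at each end, as described in the post

def collapse_then_trim(reads, umi_len=UMI_LEN):
    """Keep the first copy of each exact sequence (tags included), then trim the tags."""
    seen = set()
    unique_inserts = []
    for seq in reads:
        if seq in seen:
            continue  # same insert AND same tags: treated as a PCR duplicate
        seen.add(seq)
        unique_inserts.append(seq[umi_len:len(seq) - umi_len])
    return unique_inserts

# Hypothetical merged reads: 5' tag + insert + 3' tag
reads = [
    "ACGT" + "TTGACCTGA" + "GGCA",  # molecule A
    "ACGT" + "TTGACCTGA" + "GGCA",  # exact copy of A, removed
    "CCTA" + "TTGACCTGA" + "TGAC",  # same insert, different tags, kept
]
print(collapse_then_trim(reads))
```

The third read survives because its tags differ, which is the point of deduplicating before trimming: identical inserts from distinct molecules are not collapsed.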



                      • #56
                        If you have extra bases at the 3' and 5' ends, they may complicate duplicate identification (you can add dupesubs=4 when running clumpify to allow up to 4 mismatches).

                        Here are combinations of flags for clumpify that may be useful.

                        Code:
                        dedupe=f optical=f (default)
                        Nothing happens with regards to duplicates.
                        
                        dedupe=t optical=f
                        All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.
                        
                        dedupe=f optical=t
                        Nothing happens.
                        
                        dedupe=t optical=t
                        Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.
                        
                        The allduplicates flag makes all copies of duplicates removed, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.
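The flag combinations above can be summarized as a small decision function. This Python sketch only restates the behavior described in this post; the function name and return strings are made up, and it does not call Clumpify.

```python
def dedupe_behavior(dedupe=False, optical=False, allduplicates=False):
    """Describe what happens to duplicates for a given flag combination."""
    if not dedupe:
        # optical and allduplicates have no effect unless dedupe=t
        return "no duplicate removal"
    scope = "optical duplicates only" if optical else "all duplicates"
    kept = "no copies kept" if allduplicates else "one copy kept"
    return f"{scope}; {kept}"

for flags in [
    dict(),
    dict(optical=True),
    dict(dedupe=True),
    dict(dedupe=True, optical=True),
    dict(dedupe=True, allduplicates=True),
]:
    print(flags, "->", dedupe_behavior(**flags))
```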



                        • #57
                          Originally posted by Stewart Russell
                          My question is whether clumpify respects the read names and will only remove paired duplicates, or if it can remove all duplicates in the data.
                          Clumpify does respect pairs and by default will only consider pairs duplicates if both reads match. In this situation it makes the most sense to me to merge reads first, then remove duplicates on the merged reads:

                          Code:
                          bbmerge.sh in=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq mininsert=20
                          clumpify.sh in=merged.fq out=merged_clumped.fq dedupe k=20
                          Merging will generally remove any adapter bases on the ends because those are not part of the overlapping sequence. Also, since merging removes bases off the ends of the adapters too, there will be fewer mismatches to confuse the issue. Note in this case I set "mininsert=20" and "k=20" because I don't know what your minimum expected length is for small RNAs, but by default BBMerge won't look for insert sizes shorter than 35, and by default Clumpify will not work on reads shorter than 31bp, so set those as appropriate. If you are not expecting insert sizes shorter than 35bp then you can remove those flags.

                          But, if you want to deduplicate the raw reads (though I'd really recommend adapter-trimming first), and ignore pairing, you can add the "unpair" flag to Clumpify:

                          Code:
                          clumpify.sh in1=r1.fq in2=r2.fq out=clumped.fq dedupe k=20 unpair
                          As Genomax mentioned, you will probably need to play with the "dupesubs" flag if there are expected mismatches due to nongenomic bases. Note that after doing this pairing order will be lost so you'd need to run repair.sh to recover it.
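As a rough picture of what a substitution allowance means, the Python sketch below treats two reads as duplicates when they have the same length and differ at no more than a set number of positions. This is an assumption for illustration; Clumpify's actual comparison may differ (for example in how it handles Ns or reads of unequal length), and the function name is made up.

```python
def is_duplicate(a, b, max_subs=0):
    """True if a and b have equal length and at most max_subs mismatching positions."""
    if len(a) != len(b):
        return False
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches <= max_subs

read1 = "ACGTACGTAC"
read2 = "ACGTACGTAA"  # differs from read1 at the last base only

print(is_duplicate(read1, read2, max_subs=0))  # exact matching
print(is_duplicate(read1, read2, max_subs=1))  # one substitution tolerated
```

With exact matching, a single sequencing error or non-templated base is enough to keep two copies of the same molecule apart, which is why the allowance may need tuning.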
                          Last edited by Brian Bushnell; 02-05-2017, 09:45 AM.



                          • #58
                            Originally posted by Brian Bushnell
                            by default BBMerge won't look for insert sizes shorter than 35, and by default Clumpify will not work on reads shorter than 31bp, so set those as appropriate. If you are not expecting insert sizes shorter than 35bp then you can remove those flags.
                            This explains why I was losing so many reads from BBMerge. I should have read the docs more carefully!

                            Originally posted by Brian Bushnell
                            But, if you want to deduplicate the raw reads (though I'd really recommend adapter-trimming first), and ignore pairing, you can add the "unpair" flag to Clumpify
                            Given that I'm looking to merge/remove adapters, and then collapse based on sequence + 5' and 3' 4N, allowing for 3 non-templated bases, I think I need to do

                            Code:
                            bbmerge.sh in=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq mininsert=20
                            clumpify.sh in=merged.fq out=merged_clumped.fq dedupe dupesubs=3 k=20 unpair
                            Originally posted by Brian Bushnell
                            Note that after doing this pairing order will be lost so you'd need to run repair.sh to recover it.
                            At this point I don't think I need to re-pair because I have my data in a clean, single file for mapping.

                            Thanks for the help



                            • #59
                              A blog post about duplicates on HiSeq 4000 has been posted to QCFail site (with pretty graphics): https://sequencing.qcfail.com/articl...ated-sequences



                              • #60
                                Since inevitably others will run into this and wonder...

                                Please note that if you run FastQC on files that have been run through clumpify, you will see bias in the reported duplication rates between read #1 and read #2 in a pair. In other words, the duplication rate returned by FastQC for read #2 will be much higher (often 2-3x) than that of read #1. This is a technical artifact of FastQC's duplication module only looking at the first 100,000 reads in each file. Since clumpify reorders the files for increased compressibility, the first 100,000 reads are expected to be similar to one another, which skews the results.

                                In other words, don't worry about such observations.
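The ordering effect can be reproduced with a toy example in Python. This is a simplified sketch: it estimates duplication from only the first N reads, which is a stand-in for FastQC's behavior as described above, and it uses a plain sort as a stand-in for Clumpify's reordering (which groups reads by shared k-mers rather than sorting them). All names and sizes are made up.

```python
def duplication_rate(reads, first_n):
    """Fraction of the first first_n reads that repeat a sequence already seen in that window."""
    window = reads[:first_n]
    return 1 - len(set(window)) / len(window)

# 100 distinct sequences, each present 3 times (300 reads in total).
distinct = [f"SEQ{i:03d}" for i in range(100)]
interleaved = distinct * 3        # copies of a sequence are spread far apart
grouped = sorted(interleaved)     # copies of a sequence sit next to each other

print(duplication_rate(interleaved, 30))  # no repeats inside the first 30 reads
print(duplication_rate(grouped, 30))      # the first 30 reads hold only 10 sequences
```

The two files contain exactly the same reads, yet an estimate taken from the start of the file differs sharply depending on the order.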

