  • luc
    Senior Member
    • Dec 2010
    • 469

    #31
    Hi Brian,

    that dedupe function looks great! We have been waiting for such a tool.

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #32
      Thanks, luc, I appreciate it.

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #33
        Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #34
          Hi Devon,

          Yes, this is all done, I just haven't released it yet. I'll do so tomorrow (difficult for me to do where I am now; today's a vacation day here).

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #35
            Ah, right, MLK day. Enjoy the day off and stop checking SEQanswers!

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #36
              It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

              Code:
              clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #37
                Originally posted by Brian Bushnell
                It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

                Code:
                clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
                Yay! Two less operations ...

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #38
                  Originally posted by Brian Bushnell
                  It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

                  Code:
                  clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
                  A day late is still a quick turnaround.

                  Thanks for the great update!

                  • dpryan
                    Devon Ryan
                    • Jul 2011
                    • 3478

                    #39
                    Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

                    The impetus behind this is removing optical duplicates before delivery to our labs but still writing them to a separate file or files in case they need them for some reason.

                    BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #40
                      Originally posted by dpryan
                      Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
                      That request has been in for some time. I also wanted to see counts (with the associated sequences) to gauge how acute the duplicate problem is.

                      For now, use the following workaround provided by @Brian.

                      Code:
                      clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
                      filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
                      filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
                      BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
                      This is a bit murky. I have done the sweeps with 4000 data I have access to. If I keep the spantiles=f then I don't see any optical dups until dupedist=20. Note: The edge duplicates problem seen with NextSeq (which has @Brian setting spantiles=t by default) is not present in HiSeq 4000/MiSeq (again based on data I have seen).

                      I have not pulled out the reads using the method above to look at the co-ordinates/sequence as yet.

                      It may be good to see what you get.
                      Last edited by GenoMax; 01-23-2017, 06:13 AM.

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #41
                        Additionally, is there any way to make clumpify itself respect the "threads=" setting? pigz seems to, but clumpify itself seems to use as many as it can get regardless of what I specify. This is in version 36.86.

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #42
                          Originally posted by GenoMax
                          For now use the following workaround provided by @Brian.

                          Code:
                          clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
                          filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
                          filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
                          Thanks, in the interim I just wrote something in C that I can just call once to do this (it also strips "duplicate" from the read names).
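
The single-pass split described above can also be sketched in shell with awk. This is not the C tool mentioned in the post. It assumes strict 4-line FASTQ records and that marked reads carry the substring "duplicate" in the header line, which is what the filterbyname.sh calls match on. The function name split_dupes is made up for illustration, and the marker is left in the read names rather than stripped.

```shell
# Split a FASTQ with marked duplicates into two files in a single pass.
# Assumptions: 4-line records; marked reads contain "duplicate" in the header.
split_dupes() {
    # $1 = marked input, $2 = output for duplicates, $3 = output for the rest
    : > "$2"    # create both outputs up front so each exists even if it stays empty
    : > "$3"
    awk -v dupes="$2" -v rest="$3" '
        NR % 4 == 1 { out = (index($0, "duplicate") > 0) ? dupes : rest }  # choose target per record
        { print > out }                                                    # all 4 lines follow the header
    ' "$1"
}

# Example (not executed here):
# split_dupes y.fq dupes.fq unique.fq
```

Only the header line decides where a record goes, so a quality string that happens to start with "@" does not confuse the split.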

                          Originally posted by GenoMax
                          This is a bit murky. I have done the sweeps with 4000 data I have access to. If I keep the spantiles=f then I don't see any optical dups until dupedist=20. Note: The edge duplicates problem seen with NextSeq (which has @Brian setting spantiles=t by default) is not present in HiSeq 4000/MiSeq (again based on data I have seen).
                          Thanks, I'm running this now on a single sample. I'll post an image when I have a worthwhile sweep range.

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #43
                            Originally posted by dpryan
                            Thanks, I'm running this now on a single sample. I'll post an image when I have a worthwhile sweep range.
                            Do the sweep with spantiles=f and t. I was only interested in optical duplicates when I did mine.
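
A sweep like the one discussed here could be scripted roughly as follows. This is only a sketch: the flags (markduplicates, optical, dupedist, spantiles) are the ones named in this thread, while the function names, the output file naming, and the idea of counting marked headers with awk are assumptions for illustration. The first function only prints the commands so they can be reviewed before anything is run.

```shell
# Print one clumpify.sh command per (dupedist, spantiles) combination.
sweep_commands() {
    # $1 = input FASTQ; remaining args = dupedist values to try
    input=$1
    shift
    for dist in "$@"; do
        for span in f t; do
            # Output file name is a made-up convention for this sketch.
            echo "clumpify.sh in=$input out=marked_d${dist}_span_${span}.fq markduplicates optical dupedist=$dist spantiles=$span"
        done
    done
}

# Count reads whose header carries the "duplicate" marker (4-line FASTQ assumed).
count_marked() {
    awk 'NR % 4 == 1 && index($0, "duplicate") > 0 { n++ } END { print n + 0 }' "$1"
}

# Example (not executed here):
# sweep_commands x.fq 10 20 40 80
```

After checking the printed commands, piping them to sh would run them one after another, and count_marked on each output file gives the number of marked reads per setting.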

                            • Brian Bushnell
                              Super Moderator
                              • Jan 2014
                              • 2709

                              #44
                              Originally posted by dpryan
                              Additionally, is there any way to make clumpify itself respect the "threads=" setting? pigz seems to, but clumpify itself seems to use as many as it can get regardless of what I specify. This is in version 36.86.
                              Oh, hmmm... that will be very tricky. When running with one group, Clumpify should respect threads correctly. But when writing temp files (which happens whenever the reads won't all fit in memory), it uses at least one thread per temp file, and the default is a minimum of 11 temp files. Your best bet, unfortunately, would be to bind the process to a certain number of cores. You can also manually set the number of groups, which indirectly affects the number of threads used.

                              Clumpify also uses multithreaded sorting, which uses all available cores, but normally that only happens for a small fraction of the runtime. However, I will add a flag to disable it.
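
On Linux, binding the process to a fixed set of cores can be done with taskset from util-linux. The sketch below is one possible way to do it, not something taken from BBTools. The wrapper name run_pinned is hypothetical, and the core range in the example is arbitrary. The groups setting mentioned above is left out because its exact flag syntax is not given in this thread.

```shell
# Run any command pinned to a fixed set of CPU cores (Linux, util-linux taskset).
run_pinned() {
    # $1 = CPU list in taskset syntax, e.g. "0-7"; remaining args = the command
    cpus=$1
    shift
    taskset -c "$cpus" "$@"
}

# Example (not executed here): confine Clumpify and its child processes to 8 cores.
# run_pinned 0-7 clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz threads=8
```

CPU affinity is inherited by child processes, so the Java process and any pigz processes started by the script stay on the same cores, however many threads they create.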

                              • Brian Bushnell
                                Super Moderator
                                • Jan 2014
                                • 2709

                                #45
                                Originally posted by dpryan
                                Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

                                The impetus behind this is removing optical duplicates before delivery to our labs but still writing them to a separate file or files in case they need them for some reason.
                                I will plan to add a new output stream for duplicate files as well, though I might not get to it this week.
