  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #91
    I am able to do something like

    Code:
    for f in *_1.fastq; do i=${f%_1.fastq}; clumpify.sh -Xmx10g in1="${i}_1.fastq" in2="${i}_2.fastq" out1="${i}_clu_1.fastq" out2="${i}_clu_2.fastq"; done
    and have clumpify.sh produce two files. I am not sure why you are having trouble.
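Before launching a loop like this on many files, it can help to preview which pairs it would pick up. The following is only a minimal sketch and not part of the original post: the function name `list_pairs` and the rule that skips `_clu` files (outputs of an earlier run, which also end in `_1.fastq`) are assumptions based on the file naming used above. It prints the clumpify.sh commands and does not run them.

```shell
# Hypothetical helper (not from the thread): print the clumpify.sh command
# for every *_1.fastq file in a directory that has a matching *_2.fastq mate.
list_pairs() {
    dir="${1:-.}"
    for r1 in "$dir"/*_1.fastq; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        base="${r1%_1.fastq}"
        case "$base" in *_clu) continue ;; esac  # skip outputs of an earlier run
        if [ ! -e "${base}_2.fastq" ]; then
            echo "skipping $r1: no mate file" >&2
            continue
        fi
        echo "clumpify.sh -Xmx10g in1=${base}_1.fastq in2=${base}_2.fastq out1=${base}_clu_1.fastq out2=${base}_clu_2.fastq"
    done
}
```

Calling `list_pairs .` in the data directory lists one command per complete pair; files without a mate are reported on stderr.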

    Comment

    • kokyriakidis
      Member
      • Jul 2018
      • 12

      #92
      I am having trouble using clumpify with the dedupe and optical parameters to remove optical duplicates, e.g. clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify works without these parameters.

      Comment

      • DCZ
        Junior Member
        • Feb 2019
        • 4

        #93
        Hi Brian,

        I really appreciate clumpify: it's fast and does exactly what it should.

        I'm trying to use it for optical duplicate detection, which works great. However, I only want to report the number of optical duplicates, without creating the deduplicated output fastq file. Is it possible to skip producing output? At the moment, writing the output is the slowest step in my pipeline.

        Thanks in advance

        Comment

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #94
          @DCZ: That is easy to do. If you do not provide an "out=" argument, most BBTools will run the operation and report the relevant statistics without writing any output.

          Tip: If you ever want to pipe things, you can use "out=stdout.fq" in the first tool and then "in=stdin.fq" in the next tool. You get the idea.
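The two patterns described in this reply might look like the sketch below. The file names are placeholders, and using reformat.sh as the downstream tool is only an illustrative assumption. The commands are stored in variables and printed, so nothing is executed until you run them yourself.

```shell
# Placeholder input name (an assumption for illustration).
reads=reads.fq.gz

# 1) Statistics only: no out= argument, so no deduplicated file is written.
stats_cmd="clumpify.sh in=$reads dedupe optical"

# 2) Piping: the first tool writes FASTQ to stdout, the next reads it from stdin.
#    For interleaved paired reads the downstream tool may need to be told so
#    (for example with int=t); check the tool's help text before relying on this.
pipe_cmd="clumpify.sh in=$reads out=stdout.fq dedupe optical | reformat.sh in=stdin.fq out=deduped.fq.gz"

# Print the commands for review.
printf '%s\n' "$stats_cmd" "$pipe_cmd"
```

Whether a particular tool streams cleanly is worth confirming on a small file first.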

          Comment

          • DCZ
            Junior Member
            • Feb 2019
            • 4

            #95
            Ah, if only everything was this simple! Should've thought of that! Thanks!

            Comment

            • Dario1984
              Senior Member
              • Jun 2011
              • 166

              #96
              The Clumpify documentation webpage doesn't mention anything about removing duplicates, which I read about in a blog. Could it be expanded? I have NovaSeq whole-genome sequencing data (NovaSeq Control Software version 1.4.0 and Real Time Analysis version 3.3.3 acquisition, followed by bcl2fastq version 2.20.0.422 conversion) for human samples at about 90x coverage, so I think it's important that I use Clumpify. I intend to map the reads with bwa, and I'm not sure if it supports some reads being pairs and some being merged (its documentation is minimal), so I plan to skip the clumping, if possible.

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                #97
                @Dario1984: Have you run clumpify on your data? If you have excellent libraries with tightly controlled insert sizes you will find the duplicate rate to be well controlled. Brian has some explicit use cases in his Biostars post.

                Comment

                • Dario1984
                  Senior Member
                  • Jun 2011
                  • 166

                  #98
                  I have not run Clumpify yet, but I will, following the examples you linked to.

                  Comment

                  • Chief_Lazy_Bison
                    Junior Member
                    • Dec 2014
                    • 9

                    #99
                    java.lang.AssertionError

                    Hello,

                    I'm using BBTools to preprocess some metagenomic HiSeq reads prior to assembly and I've run into a little issue with clumpify. I am using the recommended three-step error correction found in the AssemblyPipeline.sh script, but the second error-correction step stalls/freezes.

                    When I check the stderr file generated by the job, I see these exceptions:

                    Exception in thread "Thread-1202" java.lang.AssertionError
                    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
                    Fetched 4595507 reads: 12.948 seconds.
                    --
                    Exception in thread "Thread-1203" java.lang.AssertionError
                    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)

                    I have resubmitted the job and got the exact same exceptions the second time as well.
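When the same exceptions recur across runs, a compact summary of the stderr log makes them easier to compare. This is a minimal sketch, not something from the thread: `summarise_assertions` is a hypothetical helper that counts `java.lang.AssertionError` lines and tallies the first stack frame that follows each one.

```shell
# Hypothetical helper (not from the thread): summarise Java AssertionErrors
# found in a job's stderr log.
summarise_assertions() {
    awk '
        /java\.lang\.AssertionError/ {
            n++
            # The next line, if it starts with "at", is the innermost stack frame.
            if ((getline frame) > 0 && frame ~ /^[ \t]*at /) {
                sub(/^[ \t]*at /, "", frame)
                seen[frame]++
            }
        }
        END {
            printf "AssertionErrors: %d\n", n
            for (f in seen) printf "%d x %s\n", seen[f], f
        }
    ' "$1"
}
```

Run it as `summarise_assertions job.err`, where `job.err` stands in for whatever stderr file your scheduler writes.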

                    A little background: this job is running on a cluster with SLURM scheduling. The job requests an entire node with 40 processors and 125G of RAM.

                    The reads are HiSeq PE 2x150 and the total size of the compressed reads is 343G.
                    This is the command that keeps stalling:
                    clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

                    There are 1158 temp files generated by clumpify that occupy ~750G.

                    Once this exception is thrown, the whole job kind of just hangs.

                    Using version 38.43 with java 1.8.0_121

                    Any feedback would be greatly appreciated.

                    Thanks!

                    Comment

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      Clumpify can need a lot of memory, depending on the size of the data. With the data you have, it is possible that you are simply running out of available memory. Have you looked into that?
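One way to look into this under SLURM is to compare the memory granted to the job with the heap passed to Java. A minimal sketch, assuming the job requested memory per node so that SLURM sets `SLURM_MEM_PER_NODE` (in megabytes), and assuming an 85% heap fraction, which is a rule of thumb rather than a documented requirement. `heap_flag` is a hypothetical helper, and the command is only printed.

```shell
# Hypothetical helper: turn a memory allocation in MB into a -Xmx flag,
# leaving roughly 15% headroom for non-heap JVM memory (an assumed margin).
heap_flag() {
    mem_mb=$1
    echo "-Xmx$(( mem_mb * 85 / 100 ))m"
}

# Inside a job script: print the command with an explicit heap size,
# but only when SLURM has exported the per-node allocation.
if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
    echo "clumpify.sh $(heap_flag "$SLURM_MEM_PER_NODE") in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder"
fi
```

If the printed heap is far below what the node offers, memory is a plausible suspect; if it is not, the cause probably lies elsewhere.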

                      Comment

                      • Chief_Lazy_Bison
                        Junior Member
                        • Dec 2014
                        • 9

                        Just resubmitted on a high-memory partition; hopefully this resolves the issue. Will update once the job finishes.

                        Comment

                        • Chief_Lazy_Bison
                          Junior Member
                          • Dec 2014
                          • 9

                          So I resubmitted the job on a node with 40 processors and 1TB of memory. I received two very similar exceptions, and the job is hanging again.

                          Exception in thread "Thread-147" java.lang.AssertionError
                          at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                          at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                          at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
                          --
                          Exception in thread "Thread-146" java.lang.AssertionError
                          at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                          at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                          at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            Can you provide the exact command line you are using? Is this being submitted via a job scheduler?

                            Comment

                            • Chief_Lazy_Bison
                              Junior Member
                              • Dec 2014
                              • 9

                              It is submitted to a SLURM queue via the attached script.

                              These reads are a collection of concatenated interleaved paired-end libraries.

                              The same script worked well on the individual libraries, but I wanted to do an assembly with all of the reads together, so I concatenated them all with:
                              Code:
                              cat *fq.gz > ALL.fq.gz
                              The command that ends up stalling is this:
                              Code:
                              clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

                              BBMerge plows through these reads with no complaints just prior to clumpify:

                              Code:
                              bbmerge.sh in=ALL_temp.fq.gz out=ALL.ecco.fq.gz ecco mix vstrict ordered ihist=ALL_ihist_merge1.txt
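Since the same script worked on the individual libraries, a quick sanity check on the concatenated file may be worthwhile. The sketch below is not a BBTools feature and rests on two assumptions: records are four-line FASTQ, and pairs are strictly interleaved, so the total line count should be a multiple of 8. `check_interleaved` is a hypothetical name. It does not verify that read names actually pair up, and it has to decompress the whole file, which takes a while on hundreds of gigabytes.

```shell
# Hypothetical helper: check that a gzipped interleaved FASTQ file has a
# line count consistent with complete read pairs (8 lines per pair).
check_interleaved() {
    # gzip -cd also reads files made by concatenating several .gz files.
    lines=$(gzip -cd "$1" | wc -l | tr -d ' ')
    if [ $(( lines % 8 )) -eq 0 ]; then
        echo "ok: $(( lines / 4 )) reads"
    else
        echo "suspect: $lines lines is not a multiple of 8"
        return 1
    fi
}
```

For example, `check_interleaved ALL.fq.gz` prints the read count when the file looks consistent and returns a non-zero status otherwise.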
                              Attached Files

                              Comment

                              • GenoMax
                                Senior Member
                                • Feb 2008
                                • 7142

                                I think you should follow the order of tools that Brian has in his script example. Run the clumpify job first. Since you are merging the reads first, I am going to speculate that clumpify is unable to identify duplicates properly. If your data is not from a patterned flowcell, you could remove the "optical" flag for clumpify.

                                Comment
