  • #91
    I am able to do something like

    Code:
    for i in `ls -1 *_1*.fastq | sed 's/_1.fastq//'`; do clumpify.sh -Xmx10g in1=$i\_1.fastq in2=$i\_2.fastq out1=$i\_clu_1.fastq out2=$i\_clu_2.fastq; done
    and have clumpify.sh produce two files. I am not sure why you are having trouble.
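A variant of the loop above that avoids parsing `ls` output, offered as a minimal sketch. It assumes paired files named `<sample>_1.fastq` and `<sample>_2.fastq` in the current directory. The `clumpify_pairs` function name and the `DRY_RUN` switch are hypothetical, added only so the commands can be inspected before they run.

```shell
# Sketch: run clumpify.sh on every <sample>_1.fastq / <sample>_2.fastq pair.
# clumpify_pairs and DRY_RUN are illustrative names, not part of BBTools.
clumpify_pairs() {
    for r1 in *_1.fastq; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        case $r1 in
            *_clu_1.fastq) continue ;;        # skip outputs of an earlier run
        esac
        sample=${r1%_1.fastq}                 # strip the _1.fastq suffix
        r2="${sample}_2.fastq"
        if [ ! -e "$r2" ]; then
            echo "skipping $sample: $r2 not found" >&2
            continue
        fi
        set -- clumpify.sh -Xmx10g in1="$r1" in2="$r2" \
            out1="${sample}_clu_1.fastq" out2="${sample}_clu_2.fastq"
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$@"                         # print the command only
        else
            "$@"                              # actually run clumpify.sh
        fi
    done
}
```

`DRY_RUN=1 clumpify_pairs` prints one command per complete pair; plain `clumpify_pairs` runs them.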



    • #92
      I am having trouble using clumpify with the parameters optical + dedupe to remove optical duplicates, e.g. clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify without these parameters works.



      • #93
        Hi Brian,

        I'm really appreciating clumpify, it's fast & does exactly what it should.

        I'm trying to use it for optical duplicate detection, which works great. However, I wish only to report the number of optical duplicates, without creating the deduplicated output fastq file. Is there a possibility to skip producing output? At the moment the writing output step takes the longest time in my pipeline.

        Thanks in advance



        • #94
          @DCZ: That is easy to do. If you do not provide any "out=" argument to most BBTools, they will do the operation and produce the relevant statistics without writing the output.

          Tip: If you ever want to pipe tools together, you can use "out=stdout.fq" in the first tool and then "in=stdin.fq" in the next tool. You get the idea.
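A rough sketch of both suggestions. The first function repeats the deduplication call from earlier in the thread with `out=` left off, so only statistics are reported. The second streams reads from one tool into the next. The function names, the file arguments, and the choice of `reformat.sh` and `bbduk.sh` as the two tools are assumptions for illustration; only `out=stdout.fq` and `in=stdin.fq` come from the post above.

```shell
# Sketch only; function names and file arguments are illustrative.

# Report duplicate statistics without writing reads: no out= is passed,
# so (per the post above) the tool runs and prints stats but writes no file.
count_optical_dups() {
    clumpify.sh in="$1" dedupe optical
}

# Stream reads from one BBTools program into the next without an
# intermediate file. reformat.sh and bbduk.sh stand in for "first tool"
# and "next tool"; any real filtering flags would go on the bbduk.sh side.
stream_between_tools() {
    reformat.sh in="$1" out=stdout.fq | bbduk.sh in=stdin.fq out="$2"
}
```

For example, `count_optical_dups temp.fq.gz` reports duplicates only, and `stream_between_tools temp.fq.gz filtered.fq.gz` writes a single output file at the end of the pipe.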



          • #95
            Ah, if only everything was this simple! Should've thought of that! Thanks!



            • #96
              The Clumpify documentation webpage doesn't mention anything about removing duplicates, which I read about in a blog. Could it be expanded? I have NovaSeq whole genome sequencing data (NovaSeq Control Software version 1.4.0 and Real Time Analysis version 3.3.3 acquisition followed by bcl2fastq version 2.20.0.422 conversion) for human samples at about 90x coverage, so I think it's important that I use Clumpify. I intend to map the reads with bwa, and I'm not sure whether it supports some reads being pairs and some being merged (its documentation is minimal), so I plan to skip the clumping, if possible.



              • #97
                @Dario1984: Have you run clumpify on your data? If you have excellent libraries with tightly controlled insert sizes you will find the duplicate rate to be well controlled. Brian has some explicit use cases in his Biostars post.



                • #98
                  I have not run Clumpify yet, but I will, following the examples you linked to.



                  • #99
                    java.lang.AssertionError

                    Hello,

                    I'm using BBTools to preprocess some metagenomic HiSeq reads prior to assembly and I've run into a little issue with clumpify. I am using the recommended 3-step error correction found in the AssemblyPipeline.sh script, but the second error correction step stalls/freezes.

                    When I check the stderr file generated by the job I see these exceptions:

                    Exception in thread "Thread-1202" java.lang.AssertionError
                    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
                    Fetched 4595507 reads: 12.948 seconds.
                    --
                    Exception in thread "Thread-1203" java.lang.AssertionError
                    at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                    at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                    at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)

                    I have resubmitted the job and got the exact same exceptions the second time as well.

                    A little background: this job is running on a cluster with SLURM scheduling. The job requests an entire node with 40 processors and 125G of RAM.

                    The reads are HiSeq PE 2x150 and the total size of the compressed reads is 343G.
                    This is the command that keeps stalling:
                    clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

                    There are 1158 temp files generated by clumpify that occupy ~750G.

                    Once this exception is thrown the whole job kind of just hangs.

                    Using version 38.43 with Java 1.8.0_121.

                    Any feedback would be greatly appreciated.

                    Thanks!



                    • Clumpify can need a lot of memory depending on the size of the data. With the data you have, it is possible that you are simply running out of available memory. Have you looked into that?
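One way to rule memory in or out is to pass an explicit `-Xmx` (as in the loop earlier in this thread) sized from the job's allocation. A minimal sketch, assuming the job was submitted with `--mem` so that SLURM sets `SLURM_MEM_PER_NODE` in megabytes. The `xmx_from_slurm` helper and the 85% default are assumptions for illustration, not BBTools recommendations.

```shell
# Print a -Xmx flag sized to a percentage of the SLURM memory allocation.
# xmx_from_slurm is a hypothetical helper; 85 is an arbitrary headroom choice.
xmx_from_slurm() {
    pct=${1:-85}
    mem_mb=${SLURM_MEM_PER_NODE:-}
    case $mem_mb in
        ''|0)
            # Unset, or 0 (which SLURM uses for "all memory on the node").
            echo "SLURM_MEM_PER_NODE is unset or 0; pass -Xmx by hand" >&2
            return 1
            ;;
    esac
    echo "-Xmx$(( mem_mb * pct / 100 ))m"
}

# Usage inside a job script (command taken from the post above):
#   clumpify.sh "$(xmx_from_slurm)" in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder
```

If the job still stalls with a heap close to the full allocation, memory is less likely to be the cause.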



                      • Just resubmitted on a high memory partition, hopefully this resolves the issue. Will update once the job finishes.



                        • I resubmitted the job on a node with 40 processors and 1TB of memory, received two very similar exceptions, and the job is hanging again.

                          Exception in thread "Thread-147" java.lang.AssertionError
                          at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                          at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                          at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)
                          --
                          Exception in thread "Thread-146" java.lang.AssertionError
                          at clump.KmerSort3$FetchThread3.fetchNext_inner(KmerSort3.java:706)
                          at clump.KmerSort3$FetchThread3.fetchNext(KmerSort3.java:655)
                          at clump.KmerSort3$FetchThread3.run(KmerSort3.java:577)



                          • Can you provide the exact command line you are using? Is this being submitted via a job scheduler?



                            • It is submitted to a SLURM queue via the attached script.

                              These reads are a collection of concatenated interleaved paired-end libraries.

                              The same script worked well on the individual libraries, but I wanted to do an assembly with all of the reads together, so I concatenated them all with:
                              Code:
                              cat *fq.gz > ALL.fq.gz
                              The command that ends up stalling is this:
                              Code:
                              clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder

                              bbmerge plows through these reads with no complaints just prior to clumpify:

                              Code:
                              bbmerge.sh in=ALL_temp.fq.gz out=ALL.ecco.fq.gz ecco mix vstrict ordered ihist=ALL_ihist_merge1.txt
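Since the individual libraries ran cleanly and only the concatenated file stalls, it may be worth confirming that each input is a complete interleaved file before concatenating. A quick sanity check, assuming standard four-line FASTQ records, so that an intact interleaved paired file has a line count divisible by 8. `check_interleaved` is a hypothetical helper, and this check cannot catch every kind of corruption.

```shell
# Hypothetical helper: flag gzipped FASTQ files whose line count is not a
# multiple of 8 (4 lines per read x 2 reads per interleaved pair).
check_interleaved() {
    status=0
    for f in "$@"; do
        lines=$(gzip -dc "$f" | wc -l)
        if [ $(( lines % 8 )) -ne 0 ]; then
            echo "$f: $lines lines, not a multiple of 8" >&2
            status=1
        fi
    done
    return $status
}
```

Running `check_interleaved *fq.gz` before the `cat` step returns non-zero and names any file that looks truncated or unpaired.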
                              Attached Files



                              • I think you should follow the order of tools that Brian has in his script example. Do the clumpify job first. Since you are merging the reads first, I am going to speculate that clumpify is unable to identify duplicates properly. If your data is not from a patterned flowcell you could remove the "optical" flag for clumpify.
