  • #61
    Originally posted by GenoMax View Post
    I also wanted to see counts (with associated sequence) to see how acute of a problem the duplicates may be.
    Update! Clumpify can now report the number of duplicates per read. Usage:

    Code:
    clumpify.sh in=file.fq out=deduped.fq dedupe addcount
    Note that this does NOT work in "markduplicates" mode, only in "dedupe" mode. Reads that absorbed duplicates will have "copies=X" appended to the end to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
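
    If you want a quick summary of how severe the duplication is, a minimal sketch (assuming standard 4-line FASTQ and the copies=X header annotation described above) is to tally the annotations from the headers:

    Code:
    # Tally copies=X from FASTQ header lines only (every 4th line starting at line 1)
    # to see the multiplicity distribution of surviving reads
    awk 'NR%4==1' deduped.fq | grep -o 'copies=[0-9]*' | sort | uniq -c | sort -rn
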
    Last edited by Brian Bushnell; 05-04-2017, 03:06 PM.

    Comment


    • #62
      Wonderful!

      Will give it a try as soon as I can get BBMap upgraded to the latest version on the cluster.

      Comment


      • #63
        Originally posted by Brian Bushnell View Post
        Note that this does NOT work in "markduplicates" mode, only in "dedupe" mode. Reads that absorbed duplicates will have "count=X" appended to the end to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
        Hi Brian,

        Thank you for adding these new features. I tried Clumpify (37.17) to detect duplicates in HiSeq 2500 single-end strand-specific 50 bp reads. My pipeline is Clumpify - BBDuk - BBMap - featureCounts - a) edgeR for differential expression analysis or b) network construction. I have some questions about performance and output after several trials of dedupe.

        1) I noticed that when I added dedupe=t, the speed slowed down by ~10 times. Is this expected, or did I set some options incorrectly?

        No dedup:
        Code:
        clumpify.sh -Xmx44g in=$INPUT out=$OUTPUT.clumped.fq.gz reorder=t overwrite=t
        
        Time:                           3540.405 seconds.
        [COLOR="red"]Reads Processed:      27914k    7.88k reads/sec[/COLOR]
        Bases Processed:       1423m    0.40m bases/sec
        
        Reads In:           27914336
        Clumps Formed:       2997579
        Total time:     3540.575 seconds.
        Remove optical duplicates:
        Code:
        clumpify.sh -Xmx48g in=$INPUT out=$OUTPUT.clumped.fq.gz reorder=t overwrite=t \
        dedupe=t addcount=t dupedist=40 optical=t
        
        Time:                           38030.384 seconds.
        [COLOR="Red"]Reads Processed:      27914k    0.73k reads/sec[/COLOR]
        Bases Processed:       1423m    0.04m bases/sec
        
        Reads In:           27914336
        Clumps Formed:       2997579
        Duplicates Found:      27931
        Total time:     41299.460 seconds.
        2) For the output, where is the 'count=x' supposed to be stored? I did not see it in the output fq file.

        3) I would like to get an idea about the ratio of optical duplicates vs PCR duplicates. To do that, should I run the first time with optical=t and the second time with optical=f, and then compare the difference between these two runs to get the PCR duplicates?

        Thanks a lot!

        Comment


        • #64
          Originally posted by chiayi View Post
          2) For the output, where is the 'count=x' supposed to be stored? I did not see it in the out fq file.
          It is stored in the fastq header in the "clumped" file, but only for reads where duplicates were found.

          3) I would like to get an idea about the ratio of optical duplicates vs PCR duplicates. To do that, should I run the first time with optical=t and the second time with optical=f, and then compare the difference between these two runs to get the PCR duplicates?

          Thanks a lot!
          Code:
          dedupe=f optical=f (default)
          Nothing happens with regards to duplicates.
          
          dedupe=t optical=f
          All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.
          
          dedupe=f optical=t
          Nothing happens.
          
          dedupe=t optical=t
          Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.
          The allduplicates flag causes all copies of duplicates to be removed, rather than leaving a single copy. But like optical, it has no effect unless dedupe=t.
          You should add "spantiles=f" for all non-NextSeq data. For optical duplicates you would need to specify "dupedist=". There are suggested values in this thread.
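
          Following that table, one way to answer question 3 is to run twice and difference the counts; the file names below are placeholders, and dupedist=40 is just the value used earlier in this thread:

          Code:
          # Run 1: all duplicates (optical + PCR)
          clumpify.sh in=reads.fq.gz out=dedupe_all.fq.gz dedupe=t
          # Run 2: optical duplicates only
          clumpify.sh in=reads.fq.gz out=dedupe_optical.fq.gz dedupe=t optical=t dupedist=40
          # PCR duplicates ~= "Duplicates Found" (run 1) minus "Duplicates Found" (run 2)
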
          Last edited by GenoMax; 05-04-2017, 10:29 AM.

          Comment


          • #65
            1) No, that's not expected, nor do I see a reason why that might be happening. Also, the performance is amazingly slow on your system in both cases. Here's what I get:

            Code:
            clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder
            
            Time:                           40.496 seconds.
            Reads Processed:      18059k    445.97k reads/sec
            Bases Processed:       1824m    45.04m bases/sec
            
            Reads In:           18059968
            Clumps Formed:        809674
            Total time:     40.680 seconds.
            Code:
            clumpify.sh in=reads.fq.gz out=clumped2.fq.gz reorder dedupe optical dupedist=40 addcount
            
            Time:                           39.087 seconds.
            Reads Processed:      18059k    462.04k reads/sec
            Bases Processed:       1824m    46.67m bases/sec
            
            Reads In:           18059968
            Clumps Formed:        405473
            Duplicates Found:        864
            Total time:     39.135 seconds.
            These are respectively about 400 and 4000 times faster than what you're seeing... I can only speculate that it might have something to do with network congestion... also, do you have pigz in your path? When you look at top while the process is running, what is the reported CPU utilization of java and pigz?
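
            A minimal sketch of those two checks, assuming a Linux host:

            Code:
            # Is pigz actually on the PATH that the job sees?
            command -v pigz || echo "pigz not found"
            # One batch-mode snapshot of CPU usage for java and pigz while Clumpify runs
            top -b -n 1 | grep -E 'java|pigz'
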

            2) The count is only present for reads that had duplicates; duplicate-free reads won't have it. For example, it looks like this:

            Code:
            reformat.sh in=reads.fq.gz out=1pair.fq reads=1
            cat 1pair.fq 1pair.fq 1pair.fq > 3pair.fq
            clumpify.sh in=3pair.fq out=deduped_3pair.fq dedupe addcount
            cat deduped_3pair.fq
            
            @HISEQ13:296:CA8E8ANXX:7:1101:1432:1903 1:N:0:ACGGAACTGTTCCG copies=3
            NTTGGATTGCATTAAGGAGCGAGGATGAGCATTCCATTCACCCGCTGGCCGGAAGAGTTTGCCCGTCGCTATCGGGAAAAAGGCTACTGGCAGGATTTGCC
            +
            !<=@=BFDGGGGGGBDDFGGGBGDGGFGGGGGGGGGGGGGGGGGGGGGBG@C/>EGGGGGGCG>DGGGGGGGCGGGGGGGGB>GGGGGGGGGGGGGCDGED
            @HISEQ13:296:CA8E8ANXX:7:1101:1432:1903 2:N:0:ACGGAACTGTTCCG copies=3
            TCGTTGAGCAGTTGCACCACGCGAATGGAGGAATGTTCTGTGACGAAAGTATTGAGGAAATCATCCCCGCTAAACAGCGCATGTTGGCGATCGGCAATCAG
            +
            BBBBBF0=1;@FGGDDGEGGDGD/EGGG:>E/BGGFDGGGGG@FGBEFGGEGGGGGDCGGCDGGCCCGGBGGDDGGEGGD>GGGGGGFD8?D<FGDG/EGC
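
            To pull the counts together with their sequences (what GenoMax asked for in #61), a small awk sketch over output like the above:

            Code:
            # Remember each record's header; when it carries copies=X,
            # print the count next to the read's sequence line
            awk 'NR%4==1{h=$0} NR%4==2 && h~/copies=/{match(h,/copies=[0-9]+/); print substr(h,RSTART,RLENGTH), $0}' deduped_3pair.fq
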

            Comment


            • #66
              1) pigz is in my path.
              *Updated* When I monitored a running job, here's what I saw:
              Code:
              PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
              20   0 56.006g 0.028t  13060 S 100.3 22.8 343:27.03 java
              20   0   48424   1864   1324 S   0.0  0.0   0:00.00 pigz
              It looked like java was using all the requested CPUs. Also, the virtual memory reached the limit of the requested memory while the physical memory did not. I also confirmed that there's no limit on virtual memory. Any thoughts on what might have caused the job to get stuck like this?

              2) I was looking for 'count=x' in the header and thus didn't find anything. When I looked for 'copies=X', I saw the duplicates.

              Thank you!
              Last edited by chiayi; 05-04-2017, 02:37 PM.

              Comment


              • #67
                Originally posted by chiayi View Post
                1) pigz is in my path.
                *Updated* When I monitored a running job, here's what I saw:
                Code:
                PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
                20   0 56.006g 0.028t  13060 S 100.3 22.8 343:27.03 java
                20   0   48424   1864   1324 S   0.0  0.0   0:00.00 pigz
                It looked like java was using all the requested CPUs. Also, the virtual memory reached the limit of the requested memory while the physical memory did not. I also confirmed that there's no limit on virtual memory. Any thoughts on what might have caused the job to get stuck like this?
                Oh... in rare cases, "reorder" can cause it to run very slowly, stuck at 100% CPU utilization. That might be the issue here... try removing that flag. During normal execution, pigz is also using CPU-time and java is usually substantially higher than 100%. What did the screen output look like when top was showing this?

                2) I was looking for 'count=x' in the header and thus didn't find anything. When I looked for 'copies=X', I saw the duplicates.
                Oops, sorry, I updated my post to correct that!

                Comment


                • #68
                  Originally posted by Brian Bushnell View Post
                  Oh... in rare cases, "reorder" can cause it to run very slowly, stuck at 100% CPU utilization. That might be the issue here... try removing that flag. During normal execution, pigz is also using CPU-time and java is usually substantially higher than 100%. What did the screen output look like when top was showing this?
                  It was at the 'Deduping' step. I removed the reorder flag and it was still stuck at the same step:

                  Code:
                  openjdk version "1.8.0_121"
                  OpenJDK Runtime Environment (build 1.8.0_121-b13)
                  OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
                  java -ea -Xmx48g -Xms48g -cp ~/package/bbmap/current/ clump.Clumpify 
                  -Xmx48g in=in.fastq.gz out=out.fq.bz2 dedupe=t addcount=t dupedist=40 
                  optical=t
                  Executing clump.Clumpify [-Xmx48g, in=in.fastq.gz,
                  out=out.clumped.fq.bz2, dedupe=t, addcount=t, dupedist=40, optical=t]
                  
                  Clumpify version 37.17
                  Read Estimate:          30447286
                  Memory Estimate:        13555 MB
                  Memory Available:       38586 MB
                  Set groups to 1
                  Executing clump.KmerSort [in1=in.fas in2=null, out1=out.clumped.fq.bz2, 
                  out2=null, groups=1, ecco=false, rename=false, shortname=f, 
                  [COLOR="Red"]unpair=false[/COLOR], repair=false, namesort=false, 
                  ow=true, -Xmx48g, dedupe=t, addcount=t, dupedist=40, optical=t]
                  
                  Making comparator.
                  Made a comparator with k=31, seed=1, border=1, hashes=4
                  Starting cris 0.
                  Fetching reads.
                  Making fetch threads.
                  Starting threads.
                  Waiting for threads.
                  Fetch time:     95.473 seconds.
                  Closing input stream.
                  Combining thread output.
                  Combine time:   0.133 seconds.
                  Sorting.
                  Sort time:      36.426 seconds.
                  Making clumps.
                  Clump time:     6.788 seconds.
                  Deduping.
                  At the time, top looked like this:

                  Code:
                  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
                  20   0 56.205g 0.023t  13008 S  99.7 18.7  10:00.83 java
                  20   0  188364   3932   2448 R   0.3  0.0   0:00.75 top
                  20   0   81452   1256   1012 S   0.0  0.0   0:00.00 pbzip2
                  I noticed the unpair=f and wasn't sure it was appropriate for my SE reads, so I added unpair=t. The speed then looked better, yet it got stuck at a different place:

                  Code:
                  Executing clump.KmerSplit [in1=in.fastq.gz, in2=null, 
                  out=out.fq.bz2, out2=null, groups=11, ecco=false, addname=f, 
                  shortname=f, unpair=true, repair=f, namesort=f, ow=true, 
                  -Xmx16g, dedupe=t, addcount=t, dupedist=40, optical=t]
                  
                  Input is being processed as unpaired
                  Made a comparator with k=31, seed=1, border=1, hashes=4
                  Time:                           54.743 seconds.
                  Reads Processed:      27914k    509.91k reads/sec
                  Bases Processed:       1423m    26.01m bases/sec
                  Executing clump.KmerSort3 [in1=in.clumped_clumpify_p1_temp%_2e77f340ad809301.fq.bz2, in2=null,
                   out=out.clumped.fq.bz2, out2=null, groups=11, ecco=f, addname=false,
                   shortname=f, unpair=f, repair=false,
                   namesort=false, ow=true, -Xmx16g, dedupe=t, addcount=t, dupedist=40, optical=t]
                  
                  Making comparator.
                  Made a comparator with k=31, seed=1, border=1, hashes=4
                  Making 2 fetch threads.
                  Starting threads.
                  Fetching reads.
                  Exception in thread "Thread-57"
                   *Control-C or similar caught [sig=15], quitting...
                  Exception in thread "Thread-58" Terminator thread: premature exit requested - quitting...
                  java.lang.RuntimeException: Duplicate process for file nutrinetat_vegb11_edi-0_r2y.clumped_clumpify_p1_temp0_2e77f340ad809301.fq.bz2
                          at fileIO.ReadWrite.addProcess(ReadWrite.java:1599)
                          at fileIO.ReadWrite.getInputStreamFromProcess(ReadWrite.java:1050)
                          at fileIO.ReadWrite.getUnpbzip2Stream(ReadWrite.java:986)
                          at fileIO.ReadWrite.getBZipInputStream2(ReadWrite.java:1086)
                          at fileIO.ReadWrite.getBZipInputStream(ReadWrite.java:1066)
                          at fileIO.ReadWrite.getInputStream(ReadWrite.java:802)
                          at fileIO.ByteFile1.open(ByteFile1.java:261)
                          at fileIO.ByteFile1.<init>(ByteFile1.java:96)
                          at fileIO.ByteFile.makeByteFile(ByteFile.java:26)
                          at stream.FastqReadInputStream.<init>(FastqReadInputStream.java:61)
                          at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:119)
                          at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:55)
                          at clump.KmerSort3$FetchThread.fetchNext(KmerSort3.java:853)
                          at clump.KmerSort3$FetchThread.run(KmerSort3.java:825)
                  When this happened, top looked like this:

                  Code:
                  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
                  20   0  188332   3956   2452 R   0.3  0.0   0:04.02 top
                  20   0 23.961g 4.125g  12916 S   0.0  1.6   0:42.05 java
                  20   0   81452   1256   1012 S   0.0  0.0   0:00.01 pbzip2
                  20   0 1950332 173572   1056 S   0.0  0.1   0:03.36 pbzip2
                  20   0 1941568 173768   1052 S   0.0  0.1   0:03.49 pbzip2
                  Not sure if it's related, but is "unpair=f" expected during the clump step, even though I added unpair=t in the command?

                  Comment


                  • #69
                    OK, you can kill it - it won't finish. This is great feedback, by the way - I really appreciate it.

                    1) What version of BBMap are you using?
                    2) "unpair=f" is the default and that flag should always be ignored except when you are doing error-correction; it is exclusively for error-correcting paired reads. I don't see why that would have anything to do with this problem, but I'm not really sure.
                    3) This dataset is pretty small. Would it be possible for you to send it to me so I can try to replicate the issue?
                    4) I will note that your second version was run differently than the first one - it was using only 16GB RAM. The behavior of Clumpify is different when running with enough memory to hold all reads, and with not enough memory to hold all reads (in which case it needs to write temp files). It works in both cases, but when diagnosing errors, it's easiest to run with the same -Xmx parameter in all cases.
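
                    In other words, pin -Xmx so every diagnostic run takes the same code path; whether everything fit in memory shows up in the log as "Set groups to 1" versus a higher group count with temp files. A sketch of a fixed, repeatable invocation:

                    Code:
                    # Keep -Xmx constant across diagnostic runs so the in-memory vs.
                    # temp-file behavior (the "groups" count in the log) doesn't change
                    clumpify.sh -Xmx48g in=in.fastq.gz out=out.fq.gz dedupe=t addcount=t
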

                    Thanks,
                    Brian

                    Comment


                    • #70
                      @santiagorevale

                      Sorry for not clearly stating this before, but the latest versions of Clumpify support in1, in2, out1, and out2 flags for paired reads in twin files.
                      Last edited by Brian Bushnell; 05-04-2017, 06:59 PM.

                      Comment


                      • #71
                        Originally posted by Brian Bushnell View Post
                        1) What version of BBMap are you using?
                        2) "unpair=f" is the default and that flag should always be ignored except when you are doing error-correction; it is exclusively for error-correcting paired reads. I don't see why that would have anything to do with this problem, but I'm not really sure.
                        3) This dataset is pretty small. Would it be possible for you to send it to me so I can try to replicate the issue?
                        4) I will note that your second version was run differently than the first one - it was using only 16GB RAM. The behavior of Clumpify is different when running with enough memory to hold all reads, and with not enough memory to hold all reads (in which case it needs to write temp files). It works in both cases, but when diagnosing errors, it's easiest to run with the same -Xmx parameter in all cases.
                        1) 37.17
                        2) 4) I changed the RAM to get the job to start sooner. You are certainly right, the settings should be identical. I started another run using the same memory and this time it was still stuck at the deduping step. Summary: with -Xmx48g, the run got stuck at deduping, regardless of the reorder and unpair settings.
                        3) I shared the file with your lbl.gov email. Please let me know if you didn't get it.

                        Thank you very much for your time and patience.

                        Comment




                          • #73
                            Got it; I'm working on it. It will take several days. But thanks!

                            Comment


                            • #74
                              OK, I can't replicate the slowness. I get this:

                              Code:
                              bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g
                              java version "1.8.0_31"
                              Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
                              Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
                              java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g
                              Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g]
                              
                              
                              Clumpify version 37.22
                              Read Estimate:          30447286
                              Memory Estimate:        13555 MB
                              Memory Available:       50656 MB
                              Set groups to 1
                              Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g]
                              
                              Making comparator.
                              Made a comparator with k=31, seed=1, border=1, hashes=4
                              Starting cris 0.
                              Fetching reads.
                              Making fetch threads.
                              Starting threads.
                              Waiting for threads.
                              Fetch time:     18.757 seconds.
                              Closing input stream.
                              Combining thread output.
                              Combine time:   0.127 seconds.
                              Sorting.
                              Sort time:      3.903 seconds.
                              Making clumps.
                              Clump time:     1.112 seconds.
                              Writing.
                              Waiting for writing to complete.
                              Write time:     15.535 seconds.
                              Done!
                              Time:                           39.821 seconds.
                              Reads Processed:      27914k    701.00k reads/sec
                              Bases Processed:       1423m    35.75m bases/sec
                              
                              Reads In:           27914336
                              Clumps Formed:       2997579
                              Total time:     40.089 seconds.
                              bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder
                              java version "1.8.0_31"
                              Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
                              Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
                              java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder
                              Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g, reorder]
                              
                              
                              Clumpify version 37.22
                              Read Estimate:          30447286
                              Memory Estimate:        13555 MB
                              Memory Available:       50656 MB
                              Set groups to 1
                              Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder]
                              
                              Making comparator.
                              Made a comparator with k=31, seed=1, border=1, hashes=4
                              Starting cris 0.
                              Fetching reads.
                              Making fetch threads.
                              Starting threads.
                              Waiting for threads.
                              Fetch time:     18.471 seconds.
                              Closing input stream.
                              Combining thread output.
                              Combine time:   0.170 seconds.
                              Sorting.
                              Sort time:      4.112 seconds.
                              Making clumps.
                              Clump time:     19.301 seconds.
                              Writing.
                              Waiting for writing to complete.
                              Write time:     13.423 seconds.
                              Done!
                              Time:                           56.050 seconds.
                              Reads Processed:      27914k    498.02k reads/sec
                              Bases Processed:       1423m    25.40m bases/sec
                              
                              Reads In:           27914336
                              Clumps Formed:       2997579
                              Total time:     56.125 seconds.
                              bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder dedupe
                              java version "1.8.0_31"
                              Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
                              Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
                              java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder dedupe
                              Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g, reorder, dedupe]
                              
                              
                              Clumpify version 37.22
                              Read Estimate:          30447286
                              Memory Estimate:        13555 MB
                              Memory Available:       50656 MB
                              Set groups to 1
                              Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder, dedupe]
                              
                              Making comparator.
                              Made a comparator with k=31, seed=1, border=1, hashes=4
                              Starting cris 0.
                              Fetching reads.
                              Making fetch threads.
                              Starting threads.
                              Waiting for threads.
                              Fetch time:     18.377 seconds.
                              Closing input stream.
                              Combining thread output.
                              Combine time:   0.174 seconds.
                              Sorting.
                              Sort time:      4.421 seconds.
                              Making clumps.
                              Clump time:     19.694 seconds.
                              Deduping.
                              Dedupe time:    0.767 seconds.
                              Writing.
                              Waiting for writing to complete.
                              Write time:     5.675 seconds.
                              Done!
                              Time:                           49.223 seconds.
                              Reads Processed:      27914k    567.09k reads/sec
                              Bases Processed:       1423m    28.92m bases/sec
                              
                              Reads In:           27914336
                              Clumps Formed:       2997579
                              Duplicates Found:   20066115
                              Total time:     49.299 seconds.
                              So, it takes under a minute regardless of whether I add "dedupe" or "reorder". The primary difference I see so far is that you are using OpenJDK while I'm using the Oracle JDK. Well, you're also using pbzip2 instead of gzip. But even when I converted to bz2 it was still fast; the output of top looked like this:

                              Code:
                              PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
                               9032 bushnell  20   0 66.1g  16g  11m S 1474 13.5   1:16.19 java
                               9177 bushnell  20   0  666m 229m 1028 S  600  0.2   1:45.25 pbzip2
                              Here, java is using ~14 cores and pbzip2 about 6 cores while reading the file. The time was still under a minute:

                              Code:
                              bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.bz2 out=clumped.fq.bz2 -Xmx63g reorder dedupe
                              java version "1.8.0_31"
                              Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
                              Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
                              java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.bz2 out=clumped.fq.bz2 -Xmx63g reorder dedupe
                              Executing clump.Clumpify [in=chiayi.fq.bz2, out=clumped.fq.bz2, -Xmx63g, reorder, dedupe]
                              
                              
                              Clumpify version 37.22
                              Read Estimate:          36800962
                              Memory Estimate:        28076 MB
                              Memory Available:       50656 MB
                              Set groups to 1
                              Executing clump.KmerSort [in1=chiayi.fq.bz2, in2=null, out1=clumped.fq.bz2, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder, dedupe]
                              
                              Making comparator.
                              Made a comparator with k=31, seed=1, border=1, hashes=4
                              Starting cris 0.
                              Fetching reads.
                              Making fetch threads.
                              Starting threads.
                              Waiting for threads.
                              Fetch time:     18.779 seconds.
                              Closing input stream.
                              Combining thread output.
                              Combine time:   0.153 seconds.
                              Sorting.
                              Sort time:      4.351 seconds.
                              Making clumps.
                              Clump time:     21.613 seconds.
                              Deduping.
                              Dedupe time:    0.795 seconds.
                              Writing.
                              Waiting for writing to complete.
                              Write time:     6.608 seconds.
                              Done!
                              Time:                           52.520 seconds.
                              Reads Processed:      27914k    531.50k reads/sec
                              Bases Processed:       1423m    27.11m bases/sec
                              
                              Reads In:           27914336
                              Clumps Formed:       2997579
                              Duplicates Found:   20066115
                              Total time:     52.575 seconds.
                              ...and there was no crash. However, I was able to replicate the crash when I set it to use -Xmx16g. I'll have to fix that... it appears to only manifest when using Clumpify with bz2 output and multiple groups (due to low memory). So you can avoid the crash by changing to gzipped instead of bzipped output for Clumpify (input can still be bz2).
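
                              Per that workaround, a hedged example: keep bz2 input if you like, but write gzipped output so the low-memory multi-group path avoids the bz2 bug:

                              Code:
                              # bz2 input is fine; gzipped output avoids the crash
                              clumpify.sh -Xmx16g in=reads.fq.bz2 out=clumped.fq.gz dedupe addcount
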

                              Is it possible for you to try running this with the Oracle JDK and see if that resolves the slowdown?
                              Last edited by Brian Bushnell; 05-18-2017, 11:21 AM.

                              Comment


                              • #75
                                 I tried tuning several things and here's a summary of what I found:
                                1.
                                Code:
                                Executing clump.Clumpify [-Xmx16g, in=in.fastq.gz, out=out.fq.gz, dedupe, reorder]
                                 [dedupe reorder][to reproduce Brian's results] In the runs for my previous post, -Xmx was ~80% of the physical memory. With that setting, Oracle JDK (88.88k reads/sec) is ~10 times faster than OpenJDK (9.80k reads/sec). However, when I adjusted -Xmx to 50% of physical memory, the speed increased ~10 times for Oracle JDK (1222.36k reads/sec) and ~120 times for OpenJDK (1200.63k reads/sec). Adding reorder and/or addcount didn't make much difference; the -Xmx to physical memory ratio is the key (see the sketch at the end of this post).
                                2.
                                 [dedupe reorder optical dupedist=40][setting in my original post] Then I added back the optical and dupedist flags (with -Xmx at 50% of the physical memory). The run was stuck at deduping like before.
                                Code:
                                   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
                                 97955 cc5544    20   0 25.117g 9.998g  12388 S 100.0  4.0  36:38.45 java
                                 98052 cc5544    20   0 1924368  17736    700 S   0.0  0.0   1:30.90 pigz
                                 Brian, could you try adding these two flags and see if this is reproducible on your end? Thanks a lot!
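
                                 For reference, a small sketch of how -Xmx could be pinned to ~50% of physical memory on Linux (reads MemTotal from /proc/meminfo; adjust for your scheduler):

                                 Code:
                                 # Set -Xmx to about half of physical RAM (MemTotal is in kB)
                                 XMX_GB=$(awk '/MemTotal/{printf "%d", $2/1024/1024/2}' /proc/meminfo)
                                 clumpify.sh -Xmx${XMX_GB}g in=in.fastq.gz out=out.fq.gz dedupe reorder
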

                                Comment
