Old 04-27-2017, 09:25 AM   #61
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Quote:
Originally Posted by GenoMax View Post
I also wanted to see counts (with associated sequence) to see how acute of a problem the duplicates may be.
Update! Clumpify can now report the number of duplicates per read. Usage:

Code:
clumpify.sh in=file.fq out=deduped.fq dedupe addcount
Note that this does NOT work in "markduplicates" mode, only in "dedupe" mode. Reads that absorbed duplicates will have "copies=X" appended to the end to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
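If you want a quick summary of those counts afterwards, something like this should work (a sketch, assuming plain-fastq output as in the command above; the parsing is illustrative):

Code:
# List the retained reads with the largest duplicate counts by parsing the "copies=" tag
awk 'NR % 4 == 1 && /copies=/ { split($0, a, "copies="); print a[2], $1 }' deduped.fq | sort -k1,1nr | head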

Last edited by Brian Bushnell; 05-04-2017 at 03:06 PM.
Brian Bushnell is offline   Reply With Quote
Old 04-27-2017, 09:37 AM   #62
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,302
Default

Wonderful!

Will give it a try as soon as I can get bbmap upgraded to latest on the cluster.
GenoMax is offline   Reply With Quote
Old 05-04-2017, 09:48 AM   #63
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
Note that this does NOT work in "markduplicates" mode, only in "dedupe" mode. Reads that absorbed duplicates will have "count=X" appended to the end to indicate how many reads they represent (including themselves, so the minimum you would see is 2).
Hi Brian,

Thank you for adding these new features. I tried Clumpify (37.17) to detect duplicates in HiSeq 2500 single-end, strand-specific 50 bp reads. My pipeline is Clumpify - BBDuk - BBMap - featureCounts - a) edgeR for differential expression analysis or b) network construction. I have some questions about performance and output after several trials of dedupe.

1) I noticed that when I added dedupe=t, the run slowed down by roughly 10-fold. Is this expected, or did I set some options incorrectly?

No dedup:
Code:
clumpify.sh -Xmx44g in=$INPUT out=$OUTPUT.clumped.fq.gz reorder=t overwrite=t

Time:                           3540.405 seconds.
Reads Processed:      27914k    7.88k reads/sec
Bases Processed:       1423m    0.40m bases/sec

Reads In:           27914336
Clumps Formed:       2997579
Total time:     3540.575 seconds.
Remove optical duplicates:
Code:
clumpify.sh -Xmx48g in=$INPUT out=$OUTPUT.clumped.fq.gz reorder=t overwrite=t \
dedupe=t addcount=t dupedist=40 optical=t

Time:                           38030.384 seconds.
Reads Processed:      27914k    0.73k reads/sec
Bases Processed:       1423m    0.04m bases/sec

Reads In:           27914336
Clumps Formed:       2997579
Duplicates Found:      27931
Total time:     41299.460 seconds.
2) For the output, where is the 'count=x' supposed to be stored? I did not see it in the out fq file.

3) I would like to get an idea about the ratio of optical duplicates vs PCR duplicates. To do that, should I run the first time with optical=t and the second time with optical=f, and then compare the difference between these two runs to get the PCR duplicates?

Thanks a lot!
chiayi is offline   Reply With Quote
Old 05-04-2017, 09:51 AM   #64
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,302
Default

Quote:
Originally Posted by chiayi View Post
2) For the output, where is the 'count=x' supposed to be stored? I did not see it in the out fq file.
It is stored in the fastq header in the "clumped" file, but only for reads where duplicates were found.

Quote:
3) I would like to get an idea about the ratio of optical duplicates vs PCR duplicates. To do that, should I run the first time with optical=t and the second time with optical=f, and then compare the difference between these two runs to get the PCR duplicates?

Thanks a lot!
Code:
dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t
Only optical duplicates (those with an X or Y coordinate within dupedist) are detected.  All copies except one are removed for each duplicate.

The allduplicates flag removes all copies of duplicates, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.
You should add "spantiles=f" for all non-NextSeq data. For optical duplicates you need to specify "dupedist="; there are suggested values in this thread.
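To estimate the optical vs. PCR split you asked about, one approach is to run twice on the same input and compare the "Duplicates Found" counts (a sketch; file names are placeholders and dupedist=40 is just the value used earlier in this thread):

Code:
clumpify.sh in=reads.fq.gz out=dedup_all.fq.gz dedupe=t
clumpify.sh in=reads.fq.gz out=dedup_optical.fq.gz dedupe=t optical=t dupedist=40
The optical-only run approximates the optical duplicates; the difference between the two counts approximates the PCR (non-optical) duplicates.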

Last edited by GenoMax; 05-04-2017 at 10:29 AM.
GenoMax is offline   Reply With Quote
Old 05-04-2017, 10:12 AM   #65
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

1) No, that's not expected, nor do I see a reason why that might be happening. Also, the performance is amazingly slow on your system in both cases. Here's what I get:

Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder

Time:                           40.496 seconds.
Reads Processed:      18059k    445.97k reads/sec
Bases Processed:       1824m    45.04m bases/sec

Reads In:           18059968
Clumps Formed:        809674
Total time:     40.680 seconds.
Code:
clumpify.sh in=reads.fq.gz out=clumped2.fq.gz reorder dedupe optical dupedist=40 addcount

Time:                           39.087 seconds.
Reads Processed:      18059k    462.04k reads/sec
Bases Processed:       1824m    46.67m bases/sec

Reads In:           18059968
Clumps Formed:        405473
Duplicates Found:        864
Total time:     39.135 seconds.
These are respectively about 60 and 600 times faster than what you're seeing... I can only speculate that it might have something to do with network congestion. Also, do you have pigz in your path? When you look at top while the process is running, what is the reported CPU utilization of java and pigz?
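For reference, a quick way to check both (a sketch; the exact top output format varies by system):

Code:
which pigz                          # confirm pigz is on the PATH seen by the job
top -b -n 1 | grep -E 'java|pigz'   # one-shot snapshot of CPU use for the relevant processes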

2) The count is only present for reads that had duplicates; duplicate-free reads won't have it. For example, it looks like this:

Code:
reformat.sh in=reads.fq.gz out=1pair.fq reads=1
cat 1pair.fq 1pair.fq 1pair.fq > 3pair.fq
clumpify.sh in=3pair.fq out=deduped_3pair.fq dedupe addcount
cat deduped_3pair.fq

@HISEQ13:296:CA8E8ANXX:7:1101:1432:1903 1:N:0:ACGGAACTGTTCCG copies=3
NTTGGATTGCATTAAGGAGCGAGGATGAGCATTCCATTCACCCGCTGGCCGGAAGAGTTTGCCCGTCGCTATCGGGAAAAAGGCTACTGGCAGGATTTGCC
+
!<=@[email protected]/>EGGGGGGCG>DGGGGGGGCGGGGGGGGB>GGGGGGGGGGGGGCDGED
@HISEQ13:296:CA8E8ANXX:7:1101:1432:1903 2:N:0:ACGGAACTGTTCCG copies=3
TCGTTGAGCAGTTGCACCACGCGAATGGAGGAATGTTCTGTGACGAAAGTATTGAGGAAATCATCCCCGCTAAACAGCGCATGTTGGCGATCGGCAATCAG
+
BBBBBF0=1;@FGGDDGEGGDGD/EGGG:>[email protected]>GGGGGGFD8?D<FGDG/EGC
Brian Bushnell is offline   Reply With Quote
Old 05-04-2017, 12:00 PM   #66
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

1) pigz is in my path.
*Updated* When I monitored a running job, here's what I saw:
Code:
PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20   0 56.006g 0.028t  13060 S 100.3 22.8 343:27.03 java
20   0   48424   1864   1324 S   0.0  0.0   0:00.00 pigz
It looked like java was using all the requested CPU. Also, the virtual memory reached the requested limit while physical memory did not. I also confirmed that there is no limit on virtual memory. Any thoughts on what might have caused the job to get stuck like this?

2) I was looking for 'count=x' in the header, which is why I didn't find anything. When I looked for 'copies=X', I saw the duplicates.

Thank you!

Last edited by chiayi; 05-04-2017 at 02:37 PM.
chiayi is offline   Reply With Quote
Old 05-04-2017, 03:10 PM   #67
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Quote:
Originally Posted by chiayi View Post
1) pigz is in my path.
*Updated* When I monitored a running job, here's what I saw:
Code:
PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20   0 56.006g 0.028t  13060 S 100.3 22.8 343:27.03 java
20   0   48424   1864   1324 S   0.0  0.0   0:00.00 pigz
It looked like java was using all the requested CPU. Also, the virtual memory reached the requested limit while physical memory did not. I also confirmed that there is no limit on virtual memory. Any thoughts on what might have caused the job to get stuck like this?
Oh... in rare cases, "reorder" can cause it to run very slowly, stuck at 100% CPU utilization. That might be the issue here... try removing that flag. During normal execution, pigz is also using CPU-time and java is usually substantially higher than 100%. What did the screen output look like when top was showing this?

Quote:
2) I was looking for 'count=x' in the header, which is why I didn't find anything. When I looked for 'copies=X', I saw the duplicates.
Oops, sorry, I updated my post to correct that!
Brian Bushnell is offline   Reply With Quote
Old 05-04-2017, 06:02 PM   #68
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
Oh... in rare cases, "reorder" can cause it to run very slowly, stuck at 100% CPU utilization. That might be the issue here... try removing that flag. During normal execution, pigz is also using CPU-time and java is usually substantially higher than 100%. What did the screen output look like when top was showing this?
It was at the 'Deduping' step. I removed the reorder flag and it still got stuck at the same step:

Code:
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
java -ea -Xmx48g -Xms48g -cp ~/package/bbmap/current/ clump.Clumpify 
-Xmx48g in=in.fastq.gz out=out.fq.bz2 dedupe=t addcount=t dupedist=40 
optical=t
Executing clump.Clumpify [-Xmx48g, in=in.fastq.gz,
out=out.clumped.fq.bz2, dedupe=t, addcount=t, dupedist=40, optical=t]

Clumpify version 37.17
Read Estimate:          30447286
Memory Estimate:        13555 MB
Memory Available:       38586 MB
Set groups to 1
Executing clump.KmerSort [in1=in.fas in2=null, out1=out.clumped.fq.bz2, 
out2=null, groups=1, ecco=false, rename=false, shortname=f, 
unpair=false, repair=false, namesort=false, 
ow=true, -Xmx48g, dedupe=t, addcount=t, dupedist=40, optical=t]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time:     95.473 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.133 seconds.
Sorting.
Sort time:      36.426 seconds.
Making clumps.
Clump time:     6.788 seconds.
Deduping.
At the time the top looked like this:

Code:
PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20   0 56.205g 0.023t  13008 S  99.7 18.7  10:00.83 java
20   0  188364   3932   2448 R   0.3  0.0   0:00.75 top
20   0   81452   1256   1012 S   0.0  0.0   0:00.00 pbzip2
I noticed the unpair=f and wasn't sure it was appropriate for my SE reads, so I added unpair=t. The speed then looked better, yet it got stuck at a different place:

Code:
Executing clump.KmerSplit [in1=in.fastq.gz, in2=null, 
out=out.fq.bz2, out2=null, groups=11, ecco=false, addname=f, 
shortname=f, unpair=true, repair=f, namesort=f, ow=true, 
-Xmx16g, dedupe=t, addcount=t, dupedist=40, optical=t]

Input is being processed as unpaired
Made a comparator with k=31, seed=1, border=1, hashes=4
Time:                           54.743 seconds.
Reads Processed:      27914k    509.91k reads/sec
Bases Processed:       1423m    26.01m bases/sec
Executing clump.KmerSort3 [in1=in.clumped_clumpify_p1_temp%_2e77f340ad809301.fq.bz2, in2=null,
 out=out.clumped.fq.bz2, out2=null, groups=11, ecco=f, addname=false,
 shortname=f, unpair=f, repair=false,
 namesort=false, ow=true, -Xmx16g, dedupe=t, addcount=t, dupedist=40, optical=t]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Making 2 fetch threads.
Starting threads.
Fetching reads.
Exception in thread "Thread-57"
 *Control-C or similar caught [sig=15], quitting...
Exception in thread "Thread-58" Terminator thread: premature exit requested - quitting...
java.lang.RuntimeException: Duplicate process for file nutrinetat_vegb11_edi-0_r2y.clumped_clumpify_p1_temp0_2e77f340ad809301.fq.bz2
        at fileIO.ReadWrite.addProcess(ReadWrite.java:1599)
        at fileIO.ReadWrite.getInputStreamFromProcess(ReadWrite.java:1050)
        at fileIO.ReadWrite.getUnpbzip2Stream(ReadWrite.java:986)
        at fileIO.ReadWrite.getBZipInputStream2(ReadWrite.java:1086)
        at fileIO.ReadWrite.getBZipInputStream(ReadWrite.java:1066)
        at fileIO.ReadWrite.getInputStream(ReadWrite.java:802)
        at fileIO.ByteFile1.open(ByteFile1.java:261)
        at fileIO.ByteFile1.<init>(ByteFile1.java:96)
        at fileIO.ByteFile.makeByteFile(ByteFile.java:26)
        at stream.FastqReadInputStream.<init>(FastqReadInputStream.java:61)
        at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:119)
        at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:55)
        at clump.KmerSort3$FetchThread.fetchNext(KmerSort3.java:853)
        at clump.KmerSort3$FetchThread.run(KmerSort3.java:825)
When this happened, top looked like this,

Code:
PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20   0  188332   3956   2452 R   0.3  0.0   0:04.02 top
20   0 23.961g 4.125g  12916 S   0.0  1.6   0:42.05 java
20   0   81452   1256   1012 S   0.0  0.0   0:00.01 pbzip2
20   0 1950332 173572   1056 S   0.0  0.1   0:03.36 pbzip2
20   0 1941568 173768   1052 S   0.0  0.1   0:03.49 pbzip2
Not sure if it's related, but is "unpair=f" expected during the clumping step, even though I added unpair=t in the command?
chiayi is offline   Reply With Quote
Old 05-04-2017, 06:36 PM   #69
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

OK, you can kill it - it won't finish. This is great feedback, by the way - I really appreciate it.

1) What version of BBMap are you using?
2) "unpair=f" is the default and that flag should always be ignored except when you are doing error-correction; it is exclusively for error-correcting paired reads. I don't see why that would have anything to do with this problem, but I'm not really sure.
3) This dataset is pretty small. Would it be possible for you to send it to me so I can try to replicate the issue?
4) I will note that your second version was run differently than the first one - it was using only 16GB RAM. The behavior of Clumpify is different when running with enough memory to hold all reads, and with not enough memory to hold all reads (in which case it needs to write temp files). It works in both cases, but when diagnosing errors, it's easiest to run with the same -Xmx parameter in all cases.
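For example, keeping the memory setting fixed across the runs being compared (a sketch; -Xmx48g and the flag values are just the ones used earlier in this thread):

Code:
clumpify.sh -Xmx48g in=in.fastq.gz out=clumped_all.fq.gz dedupe addcount
clumpify.sh -Xmx48g in=in.fastq.gz out=clumped_optical.fq.gz dedupe addcount optical dupedist=40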

Thanks,
Brian
Brian Bushnell is offline   Reply With Quote
Old 05-04-2017, 06:51 PM   #70
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

@santiagorevale

Sorry for not clearly stating this before, but the latest versions of Clumpify support in1, in2, out1, and out2 flags for paired reads in twin files.
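For example (a sketch; file names are placeholders):

Code:
clumpify.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz out1=clumped_R1.fq.gz out2=clumped_R2.fq.gz dedupe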

Last edited by Brian Bushnell; 05-04-2017 at 06:59 PM.
Brian Bushnell is offline   Reply With Quote
Old 05-04-2017, 08:50 PM   #71
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
1) What version of BBMap are you using?
2) "unpair=f" is the default and that flag should always be ignored except when you are doing error-correction; it is exclusively for error-correcting paired reads. I don't see why that would have anything to do with this problem, but I'm not really sure.
3) This dataset is pretty small. Would it be possible for you to send it to me so I can try to replicate the issue?
4) I will note that your second version was run differently than the first one - it was using only 16GB RAM. The behavior of Clumpify is different when running with enough memory to hold all reads, and with not enough memory to hold all reads (in which case it needs to write temp files). It works in both cases, but when diagnosing errors, it's easiest to run with the same -Xmx parameter in all cases.
1) 37.17
2) 4) I changed the RAM to get the job to start sooner. You are certainly right, the settings should be identical. I started another run using the same memory, and this time it still got stuck at the deduping step. Summary: with -Xmx48g, the run gets stuck at deduping regardless of the reorder and unpair settings.
3) I shared the file to your email address. Please let me know if you didn't get it.

Thank you very much for your time and patience.
chiayi is offline   Reply With Quote
Old 05-04-2017, 09:01 PM   #72
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Got it; I'm working on it. It will take several days. But thanks!
Brian Bushnell is offline   Reply With Quote
Old 05-05-2017, 03:04 PM   #73
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

OK, I can't replicate the slowness. I get this:

Code:
[email protected]:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g
Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g]


Clumpify version 37.22
Read Estimate:          30447286
Memory Estimate:        13555 MB
Memory Available:       50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time:     18.757 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.127 seconds.
Sorting.
Sort time:      3.903 seconds.
Making clumps.
Clump time:     1.112 seconds.
Writing.
Waiting for writing to complete.
Write time:     15.535 seconds.
Done!
Time:                           39.821 seconds.
Reads Processed:      27914k    701.00k reads/sec
Bases Processed:       1423m    35.75m bases/sec

Reads In:           27914336
Clumps Formed:       2997579
Total time:     40.089 seconds.
[email protected]:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder
Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g, reorder]


Clumpify version 37.22
Read Estimate:          30447286
Memory Estimate:        13555 MB
Memory Available:       50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time:     18.471 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.170 seconds.
Sorting.
Sort time:      4.112 seconds.
Making clumps.
Clump time:     19.301 seconds.
Writing.
Waiting for writing to complete.
Write time:     13.423 seconds.
Done!
Time:                           56.050 seconds.
Reads Processed:      27914k    498.02k reads/sec
Bases Processed:       1423m    25.40m bases/sec

Reads In:           27914336
Clumps Formed:       2997579
Total time:     56.125 seconds.
[email protected]:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder dedupe
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder dedupe
Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g, reorder, dedupe]


Clumpify version 37.22
Read Estimate:          30447286
Memory Estimate:        13555 MB
Memory Available:       50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder, dedupe]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time:     18.377 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.174 seconds.
Sorting.
Sort time:      4.421 seconds.
Making clumps.
Clump time:     19.694 seconds.
Deduping.
Dedupe time:    0.767 seconds.
Writing.
Waiting for writing to complete.
Write time:     5.675 seconds.
Done!
Time:                           49.223 seconds.
Reads Processed:      27914k    567.09k reads/sec
Bases Processed:       1423m    28.92m bases/sec

Reads In:           27914336
Clumps Formed:       2997579
Duplicates Found:   20066115
Total time:     49.299 seconds.
So, it takes under a minute regardless of whether I add "dedupe" or "reorder". The primary difference I see so far is that you are using OpenJDK while I'm using the Oracle JDK. Well, you're also using pbzip2 instead of gzip. But even when I converted to bz2 it was still fast; the output of top looked like this:

Code:
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9032 bushnell  20   0 66.1g  16g  11m S 1474 13.5   1:16.19 java
 9177 bushnell  20   0  666m 229m 1028 S  600  0.2   1:45.25 pbzip2
Here, java is using ~14 cores and pbzip2 about 6 cores while reading the file. The time was still under a minute:

Code:
[email protected]:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.bz2 out=clumped.fq.bz2 -Xmx63g reorder dedupe
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.bz2 out=clumped.fq.bz2 -Xmx63g reorder dedupe
Executing clump.Clumpify [in=chiayi.fq.bz2, out=clumped.fq.bz2, -Xmx63g, reorder, dedupe]


Clumpify version 37.22
Read Estimate:          36800962
Memory Estimate:        28076 MB
Memory Available:       50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.bz2, in2=null, out1=clumped.fq.bz2, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder, dedupe]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time:     18.779 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.153 seconds.
Sorting.
Sort time:      4.351 seconds.
Making clumps.
Clump time:     21.613 seconds.
Deduping.
Dedupe time:    0.795 seconds.
Writing.
Waiting for writing to complete.
Write time:     6.608 seconds.
Done!
Time:                           52.520 seconds.
Reads Processed:      27914k    531.50k reads/sec
Bases Processed:       1423m    27.11m bases/sec

Reads In:           27914336
Clumps Formed:       2997579
Duplicates Found:   20066115
Total time:     52.575 seconds.
...and there was no crash. However, I was able to replicate the crash when I set it to use -Xmx16g. I'll have to fix that... it appears to only manifest when using Clumpify with bz2 output and multiple groups (due to low memory). So you can avoid the crash by changing to gzipped instead of bzipped output for Clumpify (input can still be bz2).
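In other words, something like this should sidestep the crash until it's fixed (a sketch; file names are placeholders):

Code:
clumpify.sh -Xmx16g in=reads.fq.bz2 out=clumped.fq.gz dedupe addcount optical dupedist=40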

Is it possible for you to try running this with the Oracle JDK and see if that resolves the slowdown?

Last edited by Brian Bushnell; 05-18-2017 at 11:21 AM.
Brian Bushnell is offline   Reply With Quote
Old 05-06-2017, 06:19 AM   #74
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

I tried tuning several things; here's a summary of what I found:
1.
Code:
Executing clump.Clumpify [-Xmx16g, in=in.fastq.gz, out=out.fq.gz, dedupe, reorder]
[dedupe reorder][to reproduce Brian's results] In the runs from my previous post, -Xmx was ~80% of the physical memory. With that setting, Oracle JDK (88.88k reads/sec) is ~10 times faster than OpenJDK (9.80k reads/sec). However, when I adjusted -Xmx to 50% of physical memory, the speed increased ~10-fold for Oracle JDK (1222.36k reads/sec) and ~120-fold for OpenJDK (1200.63k reads/sec). Adding reorder and/or addcount didn't make much difference; the -Xmx to physical memory ratio is the key (see the sketch at the end of this post).
2.
[dedupe reorder optical dupedist=40][the settings from my original post] Then I added back the optical and dupedist flags (with -Xmx at 50% of physical memory). The run got stuck at deduping like before.
Code:
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 97955 cc5544    20   0 25.117g 9.998g  12388 S 100.0  4.0  36:38.45 java
 98052 cc5544    20   0 1924368  17736    700 S   0.0  0.0   1:30.90 pigz
Brian, could you try adding these two flags and see if this is reproducible at your end? Thanks a lot!
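For anyone wanting to reproduce point 1, the recipe that worked for me looks roughly like this (a sketch; the 50% factor is just what I observed above, and the memory probing is Linux-specific):

Code:
# Set -Xmx to roughly half of physical RAM before running Clumpify
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
XMX_G=$(( MEM_KB / 1024 / 1024 / 2 ))
clumpify.sh -Xmx${XMX_G}g in=in.fastq.gz out=out.fq.gz dedupe reorder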
chiayi is offline   Reply With Quote
Old 05-09-2017, 04:38 PM   #75
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Hi Chiayi,

I can't replicate the slowdown from -Xmx settings - that seems to be a result of your filesystem, virtual-memory, caching, and overcommit settings, which are causing disk swapping. But I'm glad you got it working at a reasonable speed, and hopefully this will help others who have had extremely slow performance in some situations.
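If you want to check the swapping hypothesis, a few quick looks (illustrative only; interpretation depends on your system):

Code:
cat /proc/sys/vm/overcommit_memory   # kernel overcommit policy (0, 1, or 2)
free -g                              # physical memory vs. swap currently in use
vmstat 5                             # nonzero si/so columns indicate active swapping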

I've identified the problem causing the slowdown with optical deduplication. It's because in your dataset there is one huge clump of 293296 reads, with a huge number of duplicates that are not optical duplicates. In that situation the performance can become O(N^2) with the size of the clump, which is very slow (though it's still making progress), since it currently compares every duplicate to every other duplicate to find if they are within the distance limit of each other, and both headers are parsed every time. I've modified it to be 5x faster now, and I am continuing to modify it to be faster still by sorting based on lane and tile number; hopefully, in most cases, it can become >100x faster.
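To put a rough number on that clump (back-of-the-envelope only):

Code:
N=293296
echo $(( N * (N - 1) / 2 ))   # ~4.3e10 pairwise header comparisons for the all-vs-all approach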
Brian Bushnell is offline   Reply With Quote
Old 05-11-2017, 03:58 PM   #76
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

Hi Chiayi,

I just released v37.23, which has this issue fixed. The time for optical deduplication of that file dropped from 59,436 seconds to 146 seconds, which is a pretty nice improvement.
Brian Bushnell is offline   Reply With Quote
Old 05-12-2017, 08:38 AM   #77
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Hi Brian,

Thank you so much for all the troubleshooting and effort. I really appreciate it.

I also worked with our IT and found that the slowdown when I set -Xmx to ~80% of physical memory was core-specific and may be caused by the performance of different CPUs. I thought this might be relevant to others who experience a similar situation.

Thanks again for your time and for developing such a great suite of tools.

Best,
Chia-Yi
chiayi is offline   Reply With Quote
Old 05-18-2017, 11:29 AM   #78
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,553
Default

I've now released 37.24, which has some nice optical-deduplication improvements. It's now faster (Chiayi's dataset now takes 62 seconds), and there are improvements in precision for NextSeq tile-edge duplicates. Specifically, it is now recommended that they be removed like this:

Code:
clumpify.sh in=nextseq.fq.gz out=clumped.fq.gz dedupe optical spany adjacent

This will remove all normal optical duplicates, and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist), whereas before they could be in any tiles and could share their X-coordinate. This means that there are fewer false positives (PCR or coincidental duplicates that were being classified as optical/tile-edge duplicates). This is possible because on NextSeq the tile-edge duplicates are only present on the tile X-edges, and the duplicates only occur between adjacent tiles.
Brian Bushnell is offline   Reply With Quote