  • #76
    Hi Chiayi,

    I can't replicate the slowdown from -Xmx settings - that seems to be a result of your filesystem and virtual memory, caching, and overcommit settings, which are causing disk-swapping. But I'm glad you got it working at a reasonable speed, and hopefully this will help others who have had extremely slow performance in some situations.

    I've identified the problem causing the slowdown with optical deduplication. Your dataset contains one huge clump of 293,296 reads, with a huge number of duplicates that are not optical duplicates. In that situation performance can become O(N^2) in the size of the clump, which is very slow (though it is still making progress), since Clumpify currently compares every duplicate to every other duplicate to check whether they are within the distance limit of each other, and parses both headers every time. I've modified it to be 5x faster, and I am continuing to speed it up by sorting on lane and tile number; hopefully, in most cases, it can become >100x faster.

    Comment


    • #77
      Hi Chiayi,

      I just released v37.23, which has this issue fixed. The time for optical deduplication of that file dropped from 59436 seconds to 146 seconds, which is a pretty nice improvement.

      Comment


      • #78
        Hi Brian,

        Thank you so much for all the troubleshooting and effort. I really appreciate it.

        I also worked with our IT and found that the slowdown when I set -Xmx to ~80% of physical memory was core-specific and may be caused by performance differences between CPUs. I thought this might be relevant to others who experience a similar situation.

        Thanks again for your time and for developing such a great suite of tools.

        Best,
        Chia-Yi

        Comment


        • #79
          I've now released 37.24 which has some nice optical deduplication improvements. It's now faster (Chiayi's dataset now takes 62 seconds), and there are improvements in precision for NextSeq tile-edge duplicates. Specifically, it is now recommended that they be removed like this:

          clumpify.sh in=nextseq.fq.gz out=clumped.fq.gz dedupe optical spany adjacent

          This will remove all normal optical duplicates and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist); previously, they could be in any tiles and could share their X-coordinate instead. This means fewer false positives (PCR or coincidental duplicates that were being classified as optical/tile-edge duplicates). This is possible because on NextSeq, tile-edge duplicates occur only on the tile X-edges and only between adjacent tiles.

          Comment


          • #80
            deduplication with clumpify

            Hello,

            I probably have a problem with PCR duplicates and thought I would use Clumpify to remove them. I did some tests and realised that if the reads don't have the same length, they are not marked as duplicates. E.g. if I remove one nucleotide from the end of a read, the pair is no longer marked as duplicates.

            Is this behaviour intentional or a bug? I think that if two paired-end reads start at the same position and are identical (allowing some mismatches), they can be considered PCR duplicates, can't they? The pairs of reads don't necessarily need to stop at the same position, especially since the Processing Guide recommends deduplication after quality trimming; during trimming, PCR-duplicated reads can be trimmed to different lengths.

            During quality trimming, one read of a pair might also be removed, and I don't know how to find duplicates between a single-end and a paired-end library.

            Could you help me?

            Comment


            • #81
              Originally posted by silask View Post

              Is this behaviour intentional or a bug? I think that if two paired-end reads start at the same position and are identical (allowing some mismatches), they can be considered PCR duplicates, can't they? The pairs of reads don't necessarily need to stop at the same position, especially since the Processing Guide recommends deduplication after quality trimming; during trimming, PCR-duplicated reads can be trimmed to different lengths.
              Here is my take (could be wrong). That processing guide may have been written before Clumpify existed. You should use Clumpify on raw data, before anything is done to it; that is the best way to identify duplicates. You can then follow that up with trimming (see the sketch at the end of this post).
              During quality trimming, one read of a pair might also be removed, and I don't know how to find duplicates between a single-end and a paired-end library.

              Could you help me?
              If you are removing one read during quality trimming, then ideally you should also remove its mate from the paired file to keep the paired-end sequences in order.

              That said, if you wanted to find duplicates between a single-end and a PE library, you could always reverse-complement reads using reformat.sh and then run Clumpify on two files at a time, treating them as single-end reads.
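
              A minimal sketch of both suggestions, using hypothetical file names (a sketch, not a vetted pipeline):

              Code:
              # 1) Deduplicate the raw reads first, then quality-trim:
              clumpify.sh in1=raw_1.fq.gz in2=raw_2.fq.gz out1=dd_1.fq.gz out2=dd_2.fq.gz dedupe
              bbduk.sh in1=dd_1.fq.gz in2=dd_2.fq.gz out1=trim_1.fq.gz out2=trim_2.fq.gz qtrim=rl trimq=10

              # 2) Compare a single-end and a paired-end library: reverse-complement one
              # side with reformat.sh, pool the files, and deduplicate the pool as
              # single-end reads.
              reformat.sh in=se_library.fq.gz out=se_rc.fq rcomp
              zcat pe_library_1.fq.gz > pe_1.fq
              cat se_rc.fq pe_1.fq > pooled.fq
              clumpify.sh in=pooled.fq out=pooled_dedupe.fq dedupe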

              Comment


              • #82
                Thank you GenoMax for the quick answer.

                Ok, I see. The processing guide doesn't mention Clumpify. If I use Clumpify before trimming, I don't have the problem of single-end vs. paired-end duplicates. However, Clumpify still requires the reads to be exactly the same length, which is even stranger given that the nucleotides at the end of the reads would likely be trimmed away in the subsequent trimming step anyway.

                On a test set with two paired-end raw reads which are normally detected as duplicates, I can prevent the reads from being marked as duplicates by removing just one nt from the end.

                Comment


                • #83
                  Originally posted by silask View Post
                  On a test set with two paired-end raw reads which are normally detected as duplicates, I can prevent the reads from being marked as duplicates by removing just one nt from the end.
                  I am not sure exactly what you are referring to. Clumpify by default will allow two substitutions (errors, if you will). If you want strict matching, use dupesubs=0. Can you include the command-line options you are using?
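
                  For reference, a minimal strict-matching sketch with hypothetical file names:

                  Code:
                  # dupesubs=0 disallows any substitutions between duplicates
                  clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=dd1.fq.gz out2=dd2.fq.gz dedupe dupesubs=0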
                  Last edited by GenoMax; 04-18-2018, 03:24 AM.

                  Comment


                  • #84
                    Sorry. For example, I have two reads which are 250 and 251 nt long and otherwise identical.
                    Clumpify doesn't mark them as duplicates even with dupesubs=2. I would say the reads are duplicates; what do you think?

                    Comment


                    • #85
                      Interesting point. I have always worked with data of uniform length. Based on what you have discovered, Clumpify does seem to have an underlying assumption that the reads are all of equal length.

                      Two options come to mind:

                      1. You could trim that extra base off the end of the 251 bp reads to make them 250 bp using bbduk.sh (see the sketch after the dedupe.sh usage below).
                      2. You could try using dedupe.sh, which can match subsequences.
                      Code:
                      dedupe.sh
                      
                      Written by Brian Bushnell and Jonathan Rood
                      Last modified March 9, 2017
                      
                      Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).
                      Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
                      Can also find overlapping sequences and group them into clusters.
                      Please read bbmap/docs/guides/DedupeGuide.txt for more information.
                      
                      Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>
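
                      For option 1, a minimal bbduk.sh sketch with hypothetical file names (ftr, i.e. forcetrimright, is 0-based, so ftr=249 keeps bases 0-249, i.e. 250 bp):

                      Code:
                      # Force-trim everything to the right of position 249
                      bbduk.sh in1=reads_1.fq.gz in2=reads_2.fq.gz out1=trimmed_1.fq.gz out2=trimmed_2.fq.gz ftr=249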

                      Comment


                      • #86
                        After a Clumpify command to remove duplicates using in1/in2 and out1/out2, it seems to produce only one output file, breaking the downstream pipeline! Why does this happen?

                        ./bbmap/clumpify.sh in1=./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz in2=./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz out1=./Preproccesing/${ERR}/${ERR}_1_optical.fastq.gz out2=./Preproccesing/${ERR}/${ERR}_2_optical.fastq.gz dedupe=true optical=true overwrite=true

                        ------

                        Reset INTERLEAVED to false because paired input files were specified.
                        Set INTERLEAVED to false
                        Input is being processed as paired
                        Writing interleaved.
                        Made a comparator with k=31, seed=1, border=1, hashes=4
                        Time: 22.512 seconds.
                        Reads Processed: 13371k 593.99k reads/sec
                        Bases Processed: 1145m 50.88m bases/sec
                        Executing clump.KmerSort3 [in1=./Preproccesing/ERR522065/ERR522065_1_optical_clumpify_p1_temp%_10a607a7b7090ec6.fastq.gz, in2=, out=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, out2=, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true]

                        ------

                        java -Djava.library.path=/mnt/scratchdir/home/kyriakidk/KIWI/bbmap/jni/ -ea -Xmx33412m -Xms33412m -cp /mnt/scratchdir/home/kyriakidk/KIWI/bbmap/current/ jgi.BBDukF in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz
                        Executing jgi.BBDukF [in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz]
                        Version 38.11

                        No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
                        Exception in thread "main" java.lang.RuntimeException: Can't read file './Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz'
                        Last edited by kokyriakidis; 07-21-2018, 12:51 AM.

                        Comment


                        • #87
                          It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
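
                          A quick way to check, assuming the same paths as in your command:

                          Code:
                          # Is the variable set in the shell that runs the pipeline?
                          echo "ERR=${ERR}"
                          # Do the expanded input paths actually exist?
                          ls -l ./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz ./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz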

                          Comment


                          • #88
                            Originally posted by GenoMax View Post
                            It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
                            Yes! All files are in the same folder! Actually, neither clumpify dedupe optical nor filterbytile works, so I have to remove them in order to complete my pipeline...

                            Comment


                            • #89
                              Are you using the latest version of BBMap? Have you tried to run a test with actual file names instead of shell variables?
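
                              For example, a literal-filename test run (paths taken from your log, no shell variables):

                              Code:
                              ./bbmap/clumpify.sh in1=./Preproccesing/ERR522065/ERR522065_1_1.fastq.gz in2=./Preproccesing/ERR522065/ERR522065_2_1.fastq.gz out1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz out2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz dedupe=true optical=true overwrite=true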

                              Comment


                              • #90
                                I use the latest version of BBTools. I can't get it to work.

                                Comment
