Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • dedupe.sh

    Dear Brian,

    I have been looking for a tool that would quickly dereplicate (100% containments) nucleotide sequences and track for each unique sequence the identifiers of the removed duplicates.

    Something like:

    dedupe.sh in=in.fa out=out.fa outd=outd.fa mid=100 mop=100

    where:

    in.fa:
    seq1
    seq2 (contained in seq1)
    seq3 (contained in seq1)
    seq4

    out.fa:
    seq1
    seq4

    outd.fa:
    seq2
    seq3

    I am interested in:
    seq1<tab>seq2,seq3
    seq4

    dedupe.sh does a fantastic job in returning out and outd, but I cannot find any option that would return the information I am interested in. Is this something that I am missing? Otherwise, I believe this could be a great feature, since compared to other tools that return this information, dedupe is so much faster.

    Best,
    Shini

    Comment


    • Hi Shini,

      Dedupe technically has the ability to do that, although currently it doesn't actually do it. Hmmm...

      It does have a "uniqueonly" flag; if you use that, it will only send sequences to "out" if they are unique, and anything nonunique gets sent to "outd" (so if sequences A and B are duplicates, both get sent to outd). At a minimum, that will make the problem much smaller so that maybe it can be dealt with by a slower program that has the desired output format, if there is one. For mostly non-redundant data, that should accelerate things by a huge factor.

      I don't have very much free time these days, so while I agree that this would be useful, I'm not sure when I'd have an opportunity to implement it. Therefore - I've added it to my TODO list, but that list is pretty long. Do you happen to have a programmer at your disposal? If so, Dedupe's dot output (for example, "dot=overlaps.dot") does contain all of the information of which sequences overlap which other sequences (if you run Dedupe with the correct flags - "am=f ac=f fo dot=dot.txt"), and can be transformed into the text output you're looking for. It would be very difficult for a non-programmer to do the transformation. Though, it is readable by non-programmers, so you might take a look.

      Sorry I can't help further, though I will try to add that functionality when possible.

      Comment


      • Hi Brian,

        Thanks for you quick response, really appreciate you took some of your limited time!

        I hoped the dot file might be just what I was looking for, so I ran

        dedupe.sh in=in.fa out=out.fa outd=outd.fa am=t ac=t fo pc dot=dot.txt

        For clarification: I set
        am=t (100% identical sequences SHOULD be absorbed)
        ac=t (100% contained sequences SHOULD be absorbed)
        fo=f (I do not want to cluster/assemble sequences)
        dot=dot.txt

        However, the dot file appears to contain the indices of the unique input sequences only, but what is missing are the indices of all contained sequences, that is, you get the same result irrespective of the number of different (or same) substring sequences of a given sequence (while I would be interested in ALL the indices/ids of the different sequences).

        To give you an idea on the speedup, I tested two programs that would output the information of interest and they both took about 1 hour to dereplicate 1 M sequences, while dedupe does it in 25 seconds (!). To be fair, I am not sure how much overhead it is to keep track of the ids/indices and to print them.

        Anyway, the results seem equivalent, so it would be fantastic if this became a feature one day!

        Best,
        Shini

        Comment


        • Hi Shini,

          am=t (100% identical sequences SHOULD be absorbed)
          ac=t (100% contained sequences SHOULD be absorbed)
          fo=f (I do not want to cluster/assemble sequences)
          Even though you want to absorb them, things that are absorbed will not show up in the dot output. Also, I think you have to set "fo=t" because the dot file prints out all the overlaps that are found (which include containments). Dedupe runs in 3 phases:

          1) Load reads and find exact matches.
          2) Find containments.
          3) Find overlaps.

          Phase 1 and 2 do not actually generate the information needed for the dot file, only phase 3. So, you need "am=f ac=f fo=t". Sorry!

          As for the speed, yep, I put a lot of effort into making it very fast. I wrote it mainly because we used Minimus2 for deduplication our merged assemblies, which for large metagenomes would take days and then often crash or run out of time.

          Anyway, I'll take a look at it this weekend and see if that's something that's really easy to add.

          Comment


          • Just to make sure we are on the same page, I attach the toy input file I described earlier.

            Running:

            dedupe.sh in=in.fa out=out.fa outd=outd.fa am=t ac=t fo=f dot=dot.txt

            will give the desired output:
            seq1 and seq4 in out.fa
            seq2 and seq3 in outd.fa (since they are contained in seq1)

            but no dot.txt file.

            Running:

            dedupe.sh in=in.fa out=out.fa outd=outd.fa am=f ac=f fo=t dot=dot.txt

            as you suggest, will output all 4 sequences in out.fa and none in outd.fa (again no dot.txt). To get the dot.txt file written, I have to add pc=t, but still out.fa contains all 4 sequences, including the two 100% substrings (seq2,seq3) of seq1.

            Would be great if you found time to look into this!
            Attached Files

            Comment


            • Trying out bbmap for the first time...I'm aligning illumina 2x275 reads to a fungal genome. Error and sub rates for the first few samples are high, around 70%. I'm assuming the high error rate is due to the library prep (nextera xt). Insert size ranges from 400-2k. Any explanation for the high sub rate or are they both tied together. Thanks.

              Comment


              • BBMap reports error rates in two pairs of columns; the first is the total number and fraction of reads with any errors, the second is the total number and fraction of bases that are errors. Typically, for Illumina data, the most important number is be the per-base substitution rate, which is hopefully under 3% or so, but it can get pretty high with such long reads (I assume this was a MiSeq 2x300bp kit). Posting the entire screen output here might be helpful.

                Also, note that there are (at least?) two completely different Nextera protocols, one for fragment libraries and one for long-mate pair libraries. Since you have an insert range of 400-2000bp, I assume you are using Nextera long-mate pairs. Is that correct? These cannot be mapped directly, but need to be preprocessed first, as they have a chimeric junction somewhere. To do that, first adapter-trim using BBDuk, then run splitnextera.sh on them:

                splitnextera.sh in=<file> out=<file> outf=<file> outu=<file> outs=<file> mask=t

                This will remove the junction adapters and split the reads into multiple output files which should be mapped independently. out is for long-mate pairs; outf is for short fragments; outu is for unknown orientation pairs; and outs is for singletons. If nothing comes out as long mate pairs, then you don't have Nextera LMP data and should not perform this step. It's best to check with the people who prepared the library, though.
                Last edited by Brian Bushnell; 02-09-2016, 02:10 PM.

                Comment


                • Ouput below. This is a nextera XT fragment library that I made and sequenced. Average insert size is around 900bp. I used the defualut bbmap settings adding qtrim=r and pairlen=2000

                  Code:
                  BBMap version 34.94
                  Retaining first best site only for ambiguous mappings.
                  Executing dna.FastaToChromArrays2 [/data/U.maydis/GCA_000328475.2_Umaydis521_2.0_genomic.fna, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]
                  
                  Set genScaffoldInfo=true
                  Set genome to 1
                  
                  Loaded Reference:	0.005 seconds.
                  Loading index for chunk 1-1, build 1
                  Indexing threads started for block 0-1
                  Indexing threads finished for block 0-1
                  Generated Index:	1.753 seconds.
                  Analyzed Index:   	3.196 seconds.
                  Started output stream:	0.018 seconds.
                  Cleared Memory:    	0.180 seconds.
                  Processing reads in paired-ended mode.
                  Started read stream.
                  Started 12 mapping threads.
                  Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
                  
                     ------------------   Results   ------------------   
                  
                  Genome:                	1
                  Key Length:            	13
                  Max Indel:             	16000
                  Minimum Score Ratio:  	0.56
                  Mapping Mode:         	normal
                  Reads Used:           	10712672	(2767819477 bases)
                  
                  Mapping:          	2375.634 seconds.
                  Reads/sec:       	4509.39
                  kBases/sec:      	1165.09
                  
                  
                  Pairing data:   	pct reads	num reads 	pct bases	   num bases
                  
                  mated pairs:     	 95.9222% 	  5137917 	 95.9216% 	  2654935852
                  bad pairs:       	  0.3883% 	    20798 	  0.4112% 	    11380974
                  insert size avg: 	  457.71
                  
                  
                  Read 1 data:      	pct reads	num reads 	pct bases	   num bases
                  
                  mapped:          	 97.7238% 	  5234416 	 97.8595% 	  1353706554
                  unambiguous:     	 96.7062% 	  5179910 	 96.8763% 	  1340106299
                  ambiguous:       	  1.0176% 	    54506 	  0.9832% 	    13600255
                  low-Q discards:  	  0.0474% 	     2537 	  0.0064% 	       88795
                  
                  perfect best site:	 62.2963% 	  3336797 	 61.0063% 	   843909803
                  semiperfect site:	 62.3484% 	  3339588 	 61.0594% 	   844644584
                  rescued:         	  0.0469% 	     2511
                  
                  Match Rate:      	      NA 	       NA 	 97.9361% 	  1345904929
                  Error Rate:      	 36.1452% 	  1892020 	  2.0011% 	    27499816
                  Sub Rate:        	 35.0788% 	  1836199 	  0.4172% 	     5733414
                  Del Rate:        	  2.4959% 	   130648 	  1.4962% 	    20562060
                  Ins Rate:        	  1.7790% 	    93120 	  0.0876% 	     1204342
                  N Rate:          	  0.4722% 	    24718 	  0.0629% 	      863869
                  
                  
                  Read 2 data:      	pct reads	num reads 	pct bases	   num bases
                  
                  mapped:          	 96.5932% 	  5173855 	 96.6576% 	  1338226877
                  unambiguous:     	 95.5860% 	  5119909 	 95.6852% 	  1324764021
                  ambiguous:       	  1.0071% 	    53946 	  0.9724% 	    13462856
                  low-Q discards:  	  0.0478% 	     2561 	  0.0067% 	       92659
                  
                  perfect best site:	 28.1750% 	  1509145 	 24.9411% 	   345310812
                  semiperfect site:	 28.2006% 	  1510517 	 24.9662% 	   345657705
                  rescued:         	  0.0547% 	     2932
                  
                  Match Rate:      	      NA 	       NA 	 96.0340% 	  1307794049
                  Error Rate:      	 70.7813% 	  3662191 	  3.9015% 	    53130664
                  Sub Rate:        	 70.3060% 	  3637596 	  2.0494% 	    27908811
                  Del Rate:        	  2.8907% 	   149564 	  1.7313% 	    23576787
                  Ins Rate:        	  2.7860% 	   144147 	  0.1208% 	     1645066
                  N Rate:          	  0.5726% 	    29628 	  0.0645% 	      878951
                  
                  Total time:     	2381.122 seconds.
                  Last edited by GenoMax; 02-10-2016, 09:55 AM. Reason: Changed to CODE tags for better readability

                  Comment


                  • Oh, that's probably fine. You'd have a much higher error rate if this was a Nextera LMP library that had not been correctly preprocesed. As you can see, there is a 0.4% per-base substitution rate for read 1, which is pretty good for such long reads. Read 2 is higher, as expected. It's normal for long reads to have a high fraction containing at least one substitution.

                    Comment


                    • Is the insert size measured or inferred?

                      Comment


                      • Originally posted by GenoMax View Post
                        Is the insert size measured or inferred?
                        measured via fragment analyzer

                        Comment


                        • Thanks for the quick reply Brian...Does bbmap have any SNP calling functionality? Or should I just feed these .sam files into the samtools pipeline?

                          Comment


                          • BBMap has no SNP-calling functionality, so you'll need to use a variant-caller (samtools, GATK, FreeBayes, etc).

                            Comment


                            • sam files are an accepted input file type for dedupe.sh correct? Or would you recommend removing duplicates from the fastq files?

                              Code:
                              [root@g300-149-b0 bbmap]# ./dedupe.sh Um57.sam Um57_nodups.sam 
                              Max memory cannot be determined.  Attempting to use 3200 MB.
                              If this fails, please add the argument -Xmx29g (adjusted to roughly 85 percent of physical RAM).
                              java -Djava.library.path=/app/bbmap/jni/ -ea -Xmx3200m -Xms3200m -cp /app/bbmap/current/ jgi.Dedupe Um57.sam Um57_nodups.sam
                              Executing jgi.Dedupe [Um57.sam, Um57_nodups.sam]
                              
                              Initial:
                              Memory: free=3182m, used=34m
                              
                              Exception in thread "Thread-3" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-14" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-13" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-11" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-12" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-10" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-9" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-8" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-7" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-5" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-4" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Exception in thread "Thread-6" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
                              	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
                              	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
                              	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
                              Found 0 duplicates.
                              Finished exact matches.    Time: 0.072 seconds.
                              Memory: free=2930m, used=286m
                              
                              Found 0 contained sequences.
                              Finished containment.      Time: 0.006 seconds.
                              Memory: free=2712m, used=504m
                              
                              Removed 0 invalid entries.
                              Finished invalid removal.  Time: 0.001 seconds.
                              Memory: free=2712m, used=504m
                              
                              Input:                  	12 reads 		3295 bases.
                              Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
                              Containments:           	0 reads (0.00%) 	0 bases (0.00%)    	0 collisions.
                              Result:                 	0 reads (0.00%) 	3295 bases (100.00%)
                              
                              Printed output.            Time: 0.003 seconds.
                              Memory: free=2695m, used=521m
                              
                              Time:   			0.089 seconds.
                              Reads Processed:          12 	0.14k reads/sec
                              Bases Processed:        3295 	0.04m bases/sec
                              [root@g300-149-b0 bbmap]

                              Comment


                              • Originally posted by lac302 View Post
                                sam files are an accepted input file type for dedupe.sh correct? Or would you recommend removing duplicates from the fastq files?
                                Dedupe can't use sam files.

                                Code:
                                Input may be fasta or fastq, compressed or uncompressed.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                21 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X