SEQanswers

Old 01-28-2016, 05:27 PM   #301
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Shini,

Dedupe technically has the ability to do that, although currently it doesn't actually do it. Hmmm...

It does have a "uniqueonly" flag; if you use that, it will only send sequences to "out" if they are unique, and anything nonunique gets sent to "outd" (so if sequences A and B are duplicates, both get sent to outd). At a minimum, that will make the problem much smaller so that maybe it can be dealt with by a slower program that has the desired output format, if there is one. For mostly non-redundant data, that should accelerate things by a huge factor.
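The "uniqueonly" behavior described above can be illustrated with a toy Python sketch (this is just the semantics, not the BBTools implementation):

```python
from collections import Counter

def unique_only(seqs):
    """Toy sketch of the 'uniqueonly' semantics (NOT the BBTools code):
    sequences occurring exactly once go to 'out'; every copy of a
    duplicated sequence goes to 'outd'."""
    counts = Counter(seqs)
    out = [s for s in seqs if counts[s] == 1]
    outd = [s for s in seqs if counts[s] > 1]
    return out, outd
```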

I don't have much free time these days, so while I agree that this would be useful, I'm not sure when I'd have an opportunity to implement it. Therefore, I've added it to my TODO list, but that list is pretty long. Do you happen to have a programmer at your disposal? If so, Dedupe's dot output (for example, "dot=overlaps.dot") contains all of the information about which sequences overlap which others (if you run Dedupe with the correct flags - "am=f ac=f fo dot=dot.txt"), and it can be transformed into the text output you're looking for. The transformation would be very difficult for a non-programmer, but the dot file itself is human-readable, so you might take a look.
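If a programmer does pick this up, the transformation could start from something like the following Python sketch. The exact edge syntax in Dedupe's dot output is an assumption here (standard Graphviz `a -> b` / `a -- b` edges); inspect your own dot file before relying on it:

```python
import re
from collections import defaultdict

def overlaps_from_dot(dot_text):
    """Collect which sequences overlap which others from a dot file.
    Assumes standard Graphviz edge lines ('a -> b' or 'a -- b');
    builds an undirected adjacency map of sequence ids."""
    adj = defaultdict(set)
    for m in re.finditer(r'(\w+)\s*-[->]\s*(\w+)', dot_text):
        a, b = m.group(1), m.group(2)
        adj[a].add(b)
        adj[b].add(a)
    return adj
```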

Sorry I can't help further, though I will try to add that functionality when possible.
Brian Bushnell is offline   Reply With Quote
Old 01-29-2016, 04:09 AM   #302
Shini Sunagawa
Junior Member
 
Location: Germany

Join Date: Jan 2016
Posts: 8
Default

Hi Brian,

Thanks for your quick response; I really appreciate you taking some of your limited time!

I hoped the dot file might be just what I was looking for, so I ran

dedupe.sh in=in.fa out=out.fa outd=outd.fa am=t ac=t fo pc dot=dot.txt

For clarification: I set
am=t (100% identical sequences SHOULD be absorbed)
ac=t (100% contained sequences SHOULD be absorbed)
fo=f (I do not want to cluster/assemble sequences)
dot=dot.txt

However, the dot file appears to contain only the indices of the unique input sequences; the indices of all contained sequences are missing. That is, you get the same result regardless of how many different (or identical) substring sequences a given sequence has, while I would be interested in ALL the indices/ids of the different sequences.

To give you an idea of the speedup: I tested two programs that would output the information of interest, and both took about 1 hour to dereplicate 1 M sequences, while dedupe does it in 25 seconds (!). To be fair, I am not sure how much overhead keeping track of the ids/indices and printing them would add.

Anyway, the results seem equivalent, so it would be fantastic if this became a feature one day!

Best,
Shini
Shini Sunagawa is offline   Reply With Quote
Old 01-29-2016, 08:53 PM   #303
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Shini,

Quote:
am=t (100% identical sequences SHOULD be absorbed)
ac=t (100% contained sequences SHOULD be absorbed)
fo=f (I do not want to cluster/assemble sequences)
Even though you want to absorb them, things that are absorbed will not show up in the dot output. Also, I think you have to set "fo=t" because the dot file prints out all the overlaps that are found (which include containments). Dedupe runs in 3 phases:

1) Load reads and find exact matches.
2) Find containments.
3) Find overlaps.

Phase 1 and 2 do not actually generate the information needed for the dot file, only phase 3. So, you need "am=f ac=f fo=t". Sorry!
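A toy Python sketch of phases 1 and 2 (illustrative only - the real Dedupe is hash-based, multithreaded, and also handles reverse complements):

```python
def dedupe_phases(seqs):
    """Toy sketch of Dedupe's first two phases (NOT the real implementation;
    ignores reverse complements). Phase 1 absorbs exact matches; phase 2
    absorbs sequences fully contained in a longer survivor."""
    # Phase 1: exact matches (keep the first occurrence of each sequence)
    seen, survivors = set(), []
    for s in seqs:
        if s not in seen:
            seen.add(s)
            survivors.append(s)
    # Phase 2: containments (process longest first so substrings get absorbed)
    survivors.sort(key=len, reverse=True)
    kept = []
    for s in survivors:
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept
```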

As for the speed, yep, I put a lot of effort into making it very fast. I wrote it mainly because we used Minimus2 for deduplicating our merged assemblies, which for large metagenomes would take days and then often crash or run out of time.

Anyway, I'll take a look at it this weekend and see if that's something that's really easy to add.
Brian Bushnell is offline   Reply With Quote
Old 01-30-2016, 01:19 AM   #304
Shini Sunagawa
Junior Member
 
Location: Germany

Join Date: Jan 2016
Posts: 8
Default

Just to make sure we are on the same page, I attach the toy input file I described earlier.

Running:

dedupe.sh in=in.fa out=out.fa outd=outd.fa am=t ac=t fo=f dot=dot.txt

will give the desired output:
seq1 and seq4 in out.fa
seq2 and seq3 in outd.fa (since they are contained in seq1)

but no dot.txt file.

Running:

dedupe.sh in=in.fa out=out.fa outd=outd.fa am=f ac=f fo=t dot=dot.txt

as you suggest, will output all 4 sequences in out.fa and none in outd.fa (again no dot.txt). To get the dot.txt file written, I have to add pc=t, but still out.fa contains all 4 sequences, including the two 100% substrings (seq2,seq3) of seq1.
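For reference, the absorbed-by mapping I'm after can be sketched in Python on a toy input like this (a brute-force hypothetical helper, not an existing Dedupe feature; the sequences below are placeholders):

```python
def containment_map(named_seqs):
    """Map each duplicate/contained sequence id to the id of the sequence
    that absorbs it. Brute force - fine for toy inputs only."""
    # Process longest first so shorter sequences can only be absorbed
    # by something already kept.
    items = sorted(named_seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    absorbed, keepers = {}, []
    for name, seq in items:
        hit = next((kn for kn, ks in keepers if seq in ks), None)
        if hit:
            absorbed[name] = hit
        else:
            keepers.append((name, seq))
    return absorbed
```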

Would be great if you found time to look into this!
Attached Files
File Type: gz in.fa.gz (101 Bytes, 0 views)
Shini Sunagawa is offline   Reply With Quote
Old 02-09-2016, 12:51 PM   #305
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Trying out BBMap for the first time... I'm aligning Illumina 2x275 reads to a fungal genome. Error and sub rates for the first few samples are high, around 70%. I'm assuming the high error rate is due to the library prep (Nextera XT). Insert size ranges from 400-2k. Any explanation for the high sub rate, or are they both tied together? Thanks.
lac302 is offline   Reply With Quote
Old 02-09-2016, 01:07 PM   #306
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBMap reports error rates in two pairs of columns: the first is the total number and fraction of reads with any errors; the second is the total number and fraction of bases that are errors. Typically, for Illumina data, the most important number is the per-base substitution rate, which is hopefully under 3% or so, but it can get pretty high with such long reads (I assume this was a MiSeq 2x300bp kit). Posting the entire screen output here might be helpful.
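The relationship between the two column pairs can be shown with a small Python sketch (my reading of the output described above, not BBMap's code); note how a modest per-base rate still gives a large fraction of reads with at least one error:

```python
def error_rates(reads):
    """reads: list of (n_bases, n_error_bases) per read.
    Returns (fraction of reads with any error, fraction of bases in error) -
    the two kinds of columns in BBMap's report."""
    reads_with_errors = sum(1 for _, e in reads if e > 0)
    total_bases = sum(n for n, _ in reads)
    error_bases = sum(e for _, e in reads)
    return reads_with_errors / len(reads), error_bases / total_bases
```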

Also, note that there are (at least?) two completely different Nextera protocols, one for fragment libraries and one for long-mate pair libraries. Since you have an insert range of 400-2000bp, I assume you are using Nextera long-mate pairs. Is that correct? These cannot be mapped directly, but need to be preprocessed first, as they have a chimeric junction somewhere. To do that, first adapter-trim using BBDuk, then run splitnextera.sh on them:

splitnextera.sh in=<file> out=<file> outf=<file> outu=<file> outs=<file> mask=t

This will remove the junction adapters and split the reads into multiple output files which should be mapped independently. out is for long-mate pairs; outf is for short fragments; outu is for unknown orientation pairs; and outs is for singletons. If nothing comes out as long mate pairs, then you don't have Nextera LMP data and should not perform this step. It's best to check with the people who prepared the library, though.
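The core idea of junction splitting can be sketched in Python. This is NOT splitnextera.sh's actual algorithm (which also handles masking, pair orientation, and the real junction adapter); the junction motif and length cutoff below are placeholders:

```python
def split_at_junction(read, junction, min_len=20):
    """Conceptual sketch of splitting one chimeric read at a junction motif
    (placeholder motif and cutoff, not splitnextera.sh's logic): if the
    junction occurs, return the flanking pieces that are long enough;
    otherwise return the read whole."""
    i = read.find(junction)
    if i < 0:
        return [read]
    left, right = read[:i], read[i + len(junction):]
    return [p for p in (left, right) if len(p) >= min_len]
```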

Last edited by Brian Bushnell; 02-09-2016 at 01:10 PM.
Brian Bushnell is offline   Reply With Quote
Old 02-10-2016, 07:36 AM   #307
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Output below. This is a Nextera XT fragment library that I made and sequenced. Average insert size is around 900bp. I used the default BBMap settings, adding qtrim=r and pairlen=2000

Code:
BBMap version 34.94
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [/data/U.maydis/GCA_000328475.2_Umaydis521_2.0_genomic.fna, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Set genome to 1

Loaded Reference:	0.005 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index:	1.753 seconds.
Analyzed Index:   	3.196 seconds.
Started output stream:	0.018 seconds.
Cleared Memory:    	0.180 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 12 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

   ------------------   Results   ------------------   

Genome:                	1
Key Length:            	13
Max Indel:             	16000
Minimum Score Ratio:  	0.56
Mapping Mode:         	normal
Reads Used:           	10712672	(2767819477 bases)

Mapping:          	2375.634 seconds.
Reads/sec:       	4509.39
kBases/sec:      	1165.09


Pairing data:   	pct reads	num reads 	pct bases	   num bases

mated pairs:     	 95.9222% 	  5137917 	 95.9216% 	  2654935852
bad pairs:       	  0.3883% 	    20798 	  0.4112% 	    11380974
insert size avg: 	  457.71


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 97.7238% 	  5234416 	 97.8595% 	  1353706554
unambiguous:     	 96.7062% 	  5179910 	 96.8763% 	  1340106299
ambiguous:       	  1.0176% 	    54506 	  0.9832% 	    13600255
low-Q discards:  	  0.0474% 	     2537 	  0.0064% 	       88795

perfect best site:	 62.2963% 	  3336797 	 61.0063% 	   843909803
semiperfect site:	 62.3484% 	  3339588 	 61.0594% 	   844644584
rescued:         	  0.0469% 	     2511

Match Rate:      	      NA 	       NA 	 97.9361% 	  1345904929
Error Rate:      	 36.1452% 	  1892020 	  2.0011% 	    27499816
Sub Rate:        	 35.0788% 	  1836199 	  0.4172% 	     5733414
Del Rate:        	  2.4959% 	   130648 	  1.4962% 	    20562060
Ins Rate:        	  1.7790% 	    93120 	  0.0876% 	     1204342
N Rate:          	  0.4722% 	    24718 	  0.0629% 	      863869


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 96.5932% 	  5173855 	 96.6576% 	  1338226877
unambiguous:     	 95.5860% 	  5119909 	 95.6852% 	  1324764021
ambiguous:       	  1.0071% 	    53946 	  0.9724% 	    13462856
low-Q discards:  	  0.0478% 	     2561 	  0.0067% 	       92659

perfect best site:	 28.1750% 	  1509145 	 24.9411% 	   345310812
semiperfect site:	 28.2006% 	  1510517 	 24.9662% 	   345657705
rescued:         	  0.0547% 	     2932

Match Rate:      	      NA 	       NA 	 96.0340% 	  1307794049
Error Rate:      	 70.7813% 	  3662191 	  3.9015% 	    53130664
Sub Rate:        	 70.3060% 	  3637596 	  2.0494% 	    27908811
Del Rate:        	  2.8907% 	   149564 	  1.7313% 	    23576787
Ins Rate:        	  2.7860% 	   144147 	  0.1208% 	     1645066
N Rate:          	  0.5726% 	    29628 	  0.0645% 	      878951

Total time:     	2381.122 seconds.

Last edited by GenoMax; 02-10-2016 at 08:55 AM. Reason: Changed to CODE tags for better readability
lac302 is offline   Reply With Quote
Old 02-10-2016, 08:58 AM   #308
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh, that's probably fine. You'd have a much higher error rate if this was a Nextera LMP library that had not been correctly preprocessed. As you can see, there is a 0.4% per-base substitution rate for read 1, which is pretty good for such long reads. Read 2 is higher, as expected. It's normal for long reads to have a high fraction containing at least one substitution.
Brian Bushnell is offline   Reply With Quote
Old 02-10-2016, 09:00 AM   #309
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Is the insert size measured or inferred?
GenoMax is offline   Reply With Quote
Old 02-10-2016, 11:42 AM   #310
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Quote:
Originally Posted by GenoMax View Post
Is the insert size measured or inferred?
measured via fragment analyzer
lac302 is offline   Reply With Quote
Old 02-10-2016, 11:48 AM   #311
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Thanks for the quick reply Brian...Does bbmap have any SNP calling functionality? Or should I just feed these .sam files into the samtools pipeline?
lac302 is offline   Reply With Quote
Old 02-10-2016, 01:02 PM   #312
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBMap has no SNP-calling functionality, so you'll need to use a variant-caller (samtools, GATK, FreeBayes, etc).
Brian Bushnell is offline   Reply With Quote
Old 02-11-2016, 11:30 AM   #313
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

sam files are an accepted input file type for dedupe.sh correct? Or would you recommend removing duplicates from the fastq files?

Code:
[root@g300-149-b0 bbmap]# ./dedupe.sh Um57.sam Um57_nodups.sam 
Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the argument -Xmx29g (adjusted to roughly 85 percent of physical RAM).
java -Djava.library.path=/app/bbmap/jni/ -ea -Xmx3200m -Xms3200m -cp /app/bbmap/current/ jgi.Dedupe Um57.sam Um57_nodups.sam
Executing jgi.Dedupe [Um57.sam, Um57_nodups.sam]

Initial:
Memory: free=3182m, used=34m

Exception in thread "Thread-3" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
[... identical ClassCastException stack traces from 11 other HashThread workers omitted ...]
Found 0 duplicates.
Finished exact matches.    Time: 0.072 seconds.
Memory: free=2930m, used=286m

Found 0 contained sequences.
Finished containment.      Time: 0.006 seconds.
Memory: free=2712m, used=504m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.001 seconds.
Memory: free=2712m, used=504m

Input:                  	12 reads 		3295 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	0 reads (0.00%) 	0 bases (0.00%)    	0 collisions.
Result:                 	0 reads (0.00%) 	3295 bases (100.00%)

Printed output.            Time: 0.003 seconds.
Memory: free=2695m, used=521m

Time:   			0.089 seconds.
Reads Processed:          12 	0.14k reads/sec
Bases Processed:        3295 	0.04m bases/sec
[root@g300-149-b0 bbmap]
lac302 is offline   Reply With Quote
Old 02-11-2016, 11:34 AM   #314
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Quote:
Originally Posted by lac302 View Post
sam files are an accepted input file type for dedupe.sh correct? Or would you recommend removing duplicates from the fastq files?
Dedupe can't use sam files.

Code:
Input may be fasta or fastq, compressed or uncompressed.
GenoMax is offline   Reply With Quote
Old 02-11-2016, 01:16 PM   #315
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

If you want to deduplicate based on mapping, I recommend converting to bam, sorting, and using some other program (Picard, for example). Or you can use dedupebymapping.sh, which DOES accept sam files. It's much faster than Picard, but it does not fully take advantage of sorting, nor does it write temp files, so it uses a lot of memory (though no more than Dedupe).

Dedupe is for deduplicating by sequence. So, whether you deduplicate the fastq then map, or map and deduplicate the sam, depends on your goal - the results will differ slightly. In practice, it won't usually matter much for paired reads, but single-ended reads are much more impacted by deduplication methodology.
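The distinction can be sketched in Python. This is just the general idea behind mapping-based deduplication (keying on reference, position, and strand), not the actual dedupebymapping.sh or Picard logic:

```python
def dedupe_by_mapping(alignments):
    """Sketch of mapping-based deduplication (the idea, not a real tool's
    implementation): reads whose alignments share reference, position,
    and strand are treated as duplicates; keep the first of each group."""
    seen, kept = set(), []
    for read_id, ref, pos, strand in alignments:
        key = (ref, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append(read_id)
    return kept
```

Unlike sequence-based deduplication, this tolerates sequencing errors in the read bodies, which is why it works better for long, error-prone reads.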

What kind of data are you using, and what's your experiment?
Brian Bushnell is offline   Reply With Quote
Old 02-12-2016, 10:09 AM   #316
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Quote:
Originally Posted by GenoMax View Post
Dedupe can't use sam files.
Perhaps I'm using an older version of bbmap but usage shows:
Code:
Usage:	dedupe.sh in=<file or stdin> out=<file or stdout>

An example of running Dedupe for clustering short reads:
dedupe.sh in=x.fq am ac fo c rnc=f mcs=4 mo=100 s=1 pto cc qin=33 csf=stats.txt pattern=cluster_%.fq dot=graph.dot

Input may be stdin or a fasta, fastq, or sam file, compressed or uncompressed.
Output may be stdout or a file.  With no output parameter, data will be written to stdout.
If 'out=null', there will be no output, but statistics will still be printed.
You can also use 'dedupe <infile> <outfile>' without the 'in=' and 'out='.
Quote:
Originally Posted by Brian Bushnell View Post
If you want to deduplicate based on mapping, I recommend converting to bam, sorting, and using some other program (Picard, for example). Or you can use dedupebymapping.sh which DOES accept sam files. It's much faster than Picard but does not fully take advantage of sorting, or write temp files, so it uses a lot of memory (no more than Dedupe, though).

Dedupe is for deduplicating by sequence. So, whether you deduplicate the fastq then map, or map and deduplicate the sam, depends on your goal - the results will differ slightly. In practice, it won't usually matter much for paired reads, but single-ended reads are much more impacted by deduplication methodology.

What kind of data are you using, and what's your experiment?
WGS, 2x275 PE reads from a Nextera XT library. Looking at novel non-syn SNPS found in treatment group vs WT.

Last edited by lac302; 02-12-2016 at 10:45 AM.
lac302 is offline   Reply With Quote
Old 02-12-2016, 10:36 AM   #317
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by lac302 View Post
Perhaps I'm using an older version of bbmap but usage shows:
[CODE]Usage: dedupe.sh in=<file or stdin> out=<file or stdout>

An example of running Dedupe for clustering short reads:
dedupe.sh in=x.fq am ac fo c rnc=f mcs=4 mo=100 s=1 pto cc qin=33 csf=stats.txt pattern=cluster_%.fq dot=graph.dot

Input may be stdin or a fasta, fastq, or sam file, compressed or uncompressed.
Yep, that would be an older version, before I fixed that comment. Dedupe has never been able to process sam files, but I didn't realize it when I first wrote it, since I had never tried it. The problem is that BBTools Read objects have a single special field that can hold attachments; processing sam input uses that special field to hold the original sam line, but Dedupe needs it for storing other information.

Quote:
WGS, 2x275 PE reads from a Nextera XT library. Looking at novel non-syn SNPS found in treatment group vs WT.
You can deduplicate either the fastqs (by sequence) before mapping, or the sam files (by mapping) afterward. In this case it's probably best to deduplicate by mapping, because 2x275bp reads will have a high error rate, so deduplication by sequence will be less effective and will tend to concentrate reads containing errors.
Brian Bushnell is offline   Reply With Quote
Old 02-12-2016, 10:39 AM   #318
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65
Default

Thanks for the quick reply Brian. I'll download the updated version.
lac302 is offline   Reply With Quote
Old 02-24-2016, 07:41 AM   #319
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Default incomplete removal of singletons by BBNorm

Brian (or others),

I am still a bit disturbed by the inability to remove reads that contain singleton kmers. I understand that BBNorm generates an output file containing all reads whose kmers fall between specific count values, and that the output file may therefore still contain reads with singleton kmers - thus khist runs on BBNorm output files may still show a large number of singletons. However, it seems it should be possible to generate a list of all reads that contain singleton kmers and mask or delete them from the output, thus eliminating any fragment that contains any low-quality kmers. Not being a programmer, is there a reason why this cannot (or should not) be done? It seems to me that eliminating such reads might greatly simplify the graphs during any subsequent assembly.
horvathdp is offline   Reply With Quote
Old 02-24-2016, 09:36 PM   #320
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can remove all the reads with any 1-count kmers using BBNorm if you really want to, by messing with some of the parameters, but that's not really what it's designed for. I should probably make that more straightforward. It's easier to do it like this:

Code:
kcompress.sh in=file.fq out=kmers.fa minprob=0 mincount=1 maxcount=1
bbduk.sh in=file.fq out=filtered.fq ref=kmers.fa k=31 mm=f
However, there will still be singleton kmers. There will always be singleton kmers, because when you remove reads with singleton kmers, that creates new singleton kmers. For example, if a read has 50 1-count kmers and 50 2-count kmers, then after you remove it, you eliminate 50 1-count kmers but generate 50 new 1-count kmers. You can reduce them, but they never go away.
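The cascade can be demonstrated with a toy sketch (plain Python, not BBNorm): removing a read whose singleton kmers overlapped a kept read can drop a shared kmer's count from 2 to 1, creating a new singleton.

```python
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def remove_singleton_reads(reads, k):
    """Toy demonstration (NOT BBNorm): drop every read containing a 1-count
    kmer, then recount. Kmers shared only between a removed read and a kept
    read become new singletons, so singletons never fully disappear."""
    counts = Counter(km for r in reads for km in kmers(r, k))
    kept = [r for r in reads if all(counts[km] > 1 for km in kmers(r, k))]
    new_counts = Counter(km for r in kept for km in kmers(r, k))
    new_singletons = sum(1 for c in new_counts.values() if c == 1)
    return kept, new_singletons
```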
Brian Bushnell is offline   Reply With Quote