Unconfigured Ad
Collapse
X
-
You can do this with either mapping (BBMap/BBSplit) or kmer-filtering (BBDuk/Seal); the result should be the same. BBDuk will be faster though.Originally posted by willd View PostDear Brian,
I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)
Would it be possible to do it in one run along with adapter and quality trimming, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?
Many thanks for any help
Assuming the reads are named "read1.fq" and "read2.fq", and your 299bp sequence is in sequence.fa:
bbduk.sh in=read#.fq out=filtered#.fq ref=sequence.fa k=31
You can adjust the sensitivity/specificity with other flags (like adding hdist=1, to catch the very-low-quality reads) but the command above should be adequate. It's technically possible to do this at the same time as adapter-trimming by adding the sequence to the adapters file; reads of that sequence should end up fully trimmed and thus will be discarded because they are too short (shorter than minlen which is 10 by default, I think). But you would achieve higher sensitivity and specificity by filtering in a different pass, so I'd recommend doing it in 2 steps.
Comment
-
-
Dear Brian - thanks very much!
In case it's of interest: I tried both 1 pass (contaminant filtering at same time as adapter-trimming) and 2 pass (adapter trimming then contaminant filtering) on two of my samples, and the output was close to identical: the number of remaining reads was ~0.4% lower for 1 pass v 2 pass (for both samples); differences in plots of per-base sequence quality, GC curves, etc. were almost unnoticeable.
Comment
-
-
Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?
Code:Executing jgi.BBDukF [in1=../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, in2=../../C467_RE_B00GGOO_8_2_C68T3ACXX.IND1.fastq.gz, out1=B00GGOO_1_TrimPass1.fastq, out2=B00GGOO_2_TrimPass1.fastq, minlen=25, qtrim=r, trimq=10, ktrim=r, k=25, mink=11, ref=my_adapters.fa, hdist=1, tpe, tbo] BBDuk version 35.82 maskMiddle was disabled because useShortKmers=true Initial: Memory: max=777m, free=754m, used=23m Added 64944 kmers; time: 0.212 seconds. Memory: max=777m, free=717m, used=60m Input is being processed as paired Started output streams: 0.176 seconds. java.lang.AssertionError: Error in ../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, line 1210999, with these 4 lines: CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################ @HISEQ10:C68T3ACXX:8:1101:19859:76022/1 at stream.FASTQ.quadToRead(FASTQ.java:722) at stream.FASTQ.toReadList(FASTQ.java:653) at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111) at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96) at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656) at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635) Exception in thread "Thread-19" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-20" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-18" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-23" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-21" Exception in thread "Thread-24" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Exception in thread "Thread-22" java.lang.NullPointerException at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399) Processing time: 3.675 seconds. Input: 603200 reads 60923200 bases. QTrimmed: 67817 reads (11.24%) 3329362 bases (5.46%) KTrimmed: 52402 reads (8.69%) 1521198 bases (2.50%) Trimmed by overlap: 36830 reads (6.11%) 190222 bases (0.31%) Result: 575826 reads (95.46%) 55882418 bases (91.73%) Time: 4.073 seconds. Reads Processed: 603k 148.08k reads/sec Bases Processed: 60923k 14.96m bases/sec Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt. at jgi.BBDukF.process(BBDukF.java:822) at jgi.BBDukF.main(BBDukF.java:70)
Comment
-
-
It looks like the fastq file is corrupt and is missing a line, or something similar.Originally posted by willd View PostDear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?
should be more likeCode:CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################ @HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
Try looking at the file in that region, by piping zcat to grep around the term "@HISEQ10:C68T3ACXX:8:1101:19859:76022/1" with plus and minus 10 lines, then posting the output.Code:@HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
Actually, "#ID Median_fold" should not be in there at all. It looks like you somehow got unrelated text mixed into your fastq file, which is what is corrupting it.
Comment
-
-
Dear Brian, indeed there seems to be a "@HISEQ....." line missing in this region of the file, and I found that this has happened in at least one other part of the file as well. I guess I will check with the platform who generated this if they have another version of the file, though the md5sum checks out so I guess this sample might be lost if we can't trust it.
Anyway... NO PROBLEM WITH YOUR EXCELLENT BBDuk! (sorry to have suggested that, and thanks again for your help).
(PS: the "#ID Median_fold" that you mentioned was not in my output)Code:@HISEQ10:C68T3ACXX:8:1101:20187:76019/1 ATTTATTTATTTTATTATTTTTTTGAAACATTTTTACATGTTTATTTATCAAAAGTGAAAAAGCATACAATTCAACACAATTTCAATAGCATTTGACACCA + CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJFHIJJJJJIGGIJJHHHHGHEDFFDEEEEEEEEDCDEDDEEDDDBB @HISEQ10:C68T3ACXX:8:1101:20097:76043/1 GGTTAGTATAGCTTAGTTAAACTTTCGTTTATTGCTAAAGGTTAATCACTGCTGTTTCCCGTGGGGGTGTGGCTAGGCTAAGCGTTTTGAGCTGCATTGCT + @@?DDFADFHHHHJIJHGIJJIJJIJIFHIJGJIAC,=?B01:19795:76016/1 CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA + CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################ @HISEQ10:C68T3ACXX:8:1101:19859:76022/1 GTACATTTAACCCAGTTTAGTGGCAAGTTCTTTAGCCTTTGCCTTTTCGAGCTTGGCGATACGAGCCACAGACTTAGGACCCAGGACGTTGCCACCCCAGT + C@BFFFFFHHHHHJJHIJJIJIJIJJJHIIIIJIJIJJJJJJIIIJJJJIJIIJJJJJJIJIJHHHFFFEEEEEDCCCCDDDDDDDDDDDDDDDDDDDDD> @HISEQ10:C68T3ACXX:8:1101:19768:76027/1 ACACGCTGGACTCCCCTGGCCATTTCTGGGTTATATAAATGTTTGATGAAAAATCTCTCAGGGGAAATCAAATTTTTGTAAGAAGTTAATGACAGCCCATC + @@CFFFFFHHHHHJJJJJJDIJJJJJJIIJDHIJIIFIIJJGHIJIJJGIFIBHJJJJJHIIJJGHFHHFFFFFF;CDEEEDDDDCCDEEDDDDDDDDDDC @HISEQ10:C68T3ACXX:8:1101:19877:76028/1 AGCCATCCAACCACTCAGTCTTGGCAGTGCAGATGAAAAACTGGGAACCATTTGTGTTGGATCGGAAGAGCACACGACTGAACTCAAGTCACATCACGATC + ;;?DDDDDFFFAFD3C@:<+C<CBEEG@9CFFCAFGI9DD)?F<2)9?1D3??4).8)==C@)'5C1=263?;@@<',;8:@B(5(5>>:,>AABBB####
Comment
-
-
Filtering different contaminants with different k-mer sizes
Dear Brian,
I use BBDuk for filtering reads based on quality, adapter content, and chloroplast content. It is a great tool!
For some applications, I also want to filter out reads that contain a restriction enzyme motif (a sequence of length 4-8).
Currently, this leads to up to four BBduk jobs that are connected by a pipe:
My question is:Code:bbduk.sh in=in_R1.fastq in2=in_R2.fastq removeifeitherbad=t out=stdout.fq ref=restriction_motifs.fa k=4 rcomp=t maskmiddle=f qout=33 | bbduk.sh in=stdin.fq int=t removeifeitherbad=t out=stdout.fq minlength=30 qout=33 ref=adapters.fa hammingdistance=2 k=20 mink=10 hammingdistance2=1 ktrim=n maxns=0 maxbadkmers=0 rcomp=t | bbduk.sh in=stdin.fq int=t removeifeitherbad=t out=stdout.fq ref=chloroplast.fa k=31 rcomp=t qout=33 | bbduk.sh in=stdin.fq int=t out1=out_R1.fastq removeifeitherbad=t out2=out_R2.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33
Can I condense these four jobs into less (ideally one) jobs?
I assume that if a FASTA entry in the file given with "ref=" is smaller than the k-mer size, it is effectively not used for the filtering.
Therefore, the main problem that I see is to define different k-mer sizes for different references (e.g. 4 for the restriction motif, and 31 for the chloroplast).
Thanks!
Chris
Comment
-
-
Hi Chris,
You can merge the last two stages:
And, you don't actually need to set "qout=33 rcomp=t removeifeitherbad=t" since those are all defaults, so the last two stages shorten to:Code:bbduk.sh in=stdin.fq int=t out=out_R#.fastq removeifeitherbad=t minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33 ref=chloroplast.fa k=31 rcomp=t
Unfortunately, you cannot combine stages 1, 2, and 3, as they each need different kmer lengths or hamming distances. So the best way to do it is in a pipeline. Your assumption about ignoring sequences shorter than kmer length is correct, but unfortunately it does not help in this situation.Code:bbduk.sh in=stdin.fq int=t out=out_R#.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 ref=chloroplast.fa k=31
Note that (depending on the environment) BBDuk will try to grab around half of available memory for each process, which may lead to unpredictable results. I suggest using the -Xmx flag in each stage to tell it how much memory to use (-Xmx1g should be fine in each case) whenever piping BBDuk.
The way you accomplished variable-kmer-length adapter-filtering via the "ktrim=n" and "maxns=0" combo was clever, by the way!
Comment
-
-
Some pairs breaking after multiple steps
I've been doing some multi-step trimming of paired libraries that has been causing a low frequency of pairs breaking, and I'm trying to figure out if there's a way to avoid having this happen.
These libraries have a structure where read1 contains exclusively sample barcoding information, and read2 contains the sequences that will eventually be aligned. Ideally, I would like to preserve every base of read1, but on read2, trim several bases from the 5' of the read, as well as do some kmer and quality trimming on read2.
my solution was to do this in two steps, first doing the 5' trim on just read2, then doing the kmer trim with the skipr1 flag active. (I found out the hard way that the skipr1 flag doesn't skip ftl operations)
ie
However, even though there's the same number of reads before and after 5' trimming, and the same number of reads in the final trimmed reads, some of the pairing breaks at a low frequency (I haven't picked out all the broken pairs yet, but the head and tail of the file have the same read names, and the first broken pair is ~2000 reads into the files). Is there a way to run this sort of trimming on only read2 while keeping the pairs properly mated? My other alternative would be to do the trimming on just read2, and then write a custom python script to pull out the correct ordered reads in the read1 fastq.Code:bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
Thanks
Comment
-
-
BBDuk is multithreaded, and will reorder reads unless you explicitly tell it not to via the "ordered" flag. So in this case, you could do this:
It does not matter whether the reads are re-ordered in the second pass, because pairs will be kept together; but when processing just one of the files alone, it's important to retain the original order, and not discard any reads (which is the reason for the minlength=1 flag; the default is 10).Code:bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz [U][B]ordered minlength=1[/B][/U] bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
It's possible to repair the pairing with "repair.sh" but of course it's always most efficient to just avoid the problem in the first place.
Comment
-
-
Hi, I'm trying to trim fasta reads w/ quals but get this error.
Thanks,
Pat
~pminx/bin/bbmap/bbduk.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
error:
java -Djava.library.path=/gscuser/pminx/bin/bbmap/jni/ -ea -Xmx991m -Xms991m -cp /gscuser/pminx/bin/bbmap/current/ jgi.BBDukF in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
Executing jgi.BBDukF [in=Cat_BES_325_500.fasta, qfin=Cat_BES_325_500.qual, out=Cat_BES_325_500_trim.fa, qfout=Cat_BES_325_500_trim.qual, ftl=125, ftr=125]
BBDuk version 36.30
Exception in thread "main" java.lang.RuntimeException: Unknown parameter qfout=Cat_BES_325_500_trim.qual
at jgi.BBDukF.<init>(BBDukF.java:421)
at jgi.BBDukF.main(BBDukF.java:66)
Comment
-
-
Hi Pat,
My apologies, I somehow forgot to include support for qual file output with BBDuk! And, reformat does not support ftl/ftr. I will fix BBDuk's support for qual files in the next release. What you can do now is this (and apologies for it being tedious):
reformat.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=x.fq
bbduk.sh in=x.fq out=y.fq ftl=125 ftr=125
reformat.sh in=y.fq out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual
However, please note that "ftl=125 ftr=125" will trim everything to the left of position 125, and everything to the right of position 125, leaving you with only 1bp reads, which are not very useful. What are you trying to accomplish with this command?
Comment
-
Latest Articles
Collapse
-
by SEQadmin2
Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.
The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
...-
Channel: Articles
06-02-2026, 10:05 AM -
-
by SEQadmin2
With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.
Introduction
Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...-
Channel: Articles
05-22-2026, 06:42 AM -
ad_right_rmr
Collapse
News
Collapse
| Topics | Statistics | Last Post | ||
|---|---|---|---|---|
|
Started by SEQadmin2, 06-05-2026, 10:09 AM
|
0 responses
13 views
0 reactions
|
Last Post
by SEQadmin2
06-05-2026, 10:09 AM
|
||
|
Started by SEQadmin2, 06-04-2026, 08:59 AM
|
0 responses
24 views
0 reactions
|
Last Post
by SEQadmin2
06-04-2026, 08:59 AM
|
||
|
Started by SEQadmin2, 06-02-2026, 12:03 PM
|
0 responses
28 views
0 reactions
|
Last Post
by SEQadmin2
06-02-2026, 12:03 PM
|
||
|
Started by SEQadmin2, 06-02-2026, 11:40 AM
|
0 responses
22 views
0 reactions
|
Last Post
by SEQadmin2
06-02-2026, 11:40 AM
|
Comment