View Full Version : Introducing RemoveHuman: Human Contaminant Removal

Brian Bushnell
04-17-2014, 01:25 PM
Contaminated DNA leads to lower-quality results, no matter what study you're doing. One of the most, if not the most, common contaminants in DNA sequencing is human. If you are doing human research, then there's not much you can do about it, but for everyone researching non-vertebrate organisms, read on -

JGI processes mainly plants, fungi, and microbes; never vertebrates. In an attempt to create the cleanest assemblies possible, I wrote a pipeline for removing human contamination. My first naive implementation mapped data to the human reference, and kept only the unmapped reads. However, this caused problems:

1) The human reference (HG19) appears to contain some contaminant sequences from other organisms.
2) Certain features, particularly ribosomes, are so highly conserved at the nucleotide level that humans have 100% identity with even plants and fungi over ~100bp window.
3) There are many low-complexity sequences like ACACACACACTCTCTCTC... that share near-100% identity with many other organisms.

Therefore, this approach yielded false positives, which could result in worse assemblies (because they would break at the places homologous to human). Plus, they'd be missing important features.

So, I created a masked version of HG19 to prevent false positives when removing human contamination. It was masked in 3 ways, using a program I developed called BBMask (sourceforge.net/projects/bbmap/):

1) Areas containing multiple repeat short kmers.
2) Windows of low entropy, calculated using pentamer frequencies.
3) Via mapping from sam files:
The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit.

This masked a total of under 1.4% of the genome, implying that the remaining sequence should still capture 98.4% of human reads.

To test if it works, I shredded (to 80bp) the wild rice and drosophila references (neither included in my screening) and mapped them to masked human with BBMap, with low sensitivity settings.

1 drosophila shred mapped at 92.5% identity, and 2 wild rice shreds mapped with 93.75% identity. After increasing the shred length to 90, or min identity to 95%, nothing mapped. So I have increased the minimum percent identity to 95%; for reads of 150bp or longer, I don't expect any false positives on non-animals, and probably none on invertebrates.

So, this is the final command line:
bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/hg19masked/ qtrim=rl trimq=10 untrim -Xmx23g in=reads.fq outu=clean.fq outm=human.fq

You first have to index the reference, like this:
bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz -Xmx23g

The masked reference (880 MB) is available here:

I hope this will help others! Feedback is welcome. BBMask is very flexible, too, if you want to do your own masking on other organisms.

04-18-2014, 01:33 AM
Hi Brian,
That looks very interesting.
Did you do any test with Phix.
We routinely remove Phix read by aligning all reads the the reference.
Obviously Phix is very small so the chance of false positive is much smaller but I've not seen this tested.
Especially when using long read and more sensitive aligner like bwa mem.


Brian Bushnell
04-18-2014, 10:20 AM
I use BBDuk to remove phiX, with 31mers and a max hamming distance of 1. phiX is so tiny that I don't expect any false positives - I have never found any in my synthetic tests with those settings. Obviously, it might be a different matter if you were specifically studying viruses.

bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=phix.fasta k=31 hdist=1

You could of course use the same procedure I used with human to mask phix, and map to it rather than doing kmer-based removal, but that's probably a waste of time.

07-03-2014, 12:05 PM
Hi Brian,

Would you recommend to use BBDuk for pre-filtering reads from microbiome samples? I thought about cleaning out phiX contaminants as described above, followed by extracting 16S rDNA reads by mapping against the qiime rep_set (~ 400k rDNAs) and retrieving matching reads via outmatch=xxx, hdist=0, k=31.


Brian Bushnell
07-03-2014, 12:41 PM

BBDuk is great for removal of known sequences. As for specifically isolating rDNA, I expect it would catch most of it, but I'm not sure about the highly variable regions (it depends on how long and how variable they are). If the insert size is long enough so that at least one of the reads is usually expected to be in a highly conserved region then it should work fine.

I would suggest filtering as you describe above, but also catching the non-matching reads with "outu=xxx". Then you can map those against the database with a low identity requirement and see if anything was missed.

BBMap also supports the "outm" and "outu" flags. In this case I don't know whether mapping or kmer-filtering would do better. You could even do both, first running BBDuk then mapping the leftovers and merging the resultant files:

bbduk.sh in=reads.fq outm=matched.fq outu=unmatched.fq ref=ribo.fa k=31

bbmap.sh in=unmatched.fq ref=ribo.fa nodisk outm=matched2.fq outu=unmatched2.fq maxindel=20 minid=0.7

cat matched.fq matched2.fq > combined.fq

I guess it depends on whether you are more worried about false positives or false negatives. Oh, and mapping might be incredibly slow on that kind of reference where a read can align to 400k different places equally well.

07-03-2014, 12:59 PM
the phiX filtering step worked well. It reported 5% phiX content, which is pretty close to our requested 4% spike for this run.

The second step mapped 99.7% of the filtered reads to 16S, which would be fantastic.


11-22-2015, 01:53 AM
I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)

11-22-2015, 04:08 AM
I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)

The "ref" folder stores the indexes in its own unique way in "genome" and "index" directories. You can't just look in those folder to see what is there nor depend on the file names you see there.

For example in hg19 BBMap index there are these files on my system:


Have you tried doing a masking run and found results only from chromosome 7?

08-15-2016, 09:52 AM
Hi Brian,

I am looking for a good way to ensure all mito- and chloro-genome reads are removed from my read data prior to a de novo assembly. Any chance that you have already generated a masked reference file for the genomes from plant mitochondria or chloroplasts?


08-15-2016, 09:59 AM
You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

I assume you have seen BBsplit (http://seqanswers.com/forums/showthread.php?t=41288), which can be used for this purpose.

Description: Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads.
By default this program will mask using entropy with a window=80 and entropy=0.75

Usage: bbmask.sh in=<file> out=<file> sam=<file,file,...file>

Input may be stdin or a fasta or fastq file, raw or gzipped.[

window=80 (w) Window size for entropy calculation.
entropy=0.70 (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.

Brian Bushnell
08-15-2016, 10:27 AM
And, sorry, but nope - I have not. Chloroplasts are one thing (and I don't know much about them), but mito is another... there are a lot of genes that can move back and forth between mito and the host organism.

I find the best way to handle this is de-novo assembly chloroplast and mitochondria independently based on depth-filtering, then you can pull out the reads mapping to those asemblies prior to main-genome assembly. Both should be present at much higher depth than the rest of the genome.

For assembling mitochondria from fungal genomes I wrote a script like this:

kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))
lower=$((mitopeak * 3 / 7))
mcs=$((mitopeak * 3 / 4))
mincov=$((mitopeak * 2 / 3))

tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100

bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

ln -s contigs_bbd.fa contigs.fa

Note that this is for 1050bp reads; for shorter ones you may need shorter kmers. And you may need to adjust the GC cutoff for chloroplasts as well. Also, fungi are simpler as they are haploid or diploid and only have mitochondria rather than chloroplasts also.

Brian Bushnell
12-07-2016, 04:30 PM
I've uploaded a few more files to the Google drive:


Those are masked versions of the cat, dog, and mouse genomes. I also added two of bacteria:


Those are common contaminant microbes that we encounter in sequencing. For Eukaryotic genomes, I suggest mapping against fusedEPmasked2, in which the bacteria are masked for entropy and plastids (e.g. chloroplast) only. The other one (fusedERPBBmasked2) is intended for Prokaryotic assembly and is masked for conserved regions in bacteria, including ribosomes. If you want to use it for filtering out common laboratory/human/reagent-associated microbes, it's useful to ensure that your bacteria of interest is not on the list :) If you know what organism you are sequencing, you can use the tool "filterbytaxa.sh" to create a filtered version that file after removing all sequences from organisms in the same family, like this:

filterbytaxa.sh in=fusedERPBBmasked2.fa.gz out=taxfiltered.fa.gz include=f ids=1234 tree=tree.taxtree.gz

...where "1234" is the NCBI ID of the organism and tree.taxtree.gz is made from NCBI's taxdump like this:

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
taxtree.sh names.dmp nodes.dmp tree.taxtree.gz

12-08-2016, 05:53 AM
Hi Brian,

First, thanks for your continuous effort.

Those are common contaminant microbes that we encounter in sequencing

Just for curiosity, How do you (or the center) compiled this list of common bacteria contaminants?

Thanks again

Brian Bushnell
12-08-2016, 08:56 AM
Hi Vicente,

Every project we sequence gets BLASTed to nt/nr and various RefSeq databases. All of the non-target hits are tracked. Then, a few months ago, someone manually went through and examined the non-target hits to make a list of the ones that commonly occurred.

Then, I took the list and expanded it slightly to include other microbes which have very high identity to the microbes on the list. For example, E.coli was detected as a common contaminant; but Shigella and Klebsiella are 100% identical to E.coli over large portions of the genome (they are basically strains of E.coli), meaning there is no way to ensure a Shigella library is uncontaminated, for example; and it's difficult to ensure that our BLAST hits to E.coli were not, in fact, Shigella or Klebsiella. So, the final list is 35 microbes plus Lambda phage which is a contaminant in one of our reagents (and shares sequence with E.coli so they are hard to distinguish), but many of them are just different strains (3 strains of Pseudomonas fluorescens, for example). They are generally either human-associated (like E.coli) or associated with laboratories or reagents (like, again, E.coli).

07-14-2017, 02:19 PM
Hi Brian,

is it appropriate to use the bacterial contaminant files, fusedEPmasked2.fa.gz
and/or fusedERPBBmasked2.fa.gz, to filter soil metagenomics reads?

Thank you for your very useful tools and discussions!


07-14-2017, 02:46 PM
I would caution against blindly filtering your metagenomic data against any database of contaminants. Ideally, you would have negative controls run alongside your samples that could be checked for the presence of these contaminants instead....

Brian Bushnell
07-14-2017, 04:01 PM
I won't say either one of you is right or wrong, and negative controls are always a good idea. But, the reason I put together the bacterial contaminant file is because JGI is not capable of distinguishing between actual samples, and contaminants in that file. With sufficient amplification (like single cells), some wells may have high levels of a contaminant that is zero in other wells, since it only takes one particle. Some of them, like Pseudomonas, are present in reagents. Others, like E.coli, are present on human skin, and often make their way into the libraries.

I have seen dozens of posters that incorrectly claim Pseudomonas or various other common contaminants are endemic to some environment. But it's likely artifact of poor quality control. So, I encourage you to be very cautious.

*Edit. For reference, JGI no longer sequences anything on that list.

07-15-2017, 07:15 AM
Thank you, fanli and Brian! I guess one possibility is not to filter initially, but then check the final assembly against the contaminant files just to find out if some of the species that I am detecting are known contaminants?

Brian Bushnell
07-17-2017, 08:56 AM
That's certainly a good possibility. If you have zero or trivial amounts of common contaminants, don't bother filtering. If you have a lot... then, try to figure out whether you have an exact strain match, which greatly boosts the likelihood of it originating in a reagent.

07-17-2017, 12:45 PM
Thanks a lot, Brian! I'll give it a try.

yang zhang
08-27-2017, 11:23 AM
Hi Brain,

This might be a silly question, but do we need to index the reference every time for every sample?


08-28-2017, 03:10 AM
Hi Brain,

This might be a silly question, but do we need to index the reference every time for every sample?


No. You can create an index of the reference up-front by doing bbmap.sh in=ref.fa that will create a "ref" directory, which will contain all index files. Do not worry about the contents of the folder since they are arranged in a way BBMap requires them.

In future when you want to use this index replace "ref=" with "path=/path_to_directory_containing_ref_folder" in your command line.

yang zhang
08-28-2017, 07:53 AM
I see. Thank you for the clarification, GenoMax!:)

08-11-2018, 08:18 PM
Hi Brian,

Just reviving an old thread here. I have been testing out a lot of different methods to clean human reads and I really love BBMap because it's such a well thought-out program. However, when I try to clean human reads with the settings you have specified, I routinely get a ton of reads remaining - upwards of 70% (so only 30% are cleaned). I have tried to adjust the various parameters, but the only thing that seems to make a difference for depletion is the 'minid' setting. Setting that at 0.50 (which is *very* low) depletes around 95% of reads. As a comparison, a default run with bwa mem depletes 100%.

Any idea how I might get BBMap to more accurately deplete human reads?

08-12-2018, 05:01 AM
Have you tried using "bbsplit.sh" with human genome to see if that works better. If you are interested in non-human data then I would use the non-masked genome and risk losing a few additional reads.

08-12-2018, 08:33 AM
I have not - however, bbsplit doesn't really seem to be the right tool for removing human reads?

08-12-2018, 03:27 PM
"bbsplit.sh" is a general purpose tool that will bin reads into any number of bins (depending on the reference sequences provided, you can provide as many as you want). In this case you would provide human_genome.fa (and any other reference you want to use). If you only use human then reads not mapping to human genome will be collected in other bin.

06-11-2019, 06:54 PM
Hello Brian,

I am trying to use removehuman.sh on MSU HPCC.

Inputs (*_filtered.fastq.gz files) are phix filtered R1 and R2 files using BBduk as following: [leejooy5@dev-intel18 filtered_reads]$ bbduk.sh -Xmx10g in1=NFW_R1_trimmed.fastq.gz in2=NFW_R2_trimmed.fastq.gz out1=NFW_R1_filtered.fastq.gz out2=NFW_R2_filtered.fastq.gz ref=/opt/software/BBMap/37.93-foss-2018a/resources/phix174_ill.ref.fa.gz k=31 hdist=1 stats=GR25_stats.txt threadS=8)

As shown in below, error message "Exception in thread "main" java.lang.RuntimeException: Can't find file /global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt
" popped up when I tried to run "removehuman.sh". I tried without additional parameters such as -Xmx and threads, but same error happened. Also, I tried to find the find the
file "/global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt", but I couldn't. Could you tell me what mistake I did or let me know where I can find a solution? Thank you for your time and consideration.


[leejooy5@dev-intel18 filtered_reads]$ removehuman.sh -Xmx10g in1=NFW_R1_filtered.fastq.gz in2=NFW_R2_filtered.fastq.gz out1=NFW_R1_clean.fastq.gz out2=NFW_R2_clean.fastq.gz threads=8

removehuman.sh -Xmx10g in1=NFW_R1_filtered.fastq.gz in2=NFW_R2_filtered.fastq.gz out1=NFW_R1_clean.fastq.gz out2=NFW_R2_clean.fastq.gz threads=8
java -Djava.library.path=/opt/software/BBMap/37.93-foss-2018a/jni/ -ea -Xmx10g -cp /opt/software/BBMap/37.93-foss-2018a/current/ align2.BBMap minratio=0.9 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/global/projectb/sandbox/gaag/bbtools/hg19 pigz unpigz zl=6 qtrim=r trimq=10 untrim idtag usemodulo printunmappedcount usejni ztd=2 kfilter=25 maxsites=1 k=14 -Xmx10g in1=NFW_R1_filtered.fastq.gz in2=NFW_R2_filtered.fastq.gz out1=NFW_R1_clean.fastq.gz out2=NFW_R2_clean.fastq.gz threads=8
Executing align2.BBMap [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, minratio=0.9, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, path=/global/projectb/sandbox/gaag/bbtools/hg19, pigz, unpigz, zl=6, qtrim=r, trimq=10, untrim, idtag, usemodulo, printunmappedcount, usejni, ztd=2, kfilter=25, maxsites=1, k=14, -Xmx10g, in1=NFW_R1_filtered.fastq.gz, in2=NFW_R2_filtered.fastq.gz, out1=NFW_R1_clean.fastq.gz, out2=NFW_R2_clean.fastq.gz, threads=8]
Version 37.93 [tipsearch=20, maxindel=80, minhits=2, bwr=0.18, bw=40, minratio=0.65, midpad=150, minscaf=50, quickmatch=t, rescuemismatches=15, rescuedist=800, maxsites=3, maxsites2=100, minratio=0.9, maxindel=3, bwr=0.16, bw=12, quickmatch, minhits=2, path=/global/projectb/sandbox/gaag/bbtools/hg19, pigz, unpigz, zl=6, qtrim=r, trimq=10, untrim, idtag, usemodulo, printunmappedcount, usejni, ztd=2, kfilter=25, maxsites=1, k=14, -Xmx10g, in1=NFW_R1_filtered.fastq.gz, in2=NFW_R2_filtered.fastq.gz, out1=NFW_R1_clean.fastq.gz, out2=NFW_R2_clean.fastq.gz, threads=8]

Set threads to 8
Retaining first best site only for ambiguous mappings.
Exception in thread "main" java.lang.RuntimeException: Can't find file /global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt
at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:906)
at fileIO.ReadWrite.getInputStream(ReadWrite.java:871)
at fileIO.TextFile.open(TextFile.java:227)
at fileIO.TextFile.<init>(TextFile.java:71)
at dna.Data.setGenome2(Data.java:822)
at dna.Data.setGenome(Data.java:768)
at align2.BBMap.loadIndex(BBMap.java:313)
at align2.BBMap.main(BBMap.java:32)

06-12-2019, 07:13 AM
@jylee: "/global/projectb/sandbox/gaag/bbtools/hg19/ref/genome/1/summary.txt" appears to refer to a location on JGI servers (if that is not your own). You will need to download and provide hg19 reference sequence. You can pre-index the genome with BBMap to use with path= or use ref= option to point to the genome sequence multi-fasta file location.