SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Contaminant in dscDNA sample? vanillasky Sample Prep / Library Generation 2 06-19-2014 10:17 PM
Heuristic search for contaminant motifs k-gun12 Bioinformatics 0 03-14-2013 09:50 AM
rRNA contaminant screening aligning small RNA honey Bioinformatics 0 05-31-2012 07:32 PM
18bp DNA contaminant in ChIP input sample? sinapius Sample Prep / Library Generation 0 02-09-2011 11:39 AM
BWA and screening illumina reads for contaminant jmartin Bioinformatics 3 04-22-2010 03:55 PM

Reply
 
Thread Tools
Old 04-17-2014, 01:25 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default Introducing RemoveHuman: Human Contaminant Removal

Contaminated DNA leads to lower-quality results, no matter what study you're doing. One of the most, if not the most, common contaminants in DNA sequencing is human. If you are doing human research, then there's not much you can do about it, but for everyone researching non-vertebrate organisms, read on -

JGI processes mainly plants, fungi, and microbes; never vertebrates. In an attempt to create the cleanest assemblies possible, I wrote a pipeline for removing human contamination. My first naive implementation mapped data to the human reference, and kept only the unmapped reads. However, this caused problems:

1) The human reference (HG19) appears to contain some contaminant sequences from other organisms.
2) Certain features, particularly ribosomes, are so highly conserved at the nucleotide level that humans have 100% identity with even plants and fungi over ~100bp window.
3) There are many low-complexity sequences like ACACACACACTCTCTCTC... that share near-100% identity with many other organisms.

Therefore, this approach yielded false positives, which could result in worse assemblies (because they would break at the places homologous to human). Plus, they'd be missing important features.

So, I created a masked version of HG19 to prevent false positives when removing human contamination. It was masked in 3 ways, using a program I developed called BBMask:

1) Areas containing multiple repeat short kmers.
2) Windows of low entropy, calculated using pentamer frequencies.
3) Via mapping from sam files:
The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit.

This masked a total of under 1.4% of the genome, implying that the remaining sequence should still capture 98.4% of human reads.

To test if it works, I shredded (to 80bp) the wild rice and drosophila references (neither included in my screening) and mapped them to masked human with BBMap, with low sensitivity settings.

1 drosophila shred mapped at 92.5% identity, and 2 wild rice shreds mapped with 93.75% identity. After increasing the shred length to 90, or min identity to 95%, nothing mapped. So I have increased the minimum percent identity to 95%; for reads of 150bp or longer, I don't expect any false positives on non-animals, and probably none on invertebrates.

So, this is the final command line:
bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/hg19masked/ qtrim=rl trimq=10 untrim -Xmx23g in=reads.fq outu=clean.fq outm=human.fq

You first have to index the reference, like this:
bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz -Xmx23g

The masked reference (880 MB) is available here:
https://drive.google.com/file/d/0B3l...it?usp=sharing

I hope this will help others! Feedback is welcome. BBMask is very flexible, too, if you want to do your own masking on other organisms.

Last edited by Brian Bushnell; 10-29-2015 at 08:37 PM.
Brian Bushnell is offline   Reply With Quote
Old 04-18-2014, 01:33 AM   #2
tcezard
Member
 
Location: Edinburgh

Join Date: Dec 2008
Posts: 13
Default

Hi Brian,
That looks very interesting.
Did you do any test with Phix.
We routinely remove Phix read by aligning all reads the the reference.
Obviously Phix is very small so the chance of false positive is much smaller but I've not seen this tested.
Especially when using long read and more sensitive aligner like bwa mem.

Cheers
tcezard is offline   Reply With Quote
Old 04-18-2014, 10:20 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

I use BBDuk to remove phiX, with 31mers and a max hamming distance of 1. phiX is so tiny that I don't expect any false positives - I have never found any in my synthetic tests with those settings. Obviously, it might be a different matter if you were specifically studying viruses.

bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=phix.fasta k=31 hdist=1

You could of course use the same procedure I used with human to mask phix, and map to it rather than doing kmer-based removal, but that's probably a waste of time.
Brian Bushnell is offline   Reply With Quote
Old 07-03-2014, 12:05 PM   #4
muol
Member
 
Location: USA

Join Date: Jun 2012
Posts: 10
Default

Hi Brian,

Would you recommend to use BBDuk for pre-filtering reads from microbiome samples? I thought about cleaning out phiX contaminants as described above, followed by extracting 16S rDNA reads by mapping against the qiime rep_set (~ 400k rDNAs) and retrieving matching reads via outmatch=xxx, hdist=0, k=31.

Thanks
Olaf
muol is offline   Reply With Quote
Old 07-03-2014, 12:41 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

Olaf,

BBDuk is great for removal of known sequences. As for specifically isolating rDNA, I expect it would catch most of it, but I'm not sure about the highly variable regions (it depends on how long and how variable they are). If the insert size is long enough so that at least one of the reads is usually expected to be in a highly conserved region then it should work fine.

I would suggest filtering as you describe above, but also catching the non-matching reads with "outu=xxx". Then you can map those against the database with a low identity requirement and see if anything was missed.

BBMap also supports the "outm" and "outu" flags. In this case I don't know whether mapping or kmer-filtering would do better. You could even do both, first running BBDuk then mapping the leftovers and merging the resultant files:

bbduk.sh in=reads.fq outm=matched.fq outu=unmatched.fq ref=ribo.fa k=31

bbmap.sh in=unmatched.fq ref=ribo.fa nodisk outm=matched2.fq outu=unmatched2.fq maxindel=20 minid=0.7

cat matched.fq matched2.fq > combined.fq


I guess it depends on whether you are more worried about false positives or false negatives. Oh, and mapping might be incredibly slow on that kind of reference where a read can align to 400k different places equally well.
Brian Bushnell is offline   Reply With Quote
Old 07-03-2014, 12:59 PM   #6
muol
Member
 
Location: USA

Join Date: Jun 2012
Posts: 10
Default

Thanks,
the phiX filtering step worked well. It reported 5% phiX content, which is pretty close to our requested 4% spike for this run.

The second step mapped 99.7% of the filtered reads to 16S, which would be fantastic.

Olaf
muol is offline   Reply With Quote
Old 11-22-2015, 01:53 AM   #7
Naarkhoo
Member
 
Location: Unja

Join Date: Jan 2013
Posts: 11
Default indexing the reference

I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)
Naarkhoo is offline   Reply With Quote
Old 11-22-2015, 04:08 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,380
Default

Quote:
Originally Posted by Naarkhoo View Post
I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)
The "ref" folder stores the indexes in its own unique way in "genome" and "index" directories. You can't just look in those folder to see what is there nor depend on the file names you see there.

For example in hg19 BBMap index there are these files on my system:

Code:
chr1-3_index_k13_c2_b1.block
chr1-3_index_k13_c2_b1.block2.gz
chr4-7_index_k13_c2_b1.block
chr4-7_index_k13_c2_b1.block2.gz
Have you tried doing a masking run and found results only from chromosome 7?
GenoMax is offline   Reply With Quote
Old 08-15-2016, 09:52 AM   #9
robalba1
Junior Member
 
Location: St Louis

Join Date: May 2012
Posts: 4
Default Masked ref for plant organelles?

Hi Brian,

I am looking for a good way to ensure all mito- and chloro-genome reads are removed from my read data prior to a de novo assembly. Any chance that you have already generated a masked reference file for the genomes from plant mitochondria or chloroplasts?

Rob
robalba1 is offline   Reply With Quote
Old 08-15-2016, 09:59 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,380
Default

You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

I assume you have seen BBsplit, which can be used for this purpose.

Quote:
Description: Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads.
By default this program will mask using entropy with a window=80 and entropy=0.75

Usage: bbmask.sh in=<file> out=<file> sam=<file,file,...file>

Input may be stdin or a fasta or fastq file, raw or gzipped.[

window=80 (w) Window size for entropy calculation.
entropy=0.70 (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.

Last edited by GenoMax; 08-15-2016 at 10:05 AM.
GenoMax is offline   Reply With Quote
Old 08-15-2016, 10:27 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

And, sorry, but nope - I have not. Chloroplasts are one thing (and I don't know much about them), but mito is another... there are a lot of genes that can move back and forth between mito and the host organism.

I find the best way to handle this is de-novo assembly chloroplast and mitochondria independently based on depth-filtering, then you can pull out the reads mapping to those asemblies prior to main-genome assembly. Both should be present at much higher depth than the rest of the genome.

For assembling mitochondria from fungal genomes I wrote a script like this:

Code:
kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))
lower=$((mitopeak * 3 / 7))
mcs=$((mitopeak * 3 / 4))
mincov=$((mitopeak * 2 / 3))

tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100

bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

ln -s contigs_bbd.fa contigs.fa
Note that this is for 1050bp reads; for shorter ones you may need shorter kmers. And you may need to adjust the GC cutoff for chloroplasts as well. Also, fungi are simpler as they are haploid or diploid and only have mitochondria rather than chloroplasts also.

Last edited by Brian Bushnell; 08-15-2016 at 10:30 AM.
Brian Bushnell is offline   Reply With Quote
Old 12-07-2016, 04:30 PM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

I've uploaded a few more files to the Google drive:

https://drive.google.com/file/d/0B3l...ew?usp=sharing
https://drive.google.com/file/d/0B3l...ew?usp=sharing
https://drive.google.com/file/d/0B3l...ew?usp=sharing
https://drive.google.com/file/d/0B3l...ew?usp=sharing
https://drive.google.com/file/d/0B3l...ew?usp=sharing

Those are masked versions of the cat, dog, and mouse genomes. I also added two of bacteria:

fusedEPmasked2.fa.gz
fusedERPBBmasked2.fa.gz

Those are common contaminant microbes that we encounter in sequencing. For Eukaryotic genomes, I suggest mapping against fusedEPmasked2, in which the bacteria are masked for entropy and plastids (e.g. chloroplast) only. The other one (fusedERPBBmasked2) is intended for Prokaryotic assembly and is masked for conserved regions in bacteria, including ribosomes. If you want to use it for filtering out common laboratory/human/reagent-associated microbes, it's useful to ensure that your bacteria of interest is not on the list If you know what organism you are sequencing, you can use the tool "filterbytaxa.sh" to create a filtered version that file after removing all sequences from organisms in the same family, like this:

Code:
filterbytaxa.sh in=fusedERPBBmasked2.fa.gz out=taxfiltered.fa.gz include=f ids=1234 tree=tree.taxtree.gz
...where "1234" is the NCBI ID of the organism and tree.taxtree.gz is made from NCBI's taxdump like this:

Code:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
taxtree.sh names.dmp nodes.dmp tree.taxtree.gz
Brian Bushnell is offline   Reply With Quote
Old 12-08-2016, 05:53 AM   #13
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default

Hi Brian,


First, thanks for your continuous effort.


Quote:
Those are common contaminant microbes that we encounter in sequencing
Just for curiosity, How do you (or the center) compiled this list of common bacteria contaminants?

Thanks again
Vicente

Last edited by vingomez; 12-08-2016 at 06:03 AM. Reason: Deleted sentence -mistake in names of files
vingomez is offline   Reply With Quote
Old 12-08-2016, 08:56 AM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

Hi Vicente,

Every project we sequence gets BLASTed to nt/nr and various RefSeq databases. All of the non-target hits are tracked. Then, a few months ago, someone manually went through and examined the non-target hits to make a list of the ones that commonly occurred.

Then, I took the list and expanded it slightly to include other microbes which have very high identity to the microbes on the list. For example, E.coli was detected as a common contaminant; but Shigella and Klebsiella are 100% identical to E.coli over large portions of the genome (they are basically strains of E.coli), meaning there is no way to ensure a Shigella library is uncontaminated, for example; and it's difficult to ensure that our BLAST hits to E.coli were not, in fact, Shigella or Klebsiella. So, the final list is 35 microbes plus Lambda phage which is a contaminant in one of our reagents (and shares sequence with E.coli so they are hard to distinguish), but many of them are just different strains (3 strains of Pseudomonas fluorescens, for example). They are generally either human-associated (like E.coli) or associated with laboratories or reagents (like, again, E.coli).
Brian Bushnell is offline   Reply With Quote
Old 07-14-2017, 02:19 PM   #15
ybukhman
Junior Member
 
Location: Madison, WI, USA

Join Date: Jan 2011
Posts: 4
Default

Hi Brian,

is it appropriate to use the bacterial contaminant files, fusedEPmasked2.fa.gz
and/or fusedERPBBmasked2.fa.gz, to filter soil metagenomics reads?

Thank you for your very useful tools and discussions!

Yury
ybukhman is offline   Reply With Quote
Old 07-14-2017, 02:46 PM   #16
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 196
Default

I would caution against blindly filtering your metagenomic data against any database of contaminants. Ideally, you would have negative controls run alongside your samples that could be checked for the presence of these contaminants instead....
fanli is offline   Reply With Quote
Old 07-14-2017, 04:01 PM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

I won't say either one of you is right or wrong, and negative controls are always a good idea. But, the reason I put together the bacterial contaminant file is because JGI is not capable of distinguishing between actual samples, and contaminants in that file. With sufficient amplification (like single cells), some wells may have high levels of a contaminant that is zero in other wells, since it only takes one particle. Some of them, like Pseudomonas, are present in reagents. Others, like E.coli, are present on human skin, and often make their way into the libraries.

I have seen dozens of posters that incorrectly claim Pseudomonas or various other common contaminants are endemic to some environment. But it's likely artifact of poor quality control. So, I encourage you to be very cautious.

*Edit. For reference, JGI no longer sequences anything on that list.

Last edited by Brian Bushnell; 07-17-2017 at 08:58 AM.
Brian Bushnell is offline   Reply With Quote
Old 07-15-2017, 07:15 AM   #18
ybukhman
Junior Member
 
Location: Madison, WI, USA

Join Date: Jan 2011
Posts: 4
Default RE: filtering soil metagenomics data

Thank you, fanli and Brian! I guess one possibility is not to filter initially, but then check the final assembly against the contaminant files just to find out if some of the species that I am detecting are known contaminants?
ybukhman is offline   Reply With Quote
Old 07-17-2017, 08:56 AM   #19
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,612
Default

That's certainly a good possibility. If you have zero or trivial amounts of common contaminants, don't bother filtering. If you have a lot... then, try to figure out whether you have an exact strain match, which greatly boosts the likelihood of it originating in a reagent.
Brian Bushnell is offline   Reply With Quote
Old 07-17-2017, 12:45 PM   #20
ybukhman
Junior Member
 
Location: Madison, WI, USA

Join Date: Jan 2011
Posts: 4
Default

Thanks a lot, Brian! I'll give it a try.
ybukhman is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:35 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO