![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Contaminant in dscDNA sample? | vanillasky | Sample Prep / Library Generation | 2 | 06-19-2014 11:17 PM |
Heuristic search for contaminant motifs | k-gun12 | Bioinformatics | 0 | 03-14-2013 10:50 AM |
rRNA contaminant screening aligning small RNA | honey | Bioinformatics | 0 | 05-31-2012 08:32 PM |
18bp DNA contaminant in ChIP input sample? | sinapius | Sample Prep / Library Generation | 0 | 02-09-2011 12:39 PM |
BWA and screening illumina reads for contaminant | jmartin | Bioinformatics | 3 | 04-22-2010 04:55 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Contaminated DNA leads to lower-quality results, no matter what study you're doing. One of the most, if not the most, common contaminants in DNA sequencing is human. If you are doing human research, then there's not much you can do about it, but for everyone researching non-vertebrate organisms, read on -
JGI processes mainly plants, fungi, and microbes; never vertebrates. In an attempt to create the cleanest assemblies possible, I wrote a pipeline for removing human contamination. My first naive implementation mapped data to the human reference, and kept only the unmapped reads. However, this caused problems: 1) The human reference (HG19) appears to contain some contaminant sequences from other organisms. 2) Certain features, particularly ribosomes, are so highly conserved at the nucleotide level that humans have 100% identity with even plants and fungi over ~100bp window. 3) There are many low-complexity sequences like ACACACACACTCTCTCTC... that share near-100% identity with many other organisms. Therefore, this approach yielded false positives, which could result in worse assemblies (because they would break at the places homologous to human). Plus, they'd be missing important features. So, I created a masked version of HG19 to prevent false positives when removing human contamination. It was masked in 3 ways, using a program I developed called BBMask: 1) Areas containing multiple repeat short kmers. 2) Windows of low entropy, calculated using pentamer frequencies. 3) Via mapping from sam files: The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit. This masked a total of under 1.4% of the genome, implying that the remaining sequence should still capture 98.4% of human reads. To test if it works, I shredded (to 80bp) the wild rice and drosophila references (neither included in my screening) and mapped them to masked human with BBMap, with low sensitivity settings. 1 drosophila shred mapped at 92.5% identity, and 2 wild rice shreds mapped with 93.75% identity. After increasing the shred length to 90, or min identity to 95%, nothing mapped. So I have increased the minimum percent identity to 95%; for reads of 150bp or longer, I don't expect any false positives on non-animals, and probably none on invertebrates. So, this is the final command line: bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/hg19masked/ qtrim=rl trimq=10 untrim -Xmx23g in=reads.fq outu=clean.fq outm=human.fq You first have to index the reference, like this: bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz -Xmx23g The masked reference (880 MB) is available here: https://drive.google.com/file/d/0B3l...it?usp=sharing I hope this will help others! Feedback is welcome. BBMask is very flexible, too, if you want to do your own masking on other organisms. Last edited by Brian Bushnell; 10-29-2015 at 09:37 PM. |
![]() |
![]() |
![]() |
#2 |
Member
Location: Edinburgh Join Date: Dec 2008
Posts: 13
|
![]()
Hi Brian,
That looks very interesting. Did you do any test with Phix. We routinely remove Phix read by aligning all reads the the reference. Obviously Phix is very small so the chance of false positive is much smaller but I've not seen this tested. Especially when using long read and more sensitive aligner like bwa mem. Cheers |
![]() |
![]() |
![]() |
#3 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
I use BBDuk to remove phiX, with 31mers and a max hamming distance of 1. phiX is so tiny that I don't expect any false positives - I have never found any in my synthetic tests with those settings. Obviously, it might be a different matter if you were specifically studying viruses.
bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=phix.fasta k=31 hdist=1 You could of course use the same procedure I used with human to mask phix, and map to it rather than doing kmer-based removal, but that's probably a waste of time. |
![]() |
![]() |
![]() |
#4 |
Member
Location: USA Join Date: Jun 2012
Posts: 10
|
![]()
Hi Brian,
Would you recommend to use BBDuk for pre-filtering reads from microbiome samples? I thought about cleaning out phiX contaminants as described above, followed by extracting 16S rDNA reads by mapping against the qiime rep_set (~ 400k rDNAs) and retrieving matching reads via outmatch=xxx, hdist=0, k=31. Thanks Olaf |
![]() |
![]() |
![]() |
#5 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Olaf,
BBDuk is great for removal of known sequences. As for specifically isolating rDNA, I expect it would catch most of it, but I'm not sure about the highly variable regions (it depends on how long and how variable they are). If the insert size is long enough so that at least one of the reads is usually expected to be in a highly conserved region then it should work fine. I would suggest filtering as you describe above, but also catching the non-matching reads with "outu=xxx". Then you can map those against the database with a low identity requirement and see if anything was missed. BBMap also supports the "outm" and "outu" flags. In this case I don't know whether mapping or kmer-filtering would do better. You could even do both, first running BBDuk then mapping the leftovers and merging the resultant files: bbduk.sh in=reads.fq outm=matched.fq outu=unmatched.fq ref=ribo.fa k=31 bbmap.sh in=unmatched.fq ref=ribo.fa nodisk outm=matched2.fq outu=unmatched2.fq maxindel=20 minid=0.7 cat matched.fq matched2.fq > combined.fq I guess it depends on whether you are more worried about false positives or false negatives. Oh, and mapping might be incredibly slow on that kind of reference where a read can align to 400k different places equally well. |
![]() |
![]() |
![]() |
#6 |
Member
Location: USA Join Date: Jun 2012
Posts: 10
|
![]()
Thanks,
the phiX filtering step worked well. It reported 5% phiX content, which is pretty close to our requested 4% spike for this run. The second step mapped 99.7% of the filtered reads to 16S, which would be fantastic. Olaf |
![]() |
![]() |
![]() |
#7 |
Member
Location: Unja Join Date: Jan 2013
Posts: 11
|
![]()
I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)
|
![]() |
![]() |
![]() |
#8 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,082
|
![]() Quote:
For example in hg19 BBMap index there are these files on my system: Code:
chr1-3_index_k13_c2_b1.block chr1-3_index_k13_c2_b1.block2.gz chr4-7_index_k13_c2_b1.block chr4-7_index_k13_c2_b1.block2.gz |
|
![]() |
![]() |
![]() |
#9 |
Junior Member
Location: St Louis Join Date: May 2012
Posts: 4
|
![]()
Hi Brian,
I am looking for a good way to ensure all mito- and chloro-genome reads are removed from my read data prior to a de novo assembly. Any chance that you have already generated a masked reference file for the genomes from plant mitochondria or chloroplasts? Rob |
![]() |
![]() |
![]() |
#10 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,082
|
![]()
You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.
I assume you have seen BBsplit, which can be used for this purpose. Quote:
Last edited by GenoMax; 08-15-2016 at 11:05 AM. |
|
![]() |
![]() |
![]() |
#11 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
And, sorry, but nope - I have not. Chloroplasts are one thing (and I don't know much about them), but mito is another... there are a lot of genes that can move back and forth between mito and the host organism.
I find the best way to handle this is de-novo assembly chloroplast and mitochondria independently based on depth-filtering, then you can pull out the reads mapping to those asemblies prior to main-genome assembly. Both should be present at much higher depth than the rest of the genome. For assembling mitochondria from fungal genomes I wrote a script like this: Code:
kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"` cutoff=$(( $primary * 3 )) bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999 reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45 kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2 mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"` upper=$((mitopeak * 6 / 3)) lower=$((mitopeak * 3 / 7)) mcs=$((mitopeak * 3 / 4)) mincov=$((mitopeak * 2 / 3)) tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100 bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05 tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6 ln -s contigs_bbd.fa contigs.fa Last edited by Brian Bushnell; 08-15-2016 at 11:30 AM. |
![]() |
![]() |
![]() |
#12 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
I've uploaded a few more files to the Google drive:
https://drive.google.com/file/d/0B3l...ew?usp=sharing https://drive.google.com/file/d/0B3l...ew?usp=sharing https://drive.google.com/file/d/0B3l...ew?usp=sharing https://drive.google.com/file/d/0B3l...ew?usp=sharing https://drive.google.com/file/d/0B3l...ew?usp=sharing Those are masked versions of the cat, dog, and mouse genomes. I also added two of bacteria: fusedEPmasked2.fa.gz fusedERPBBmasked2.fa.gz Those are common contaminant microbes that we encounter in sequencing. For Eukaryotic genomes, I suggest mapping against fusedEPmasked2, in which the bacteria are masked for entropy and plastids (e.g. chloroplast) only. The other one (fusedERPBBmasked2) is intended for Prokaryotic assembly and is masked for conserved regions in bacteria, including ribosomes. If you want to use it for filtering out common laboratory/human/reagent-associated microbes, it's useful to ensure that your bacteria of interest is not on the list ![]() Code:
filterbytaxa.sh in=fusedERPBBmasked2.fa.gz out=taxfiltered.fa.gz include=f ids=1234 tree=tree.taxtree.gz Code:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip unzip taxdmp.zip taxtree.sh names.dmp nodes.dmp tree.taxtree.gz |
![]() |
![]() |
![]() |
#13 | |
Member
Location: USA Join Date: Sep 2014
Posts: 18
|
![]()
Hi Brian,
First, thanks for your continuous effort. Quote:
Thanks again Vicente Last edited by vingomez; 12-08-2016 at 07:03 AM. Reason: Deleted sentence -mistake in names of files |
|
![]() |
![]() |
![]() |
#14 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
Hi Vicente,
Every project we sequence gets BLASTed to nt/nr and various RefSeq databases. All of the non-target hits are tracked. Then, a few months ago, someone manually went through and examined the non-target hits to make a list of the ones that commonly occurred. Then, I took the list and expanded it slightly to include other microbes which have very high identity to the microbes on the list. For example, E.coli was detected as a common contaminant; but Shigella and Klebsiella are 100% identical to E.coli over large portions of the genome (they are basically strains of E.coli), meaning there is no way to ensure a Shigella library is uncontaminated, for example; and it's difficult to ensure that our BLAST hits to E.coli were not, in fact, Shigella or Klebsiella. So, the final list is 35 microbes plus Lambda phage which is a contaminant in one of our reagents (and shares sequence with E.coli so they are hard to distinguish), but many of them are just different strains (3 strains of Pseudomonas fluorescens, for example). They are generally either human-associated (like E.coli) or associated with laboratories or reagents (like, again, E.coli). |
![]() |
![]() |
![]() |
#15 |
Junior Member
Location: Madison, WI, USA Join Date: Jan 2011
Posts: 6
|
![]()
Hi Brian,
is it appropriate to use the bacterial contaminant files, fusedEPmasked2.fa.gz and/or fusedERPBBmasked2.fa.gz, to filter soil metagenomics reads? Thank you for your very useful tools and discussions! Yury |
![]() |
![]() |
![]() |
#16 |
Senior Member
Location: California Join Date: Jul 2014
Posts: 198
|
![]()
I would caution against blindly filtering your metagenomic data against any database of contaminants. Ideally, you would have negative controls run alongside your samples that could be checked for the presence of these contaminants instead....
|
![]() |
![]() |
![]() |
#17 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
I won't say either one of you is right or wrong, and negative controls are always a good idea. But, the reason I put together the bacterial contaminant file is because JGI is not capable of distinguishing between actual samples, and contaminants in that file. With sufficient amplification (like single cells), some wells may have high levels of a contaminant that is zero in other wells, since it only takes one particle. Some of them, like Pseudomonas, are present in reagents. Others, like E.coli, are present on human skin, and often make their way into the libraries.
I have seen dozens of posters that incorrectly claim Pseudomonas or various other common contaminants are endemic to some environment. But it's likely artifact of poor quality control. So, I encourage you to be very cautious. *Edit. For reference, JGI no longer sequences anything on that list. Last edited by Brian Bushnell; 07-17-2017 at 09:58 AM. |
![]() |
![]() |
![]() |
#18 |
Junior Member
Location: Madison, WI, USA Join Date: Jan 2011
Posts: 6
|
![]()
Thank you, fanli and Brian! I guess one possibility is not to filter initially, but then check the final assembly against the contaminant files just to find out if some of the species that I am detecting are known contaminants?
|
![]() |
![]() |
![]() |
#19 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
|
![]()
That's certainly a good possibility. If you have zero or trivial amounts of common contaminants, don't bother filtering. If you have a lot... then, try to figure out whether you have an exact strain match, which greatly boosts the likelihood of it originating in a reagent.
|
![]() |
![]() |
![]() |
#20 |
Junior Member
Location: Madison, WI, USA Join Date: Jan 2011
Posts: 6
|
![]()
Thanks a lot, Brian! I'll give it a try.
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|