Contaminated DNA leads to lower-quality results, no matter what study you're doing. One of the most, if not the most, common contaminants in DNA sequencing is human. If you are doing human research, then there's not much you can do about it, but for everyone researching non-vertebrate organisms, read on -
JGI processes mainly plants, fungi, and microbes; never vertebrates. In an attempt to create the cleanest assemblies possible, I wrote a pipeline for removing human contamination. My first naive implementation mapped data to the human reference, and kept only the unmapped reads. However, this caused problems:
1) The human reference (HG19) appears to contain some contaminant sequences from other organisms.
2) Certain features, particularly ribosomes, are so highly conserved at the nucleotide level that humans have 100% identity with even plants and fungi over ~100bp window.
3) There are many low-complexity sequences like ACACACACACTCTCTCTC... that share near-100% identity with many other organisms.
Therefore, this approach yielded false positives, which could result in worse assemblies (because they would break at the places homologous to human). Plus, they'd be missing important features.
So, I created a masked version of HG19 to prevent false positives when removing human contamination. It was masked in 3 ways, using a program I developed called BBMask:
1) Areas containing multiple repeat short kmers.
2) Windows of low entropy, calculated using pentamer frequencies.
3) Via mapping from sam files:
The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit.
This masked a total of under 1.4% of the genome, implying that the remaining sequence should still capture 98.4% of human reads.
To test if it works, I shredded (to 80bp) the wild rice and drosophila references (neither included in my screening) and mapped them to masked human with BBMap, with low sensitivity settings.
1 drosophila shred mapped at 92.5% identity, and 2 wild rice shreds mapped with 93.75% identity. After increasing the shred length to 90, or min identity to 95%, nothing mapped. So I have increased the minimum percent identity to 95%; for reads of 150bp or longer, I don't expect any false positives on non-animals, and probably none on invertebrates.
So, this is the final command line:
bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/hg19masked/ qtrim=rl trimq=10 untrim -Xmx23g in=reads.fq outu=clean.fq outm=human.fq
You first have to index the reference, like this:
bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz -Xmx23g
The masked reference (880 MB) is available here:
I hope this will help others! Feedback is welcome. BBMask is very flexible, too, if you want to do your own masking on other organisms.
JGI processes mainly plants, fungi, and microbes; never vertebrates. In an attempt to create the cleanest assemblies possible, I wrote a pipeline for removing human contamination. My first naive implementation mapped data to the human reference, and kept only the unmapped reads. However, this caused problems:
1) The human reference (HG19) appears to contain some contaminant sequences from other organisms.
2) Certain features, particularly ribosomes, are so highly conserved at the nucleotide level that humans have 100% identity with even plants and fungi over ~100bp window.
3) There are many low-complexity sequences like ACACACACACTCTCTCTC... that share near-100% identity with many other organisms.
Therefore, this approach yielded false positives, which could result in worse assemblies (because they would break at the places homologous to human). Plus, they'd be missing important features.
So, I created a masked version of HG19 to prevent false positives when removing human contamination. It was masked in 3 ways, using a program I developed called BBMask:
1) Areas containing multiple repeat short kmers.
2) Windows of low entropy, calculated using pentamer frequencies.
3) Via mapping from sam files:
The mapping is the more interesting approach. I used every fungal assembly on JGI's Mycocosm, every plant genome on JGI's phytozome, and a handful of others, including zebra danio (the closest relative to human I included), and the entire silva ribosomal database. First, I completely removed the assemblies that contained human contamination (a handful). Then, I shredded the remaining data into 70-80bp pieces, mapped it to human requiring a minimum of ~85% identity, and used BBMask to mask everything the shreds hit.
This masked a total of under 1.4% of the genome, implying that the remaining sequence should still capture 98.4% of human reads.
To test if it works, I shredded (to 80bp) the wild rice and drosophila references (neither included in my screening) and mapped them to masked human with BBMap, with low sensitivity settings.
1 drosophila shred mapped at 92.5% identity, and 2 wild rice shreds mapped with 93.75% identity. After increasing the shred length to 90, or min identity to 95%, nothing mapped. So I have increased the minimum percent identity to 95%; for reads of 150bp or longer, I don't expect any false positives on non-animals, and probably none on invertebrates.
So, this is the final command line:
bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=/path/to/hg19masked/ qtrim=rl trimq=10 untrim -Xmx23g in=reads.fq outu=clean.fq outm=human.fq
You first have to index the reference, like this:
bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz -Xmx23g
The masked reference (880 MB) is available here:
I hope this will help others! Feedback is welcome. BBMask is very flexible, too, if you want to do your own masking on other organisms.
Comment