Unconfigured Ad

**tcezard** · 04-18-2014, 01:33 AM

Hi Brian,
That looks very interesting.
Did you do any test with Phix.
We routinely remove Phix read by aligning all reads the the reference.
Obviously Phix is very small so the chance of false positive is much smaller but I've not seen this tested.
Especially when using long read and more sensitive aligner like bwa mem.

Cheers

**Brian Bushnell** · 04-18-2014, 10:20 AM

I use BBDuk to remove phiX, with 31mers and a max hamming distance of 1. phiX is so tiny that I don't expect any false positives - I have never found any in my synthetic tests with those settings. Obviously, it might be a different matter if you were specifically studying viruses.

bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=phix.fasta k=31 hdist=1

You could of course use the same procedure I used with human to mask phix, and map to it rather than doing kmer-based removal, but that's probably a waste of time.

**muol** · 07-03-2014, 12:05 PM

Hi Brian,

Would you recommend to use BBDuk for pre-filtering reads from microbiome samples? I thought about cleaning out phiX contaminants as described above, followed by extracting 16S rDNA reads by mapping against the qiime rep_set (~ 400k rDNAs) and retrieving matching reads via outmatch=xxx, hdist=0, k=31.

Thanks
Olaf

**Brian Bushnell** · 07-03-2014, 12:41 PM

Olaf,

BBDuk is great for removal of known sequences. As for specifically isolating rDNA, I expect it would catch most of it, but I'm not sure about the highly variable regions (it depends on how long and how variable they are). If the insert size is long enough so that at least one of the reads is usually expected to be in a highly conserved region then it should work fine.

I would suggest filtering as you describe above, but also catching the non-matching reads with "outu=xxx". Then you can map those against the database with a low identity requirement and see if anything was missed.

BBMap also supports the "outm" and "outu" flags. In this case I don't know whether mapping or kmer-filtering would do better. You could even do both, first running BBDuk then mapping the leftovers and merging the resultant files:

bbduk.sh in=reads.fq outm=matched.fq outu=unmatched.fq ref=ribo.fa k=31

bbmap.sh in=unmatched.fq ref=ribo.fa nodisk outm=matched2.fq outu=unmatched2.fq maxindel=20 minid=0.7

cat matched.fq matched2.fq > combined.fq

I guess it depends on whether you are more worried about false positives or false negatives. Oh, and mapping might be incredibly slow on that kind of reference where a read can align to 400k different places equally well.

**muol** · 07-03-2014, 12:59 PM

Thanks,
the phiX filtering step worked well. It reported 5% phiX content, which is pretty close to our requested 4% spike for this run.

The second step mapped 99.7% of the filtered reads to 16S, which would be fantastic.

Olaf

**Naarkhoo** · 11-22-2015, 02:53 AM

indexing the reference

I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)

**GenoMax** · 11-22-2015, 05:08 AM

Originally posted by Naarkhoo View Post

I have indexed the referenced as you described . Now a folder, "ref" is generated which seems only include until chromosome 7 ... is it how it should be ?! ((is it a right place to ask these sort of questions?)

The "ref" folder stores the indexes in its own unique way in "genome" and "index" directories. You can't just look in those folder to see what is there nor depend on the file names you see there.

For example in hg19 BBMap index there are these files on my system:

Code:

chr1-3_index_k13_c2_b1.block
chr1-3_index_k13_c2_b1.block2.gz
chr4-7_index_k13_c2_b1.block
chr4-7_index_k13_c2_b1.block2.gz

Have you tried doing a masking run and found results only from chromosome 7?

**robalba1** · 08-15-2016, 09:52 AM

Masked ref for plant organelles?

Hi Brian,

I am looking for a good way to ensure all mito- and chloro-genome reads are removed from my read data prior to a de novo assembly. Any chance that you have already generated a masked reference file for the genomes from plant mitochondria or chloroplasts?

Rob

**GenoMax** · 08-15-2016, 09:59 AM

You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

I assume you have seen BBsplit, which can be used for this purpose.

Description: Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads.
By default this program will mask using entropy with a window=80 and entropy=0.75

Usage: bbmask.sh in=<file> out=<file> sam=<file,file,...file>

Input may be stdin or a fasta or fastq file, raw or gzipped.[

window=80 (w) Window size for entropy calculation.
entropy=0.70 (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything.

**Brian Bushnell** · 08-15-2016, 10:27 AM

And, sorry, but nope - I have not. Chloroplasts are one thing (and I don't know much about them), but mito is another... there are a lot of genes that can move back and forth between mito and the host organism.

I find the best way to handle this is de-novo assembly chloroplast and mitochondria independently based on depth-filtering, then you can pull out the reads mapping to those asemblies prior to main-genome assembly. Both should be present at much higher depth than the rest of the genome.

For assembling mitochondria from fungal genomes I wrote a script like this:

Code:

kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt

primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
cutoff=$(( $primary * 3 ))

bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45

kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2

mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`

upper=$((mitopeak * 6 / 3))
lower=$((mitopeak * 3 / 7))
mcs=$((mitopeak * 3 / 4))
mincov=$((mitopeak * 2 / 3))

tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100

bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05

tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6

ln -s contigs_bbd.fa contigs.fa

Note that this is for 1050bp reads; for shorter ones you may need shorter kmers. And you may need to adjust the GC cutoff for chloroplasts as well. Also, fungi are simpler as they are haploid or diploid and only have mitochondria rather than chloroplasts also.

**Brian Bushnell** · 12-07-2016, 05:30 PM

I've uploaded a few more files to the Google drive:

https://drive.google.com/file/d/0B3llHR93L14wTHdWRG55c2hPUXM/view?usp=sharing

https://drive.google.com/file/d/0B3llHR93L14wYmJYNm9EbkhMVHM/view?usp=sharing

https://drive.google.com/file/d/0B3llHR93L14wOXJhWXRlZjBpVUU/view?usp=sharing

https://drive.google.com/file/d/0B3llHR93L14wNkxnSk0wOUZubk0/view?usp=sharing

https://drive.google.com/file/d/0B3llHR93L14wZ1N6akxrSW16Z0U/view?usp=sharing

Those are masked versions of the cat, dog, and mouse genomes. I also added two of bacteria:

fusedEPmasked2.fa.gz
fusedERPBBmasked2.fa.gz

Those are common contaminant microbes that we encounter in sequencing. For Eukaryotic genomes, I suggest mapping against fusedEPmasked2, in which the bacteria are masked for entropy and plastids (e.g. chloroplast) only. The other one (fusedERPBBmasked2) is intended for Prokaryotic assembly and is masked for conserved regions in bacteria, including ribosomes. If you want to use it for filtering out common laboratory/human/reagent-associated microbes, it's useful to ensure that your bacteria of interest is not on the list

If you know what organism you are sequencing, you can use the tool "filterbytaxa.sh" to create a filtered version that file after removing all sequences from organisms in the same family, like this:

Code:

filterbytaxa.sh in=fusedERPBBmasked2.fa.gz out=taxfiltered.fa.gz include=f ids=1234 tree=tree.taxtree.gz

...where "1234" is the NCBI ID of the organism and tree.taxtree.gz is made from NCBI's taxdump like this:

Code:

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
taxtree.sh names.dmp nodes.dmp tree.taxtree.gz

**vingomez** · 12-08-2016, 06:53 AM

Hi Brian,

First, thanks for your continuous effort.

Those are common contaminant microbes that we encounter in sequencing

Just for curiosity, How do you (or the center) compiled this list of common bacteria contaminants?

Thanks again
Vicente

**Brian Bushnell** · 12-08-2016, 09:56 AM

Hi Vicente,

Every project we sequence gets BLASTed to nt/nr and various RefSeq databases. All of the non-target hits are tracked. Then, a few months ago, someone manually went through and examined the non-target hits to make a list of the ones that commonly occurred.

Then, I took the list and expanded it slightly to include other microbes which have very high identity to the microbes on the list. For example, E.coli was detected as a common contaminant; but Shigella and Klebsiella are 100% identical to E.coli over large portions of the genome (they are basically strains of E.coli), meaning there is no way to ensure a Shigella library is uncontaminated, for example; and it's difficult to ensure that our BLAST hits to E.coli were not, in fact, Shigella or Klebsiella. So, the final list is 35 microbes plus Lambda phage which is a contaminant in one of our reagents (and shares sequence with E.coli so they are hard to distinguish), but many of them are just different strains (3 strains of Pseudomonas fluorescens, for example). They are generally either human-associated (like E.coli) or associated with laboratories or reagents (like, again, E.coli).

**ybukhman** · 07-14-2017, 02:19 PM

Hi Brian,

is it appropriate to use the bacterial contaminant files, fusedEPmasked2.fa.gz
and/or fusedERPBBmasked2.fa.gz, to filter soil metagenomics reads?

Thank you for your very useful tools and discussions!

Yury

Topics	Statistics	Last Post
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 99 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 119 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 112 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM

Unconfigured Ad

Introducing RemoveHuman: Human Contaminant Removal

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News