SEQanswers

01-28-2016, 12:45 AM   #1
danova
Member
Location: France
Join Date: Sep 2010
Posts: 27

bbduk with a large reference database

Hi,
I would like to check for contaminants using both phiX and the human genome. My data is metagenomic, and I want to remove any read that maps to either phiX or the human genome.

So far bbduk can handle the phiX part with ref=phiX.fa.
However, to check for human contamination I would like to use the non-redundant nucleotide database. It is split into small pieces, and I usually access them through BLAST using the nt.nal alias file.

Is that also feasible with bbduk?

01-28-2016, 01:03 AM   #2
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,978

I don't completely understand what you mean by "i would like to use the non redundant nucleotide database" to remove contamination from human samples. It may still be easier to do what you have been doing (separate human reads from other stuff).

You should be able to use BBSplit or seal, which can accept a folder of references. Whether BBSplit can accept a "nr" size folder may need to be experimented with.

01-28-2016, 01:25 AM   #3
danova
Member
Location: France
Join Date: Sep 2010
Posts: 27

Sorry for the confusion. I was mixing this up with large BLAST databases (.nal file). bbduk does its own indexing, so there is no way to use BLAST index databases.

Which human database do people most frequently use to discard human contamination reads from metagenomes? I thought of using the nt database (nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg)).

01-28-2016, 03:00 AM   #4
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by danova
Sorry for the confusion. I was mixing this up with large BLAST databases (.nal file). bbduk does its own indexing, so there is no way to use BLAST index databases.

Which human database do people most frequently use to discard human contamination reads from metagenomes? I thought of using the nt database (nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg)).

Correct, for the first question/comment.

You can just use the human genome sequence (a multi-FASTA file with all chromosomes concatenated into a single file, from UCSC/Ensembl/NCBI/iGenomes) with bbduk (or bbsplit). BBSplit may be better since you can bin all sequences that align to human in one file and capture the rest of the data in a second output file.
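
As a rough illustration of the two options above, the following Python sketch only assembles the command lines and prints them; it does not run anything. The helper functions and all file names (reads.fq, phiX.fa, human.fa and the output names) are hypothetical, and k=31 with hdist=1 are illustrative values, not settings recommended in this thread. Check the bbduk.sh and bbsplit.sh help text for the options available in your BBTools version.

```python
import shlex


def bbduk_filter_cmd(reads, clean, matched, refs, k=31, hdist=1):
    """Build a bbduk.sh command that separates reads sharing k-mers with refs.

    All file names are placeholders; k and hdist are illustrative values.
    """
    return [
        "bbduk.sh",
        f"in={reads}",
        f"out={clean}",            # reads without a reference k-mer match
        f"outm={matched}",         # reads that matched a reference
        f"ref={','.join(refs)}",   # comma-separated list of reference FASTA files
        f"k={k}",
        f"hdist={hdist}",
    ]


def bbsplit_bin_cmd(reads, refs, basename="bin_%.fq", unmatched="unmatched.fq"):
    """Build a bbsplit.sh command that bins reads per reference.

    The % in basename is replaced by the reference name; reads that map to
    no reference go to the outu file.
    """
    return [
        "bbsplit.sh",
        f"in={reads}",
        f"ref={','.join(refs)}",
        f"basename={basename}",
        f"outu={unmatched}",
    ]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    print(shlex.join(bbduk_filter_cmd(
        "reads.fq", "clean.fq", "matched.fq", ["phiX.fa", "human.fa"])))
    print(shlex.join(bbsplit_bin_cmd("reads.fq", ["human.fa"])))
```

Printing the command first makes it easy to review before passing the same list to subprocess.run on a machine where BBTools is installed.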

01-28-2016, 04:38 AM   #5
danova
Member
Location: France
Join Date: Sep 2010
Posts: 27

Great, I'll work on that... combining with bbsplit.
Thanks

01-28-2016, 04:55 PM   #6
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

After using BBDuk for PhiX removal, the protocol JGI uses for human removal is this, with BBMap and a masked human reference. Using BBSplit is strictly better if you know your intended organism's genome. But JGI rarely knows that, which is why we are sequencing it.

You can download the masked human reference from the link provided. It constitutes around 98% of the human genome. That means some reads will intentionally slip through, in regions that are highly conserved down to early eukaryotes, or those with very low complexity. But, the point is to remove virtually all human contamination with no risk of false positives. If you absolutely need to remove ALL human contamination and don't know the organism's genome, you should use the unmasked reference, and you probably will get some false positive removals.

For assembly of a new organism, I think it is best to remove human contaminants using the above very safe procedure, then assemble, then BLAST the assembly and remove anything long (say, >400bp) that hits human with >98% identity, and hits nothing else other than other primates (typically chimp, gorilla, and orangutan).
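
The post-assembly BLAST filter described in the previous paragraph could be sketched roughly as follows in Python. This is only an illustration under several assumptions: the hits are assumed to be already parsed into (contig, organism, percent identity, alignment length) records, the Hit type and function name are hypothetical, and the primate name set is an example that would need to match whatever organism labels your BLAST output actually reports.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One BLAST hit for a contig (hypothetical, pre-parsed record)."""
    contig: str
    organism: str
    pident: float   # percent identity
    length: int     # alignment length in bp


HUMAN = "Homo sapiens"
# Illustrative set only; adapt to the names present in your BLAST results.
OTHER_PRIMATES = {"Pan troglodytes", "Gorilla gorilla", "Pongo abelii"}


def contigs_to_remove(hits: Iterable[Hit],
                      min_len: int = 400,
                      min_ident: float = 98.0) -> Set[str]:
    """Return contigs with a long, high-identity human hit and no hits
    to anything other than human or the listed primates."""
    by_contig = defaultdict(list)
    for hit in hits:
        by_contig[hit.contig].append(hit)

    flagged = set()
    for contig, contig_hits in by_contig.items():
        strong_human = any(
            h.organism == HUMAN and h.length > min_len and h.pident > min_ident
            for h in contig_hits
        )
        only_primates = all(
            h.organism == HUMAN or h.organism in OTHER_PRIMATES
            for h in contig_hits
        )
        if strong_human and only_primates:
            flagged.add(contig)
    return flagged


if __name__ == "__main__":
    example = [
        Hit("c1", "Homo sapiens", 99.5, 1200),
        Hit("c1", "Pan troglodytes", 98.7, 1100),
        Hit("c2", "Homo sapiens", 99.0, 900),
        Hit("c2", "Escherichia coli", 85.0, 300),   # non-primate hit keeps c2
        Hit("c3", "Homo sapiens", 99.9, 200),       # too short to count
    ]
    print(sorted(contigs_to_remove(example)))
```

Keeping the thresholds as parameters makes it easy to try values other than 400 bp and 98% identity, which the post gives only as examples.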

Also, note that I do not recommend using nt/nr in any primary decontamination procedure for which you know the possible contaminants (like determining which reads are, specifically, human) - they are incomplete, poorly-curated, and the process becomes extremely slow because they are huge. Rather, using the references (or masked versions of the references) will give you a better signal-to-noise ratio. nt/nr are much better for diagnosing which things may be present than actually removing them.

Since you're doing metagenomics, using an unmasked human genome is probably fine since humans and bacteria are very dissimilar. But, unless you are doing a human-related microbiome, you might consider removing common human-associated microbes such as E. coli and Salmonella. They seem to be anywhere humans are. Masking things like ribosomes is probably prudent if you do this. There are also some others like Delftia and Pseudomonas that seem to be common sequencing contaminants and cause problems with metagenome analysis, as they seem to show up everywhere, even if human-related DNA is not present, and even in single-cell experiments of other species. Anyway, something to consider.

01-28-2016, 11:45 PM   #7
danova
Member
Location: France
Join Date: Sep 2010
Posts: 27

Thanks Brian,

Thanks for the masked version of hg19. Do you also have a masked version of hg38?

Just another quick question: have you published BBMap, or how should I cite your software?

01-29-2016, 03:59 AM   #8
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,978

You can use bbmask.sh from BBMap to create a masked version of hg38.

BBMap has not been published yet. In the past @Brian has asked people to cite the project's SourceForge (http://sourceforge.net/projects/bbmap/) website in publications.

01-29-2016, 08:56 PM   #9
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707

I would not worry about HG19 versus HG38 for the purposes of contaminant removal. They mainly differ in their coordinates, not contents.

All times are GMT -8.