SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
tophat/cufflinks with different read lengths libraries linsson RNA Sequencing 0 07-03-2012 11:21 AM
Mate pairs contaminated with paired ends - impact on assembly? reithme Bioinformatics 2 12-14-2009 12:35 AM

Reply
 
Thread Tools
Old 02-27-2014, 09:57 AM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.

For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

You can get more information about parameters by running bbsplit.sh with no arguments, or reading /bbmap/docs/readme.txt. But I will mention here the inter-reference ambiguity modes, which decide what to do with reads that map to multiple references and pairs where each read maps to a different reference:

ambig2=best
Default. Ambiguous reads go to the first best site.

ambig2=toss
Ambiguous reads are considered unmapped.

ambig2=all
Write a copy to the output for each reference to which it maps.

ambig2=split
Write a copy to the AMBIGUOUS_ output file for each reference to which it maps.

If your OS cannot process bash shellscripts, replace "bbsplit.sh" with "java -Xmx29g -cp /path/to/current align2.BBSplitter", where /path/to/current is the location of the 'current' directory (a subdirectory of bbmap), and -Xmx29g specifies the amount of memory to use (so this would be the command line for a 32GB computer). This should be set to about 85% of physical memory.

BBSplit is extremely fast and highly sensitive, using BBMap for the mapping. So, all flags and features supported by BBMap can be used with BBSplit (aside from sam output).

BBSplit is available here:
https://sourceforge.net/projects/bbmap/

P.S. Some people have asked why BBSplit has a lower alignment rate than BBMap. That is because it has a lower default sensitivity, as the original intent was to bin reads using known assemblies. The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"

Last edited by Brian Bushnell; 09-16-2014 at 09:29 AM.
Brian Bushnell is offline   Reply With Quote
Old 09-16-2014, 07:46 AM   #2
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default

Thanks Brian for your effort in providing bioinformatic applications,

As you mentioned in your post, BBsplit can use two paired-end files as input. In addition to the two files; Can I add a third file (e.g. merged read file) as input?



Thanks again


P.S. In previous post "java -Xmx29g -cp /path/to/current align2.BBSplit", must said align2.BBSplitter"

Last edited by vingomez; 09-16-2014 at 08:22 AM.
vingomez is offline   Reply With Quote
Old 09-16-2014, 09:32 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by vingomez View Post
Thanks Brian for your effort in providing bioinformatic applications,

As you mentioned in your post, BBsplit can use two paired-end files as input. In addition to the two files; Can I add a third file (e.g. merged read file) as input?
No, you'll have to do that in two runs, one for the paired reads and one for the merged reads. But ultimately that won't affect the output or runtime (other than the fact that the index will need to be loaded twice, and you'll end up with 2 sam files that need to be merged).

Quote:
P.S. In previous post "java -Xmx29g -cp /path/to/current align2.BBSplit", must said align2.BBSplitter"
Fixed, thanks!
Brian Bushnell is offline   Reply With Quote
Old 11-10-2014, 01:08 PM   #4
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

After running bbsplit once using the syntax: ref=ref1.fa,ref2.fa, is it possible to re-use that index on subsequent runs using the path= parameter?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 11-10-2014, 01:37 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Yes, it is. Also, with BBSplit, I think it will try to regenerate the index every time as long as "ref=" is specified, even if it already exists, so only do that once.
Brian Bushnell is offline   Reply With Quote
Old 01-22-2015, 07:38 PM   #6
arkilis
Senior Member
 
Location: Australia

Join Date: Jul 2013
Posts: 119
Default

How to use the bbsplit to check the match details.

i.e. I might want to know the 1 seq from my reads match which seq in the ref? Is it possible to do that?

Cheers,
a
arkilis is offline   Reply With Quote
Old 01-22-2015, 09:35 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi arkilis,

The standard way to split files with bbsplit is like this:

(index)
bbsplit.sh ref=x.fa,y.fa,z.fa,...

(map)
bbsplit.sh in=reads.fq basename=out_%.sam

If you change the name of the output files to ".fq", it will come out in fastq format, which is often more useful, though reformat.sh can convert sam to fastq. Anyway, when you run bbsplit like this, it will produce one output file per reference, containing all of the reads that match that reference better than the others. In this case, it would produce 3 output files - out_x.sam, out_y.sam, and out_z.sam. You could examine the contents of the specific sam files to see exactly which contig/scaffold the read hit, but if you just need the reads split by reference file, that's how you do it.
Brian Bushnell is offline   Reply With Quote
Old 02-24-2015, 12:34 PM   #8
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default

Hi Brian,


Quote:
Originally Posted by Brian Bushnell View Post

For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

For paired reads in 2 files, you would do this:

bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean1.fq and clean2.fq (unmapped reads)


For the example provide above (minor edit for clarity):

(1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?

(2) Will this flag/command work with BBMap?


Thanks
Vicente
vingomez is offline   Reply With Quote
Old 02-24-2015, 12:43 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Vicente,

The output will come out interleaved (alternating read1/read2) if the input was paired. There's currently not a way to make BBSplit write twin output files for the "basename" output streams (though I'll make a note to add that feature). You can de-interleave the output like this:

reformat.sh in=out_ecoli.fq out1=out_ecoli_r1.fq out2=out_ecoli_r2.fq

BBMap does not understand the "basename" flag so it doesn't work for splitting, but BBMap's "out", "outu", and "outm" all allow output of twin files using "out1" and "out2", etc.
Brian Bushnell is offline   Reply With Quote
Old 02-24-2015, 12:47 PM   #10
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default

Thanks for you dedication and rapid response.

Vicente
vingomez is offline   Reply With Quote
Old 02-26-2015, 08:12 AM   #11
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default BBSplit: limit amount of ref

Hi Brian,

I tried BBSplit with four ref files without any success. Is there any limit to the number of ref files?


Code:
java -Xmx29g -cp /path/to/current align2.BBSplitter in1=clean_ecc_100_r1.fastq.gz in2=clean_ecc_100_r2.fastq.gz ref=/path/to/resources/ref1.fasta,/path/to/resources/ref2.fasta,/path/to/resources/ref3.fasta,/path/to/resources/ref4.fasta basename=out_%.fastq.gz out1=unmapped_r1.fastq.gz out2=unmapped_r2.fastq.gz ambig2=split

However it work fine with 3 ref files. I tried with interactions of various group of 3 ref files with success.



P.S. When do you think is better to perform BBSplit after or before ecc/quality trim?


Thanks again
Vicente

Last edited by vingomez; 02-26-2015 at 08:29 AM. Reason: Add a question
vingomez is offline   Reply With Quote
Old 02-26-2015, 11:54 AM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Vicente,

I just tested BBSplit and it worked fine for me:
Code:
bbsplit.sh -Xmx1g ref=Clostridium_perfringensATCC_13124.fasta,Desulfotomaculum_gibsoniae_DSM_7213.fasta,Hirschia_baltica_ATCC_49814.fasta,Nocardiopsis_dassonvillei_DSM_43111.fasta

randomreads.sh -Xmx1g reads=1000 out=reads.fq

bbsplit.sh -Xmx1g in=reads.fq basename=o_%.sam
produced 4 output files:
Code:
[email protected]:/global/scratch2/sd/bushnell/splittest$ ls -l
total 38144
-rw-rw-r-- 1 bushnell bushnell  3311027 Feb 26 10:24 Clostridium_perfringensATCC_13124.fasta
-rw-rw-r-- 1 bushnell bushnell  4952723 Feb 26 10:24 Desulfotomaculum_gibsoniae_DSM_7213.fasta
-rw-rw-r-- 1 bushnell bushnell  3599251 Feb 26 10:23 Hirschia_baltica_ATCC_49814.fasta
-rw-rw-r-- 1 bushnell bushnell  6652569 Feb 26 10:23 Nocardiopsis_dassonvillei_DSM_43111.fasta
-rw-rw-r-- 1 bushnell bushnell    81824 Feb 26 11:43 o_Clostridium_perfringensATCC_13124.sam
-rw-rw-r-- 1 bushnell bushnell   125594 Feb 26 11:43 o_Desulfotomaculum_gibsoniae_DSM_7213.sam
-rw-rw-r-- 1 bushnell bushnell    81422 Feb 26 11:43 o_Hirschia_baltica_ATCC_49814.sam
-rw-rw-r-- 1 bushnell bushnell   181854 Feb 26 11:43 o_Nocardiopsis_dassonvillei_DSM_43111.sam
-rw-rw-r-- 1 bushnell bushnell   352650 Feb 26 11:42 reads.fq
drwxrwsr-x 4 bushnell bushnell      512 Feb 26 10:24 ref
When you say "without success", what exactly happens? And by the way, to catch unmapped reads, you need to set "outu1" and "outu2", not "out1" and "out2".

As for your other question - you can do quality-trimming before or after, but error-correction should be done after. Generally, I would do quality-trimming before if you use BBSplit's default settings, which require high identity, or after if you reduce the identity threshold to, say, minid=0.75 (which is closer to BBMap's default). But error-correction is best done after, on each file individually, so that homologous areas shared between the genomes don't affect each other.
Brian Bushnell is offline   Reply With Quote
Old 02-26-2015, 01:11 PM   #13
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default

Hi Brian,


Sorry for the lack of information. When I said "without success": the program stop and the prompt appear (see below - last 9 lines) . Again only when I used 4 ref files. I used the same commands listed in the previous post (with and without the ambig2=split flag) and used the flags outu1 and outu2 (as you suggested) with the same outcome. Maybe a JAVA conflict?



HTML Code:
Loaded Reference:       0.010 seconds.
Loading index for chunk 1-0, build 1
Generated Index:        0.039 seconds.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
        at align2.BBIndex.analyzeIndex(BBIndex.java:115)
        at align2.BBMap.loadIndex(BBMap.java:405)
        at align2.BBMap.main(BBMap.java:32)
        at align2.BBSplitter.main(BBSplitter.java:45)
sol1:/work/MP/bbmap%
Thanks
Vicente

Last edited by vingomez; 02-26-2015 at 01:22 PM.
vingomez is offline   Reply With Quote
Old 02-26-2015, 04:03 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Vicente,

There appears to be something wrong with the index, or possibly fasta files. Would it be possible for you to zip the references together and email them to me so I can try to replicate this?

-Brian
Brian Bushnell is offline   Reply With Quote
Old 03-02-2015, 01:34 PM   #15
vingomez
Member
 
Location: USA

Join Date: Sep 2014
Posts: 18
Default

Hi Brian,


Quote:
Originally Posted by Brian Bushnell View Post
Hi Vicente,

There appears to be something wrong with the index, or possibly fasta files. Would it be possible for you to zip the references together and email them to me so I can try to replicate this?

-Brian

Definitely, the index files were corrupted. Just deleted the ref directory, repeat index and map - no problem.

Thanks again
vingomez is offline   Reply With Quote
Old 03-12-2015, 03:32 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by vingomez View Post
(1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?
Hi Vicente,

I released a new version of BBTools, 34.64. BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:

bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq

will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq

You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.
Brian Bushnell is offline   Reply With Quote
Old 03-13-2015, 01:02 PM   #17
manuelkleiner
Junior Member
 
Location: Calgary

Join Date: Jun 2014
Posts: 5
Default Problem with argument length

Dear Brian,
I have run into a problem when trying to create an index for bbsplit using a large number of files (~20,000). In the example below the variable $files contains 20,000 comma delimited, file names.

Code:
[email protected][BBMap] bbsplit.sh -Xmx100g threads=20 ref=$files                                                                                                                         [ 2:47PM]
zsh: argument list too long: bbsplit.sh
What I have found out so far is that the linux kernel has a upper limit for the argument length in the shell, which causes this problem. I have tried various workarounds e.g. trying to pipe the file names in from STDIN or creating a bash script. But nothing has worked.
Any suggestions how to get bbsplit to indext such a large number of files before running the splitting?
Maybe one solution would be if one could hand the ref= argument a folder containing all fasta files and then bbsplit writes an index for each .fasta file in the folder!?

Thanks a lot in advance for your help,
Manuel
manuelkleiner is offline   Reply With Quote
Old 03-13-2015, 02:00 PM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Manuel,

I will add the possibility to point "ref=" to a directory in the next release; that's a pretty good idea.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 03-13-2015, 02:35 PM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,576
Default

Brian: Concatenating some files to make the number smaller should make it work now, correct?
GenoMax is online now   Reply With Quote
Old 03-13-2015, 03:10 PM   #20
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Well... the easiest way to use BBSplit is to have it split input into one output file per reference file. Once you concatenate them, you lose that ability. It's technically possible to specify sets of named sequences within a file but that way is too complicated. Using a tiered approach is possible (though a pain):

for references a, b, c, d, e, f:

abc=cat a b c
def=cat d e f

bbsplit ref=abc,def

then re-split the abc pile between a, b, and c, and the def pile between d, e, and f. That will basically cut the command line lengths (and open file handles) by a factor of the square root of the number of references.

But, I will add this feature quickly since it's hard to do otherwise.

Note that if all you care about are statistics of which reads mapped to which scaffolds, then if all the ref sequences have unique names, you can just run BBMap on the concatenated file with the "scafstats" output flag. But for actually splitting the files easily, the references currently need to be in individual files.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
aligners, bbsplit, binning, contaminant, metagenome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:49 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO