Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    Introducing BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

    BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.

    For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

    bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

    This will produce 3 output files:
    out_ecoli.fq (ecoli reads)
    out_salmonella.fq (salmonella reads)
    clean.fq (unmapped reads)

    In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
    bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

    You can get more information about parameters by running bbsplit.sh with no arguments, or reading /bbmap/docs/readme.txt. But I will mention here the inter-reference ambiguity modes, which decide what to do with reads that map to multiple references and pairs where each read maps to a different reference:

    ambig2=best
    Default. Ambiguous reads go to the first best site.

    ambig2=toss
    Ambiguous reads are considered unmapped.

    ambig2=all
    Write a copy to the output for each reference to which it maps.

    ambig2=split
    Write a copy to the AMBIGUOUS_ output file for each reference to which it maps.

    If your OS cannot process bash shellscripts, replace "bbsplit.sh" with "java -Xmx29g -cp /path/to/current align2.BBSplitter", where /path/to/current is the location of the 'current' directory (a subdirectory of bbmap), and -Xmx29g specifies the amount of memory to use (so this would be the command line for a 32GB computer). This should be set to about 85% of physical memory.

    BBSplit is extremely fast and highly sensitive, using BBMap for the mapping. So, all flags and features supported by BBMap can be used with BBSplit (aside from sam output).

    BBSplit is available here:
    Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).


    P.S. Some people have asked why BBSplit has a lower alignment rate than BBMap. That is because it has a lower default sensitivity, as the original intent was to bin reads using known assemblies. The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"
    Last edited by Brian Bushnell; 09-16-2014, 08:29 AM.
  • vingomez
    Member
    • Sep 2014
    • 18

    #2
    Thanks Brian for your effort in providing bioinformatic applications,

    As you mentioned in your post, BBsplit can use two paired-end files as input. In addition to the two files; Can I add a third file (e.g. merged read file) as input?



    Thanks again


    P.S. In previous post "java -Xmx29g -cp /path/to/current align2.BBSplit", must said align2.BBSplitter"
    Last edited by vingomez; 09-16-2014, 07:22 AM.

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #3
      Originally posted by vingomez View Post
      Thanks Brian for your effort in providing bioinformatic applications,

      As you mentioned in your post, BBsplit can use two paired-end files as input. In addition to the two files; Can I add a third file (e.g. merged read file) as input?
      No, you'll have to do that in two runs, one for the paired reads and one for the merged reads. But ultimately that won't affect the output or runtime (other than the fact that the index will need to be loaded twice, and you'll end up with 2 sam files that need to be merged).

      P.S. In previous post "java -Xmx29g -cp /path/to/current align2.BBSplit", must said align2.BBSplitter"
      Fixed, thanks!

      Comment

      • sdriscoll
        I like code
        • Sep 2009
        • 436

        #4
        After running bbsplit once using the syntax: ref=ref1.fa,ref2.fa, is it possible to re-use that index on subsequent runs using the path= parameter?
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          Yes, it is. Also, with BBSplit, I think it will try to regenerate the index every time as long as "ref=" is specified, even if it already exists, so only do that once.

          Comment

          • arkilis
            Senior Member
            • Jul 2013
            • 119

            #6
            How to use the bbsplit to check the match details.

            i.e. I might want to know the 1 seq from my reads match which seq in the ref? Is it possible to do that?

            Cheers,
            a

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Hi arkilis,

              The standard way to split files with bbsplit is like this:

              (index)
              bbsplit.sh ref=x.fa,y.fa,z.fa,...

              (map)
              bbsplit.sh in=reads.fq basename=out_%.sam

              If you change the name of the output files to ".fq", it will come out in fastq format, which is often more useful, though reformat.sh can convert sam to fastq. Anyway, when you run bbsplit like this, it will produce one output file per reference, containing all of the reads that match that reference better than the others. In this case, it would produce 3 output files - out_x.sam, out_y.sam, and out_z.sam. You could examine the contents of the specific sam files to see exactly which contig/scaffold the read hit, but if you just need the reads split by reference file, that's how you do it.

              Comment

              • vingomez
                Member
                • Sep 2014
                • 18

                #8
                Hi Brian,


                Originally posted by Brian Bushnell View Post

                For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

                For paired reads in 2 files, you would do this:

                bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

                This will produce 3 output files:
                out_ecoli.fq (ecoli reads)
                out_salmonella.fq (salmonella reads)
                clean1.fq and clean2.fq (unmapped reads)


                For the example provide above (minor edit for clarity):

                (1) Is there a flag/command to generate paired reads for mapped reads (e.g. out_ecoli_r1.fq out_ecoli_r2.fq, out_salmonella_r1 out_salmonella_r2.fq.fq)?

                (2) Will this flag/command work with BBMap?


                Thanks
                Vicente

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  Hi Vicente,

                  The output will come out interleaved (alternating read1/read2) if the input was paired. There's currently not a way to make BBSplit write twin output files for the "basename" output streams (though I'll make a note to add that feature). You can de-interleave the output like this:

                  reformat.sh in=out_ecoli.fq out1=out_ecoli_r1.fq out2=out_ecoli_r2.fq

                  BBMap does not understand the "basename" flag so it doesn't work for splitting, but BBMap's "out", "outu", and "outm" all allow output of twin files using "out1" and "out2", etc.

                  Comment

                  • vingomez
                    Member
                    • Sep 2014
                    • 18

                    #10
                    Thanks for you dedication and rapid response.

                    Vicente

                    Comment

                    • vingomez
                      Member
                      • Sep 2014
                      • 18

                      #11
                      BBSplit: limit amount of ref

                      Hi Brian,

                      I tried BBSplit with four ref files without any success. Is there any limit to the number of ref files?


                      Code:
                      java -Xmx29g -cp /path/to/current align2.BBSplitter in1=clean_ecc_100_r1.fastq.gz in2=clean_ecc_100_r2.fastq.gz ref=/path/to/resources/ref1.fasta,/path/to/resources/ref2.fasta,/path/to/resources/ref3.fasta,/path/to/resources/ref4.fasta basename=out_%.fastq.gz out1=unmapped_r1.fastq.gz out2=unmapped_r2.fastq.gz ambig2=split

                      However it work fine with 3 ref files. I tried with interactions of various group of 3 ref files with success.



                      P.S. When do you think is better to perform BBSplit after or before ecc/quality trim?


                      Thanks again
                      Vicente
                      Last edited by vingomez; 02-26-2015, 08:29 AM. Reason: Add a question

                      Comment

                      • Brian Bushnell
                        Super Moderator
                        • Jan 2014
                        • 2709

                        #12
                        Hi Vicente,

                        I just tested BBSplit and it worked fine for me:
                        Code:
                        bbsplit.sh -Xmx1g ref=Clostridium_perfringensATCC_13124.fasta,Desulfotomaculum_gibsoniae_DSM_7213.fasta,Hirschia_baltica_ATCC_49814.fasta,Nocardiopsis_dassonvillei_DSM_43111.fasta
                        
                        randomreads.sh -Xmx1g reads=1000 out=reads.fq
                        
                        bbsplit.sh -Xmx1g in=reads.fq basename=o_%.sam
                        produced 4 output files:
                        Code:
                        bushnell@gpint109:/global/scratch2/sd/bushnell/splittest$ ls -l
                        total 38144
                        -rw-rw-r-- 1 bushnell bushnell  3311027 Feb 26 10:24 Clostridium_perfringensATCC_13124.fasta
                        -rw-rw-r-- 1 bushnell bushnell  4952723 Feb 26 10:24 Desulfotomaculum_gibsoniae_DSM_7213.fasta
                        -rw-rw-r-- 1 bushnell bushnell  3599251 Feb 26 10:23 Hirschia_baltica_ATCC_49814.fasta
                        -rw-rw-r-- 1 bushnell bushnell  6652569 Feb 26 10:23 Nocardiopsis_dassonvillei_DSM_43111.fasta
                        -rw-rw-r-- 1 bushnell bushnell    81824 Feb 26 11:43 o_Clostridium_perfringensATCC_13124.sam
                        -rw-rw-r-- 1 bushnell bushnell   125594 Feb 26 11:43 o_Desulfotomaculum_gibsoniae_DSM_7213.sam
                        -rw-rw-r-- 1 bushnell bushnell    81422 Feb 26 11:43 o_Hirschia_baltica_ATCC_49814.sam
                        -rw-rw-r-- 1 bushnell bushnell   181854 Feb 26 11:43 o_Nocardiopsis_dassonvillei_DSM_43111.sam
                        -rw-rw-r-- 1 bushnell bushnell   352650 Feb 26 11:42 reads.fq
                        drwxrwsr-x 4 bushnell bushnell      512 Feb 26 10:24 ref
                        When you say "without success", what exactly happens? And by the way, to catch unmapped reads, you need to set "outu1" and "outu2", not "out1" and "out2".

                        As for your other question - you can do quality-trimming before or after, but error-correction should be done after. Generally, I would do quality-trimming before if you use BBSplit's default settings, which require high identity, or after if you reduce the identity threshold to, say, minid=0.75 (which is closer to BBMap's default). But error-correction is best done after, on each file individually, so that homologous areas shared between the genomes don't affect each other.

                        Comment

                        • vingomez
                          Member
                          • Sep 2014
                          • 18

                          #13
                          Hi Brian,


                          Sorry for the lack of information. When I said "without success": the program stop and the prompt appear (see below - last 9 lines) . Again only when I used 4 ref files. I used the same commands listed in the previous post (with and without the ambig2=split flag) and used the flags outu1 and outu2 (as you suggested) with the same outcome. Maybe a JAVA conflict?



                          HTML Code:
                          Loaded Reference:       0.010 seconds.
                          Loading index for chunk 1-0, build 1
                          Generated Index:        0.039 seconds.
                          Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
                                  at align2.BBIndex.analyzeIndex(BBIndex.java:115)
                                  at align2.BBMap.loadIndex(BBMap.java:405)
                                  at align2.BBMap.main(BBMap.java:32)
                                  at align2.BBSplitter.main(BBSplitter.java:45)
                          sol1:/work/MP/bbmap%
                          Thanks
                          Vicente
                          Last edited by vingomez; 02-26-2015, 01:22 PM.

                          Comment

                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            #14
                            Hi Vicente,

                            There appears to be something wrong with the index, or possibly fasta files. Would it be possible for you to zip the references together and email them to me so I can try to replicate this?

                            -Brian

                            Comment

                            • vingomez
                              Member
                              • Sep 2014
                              • 18

                              #15
                              Hi Brian,


                              Originally posted by Brian Bushnell View Post
                              Hi Vicente,

                              There appears to be something wrong with the index, or possibly fasta files. Would it be possible for you to zip the references together and email them to me so I can try to replicate this?

                              -Brian

                              Definitely, the index files were corrupted. Just deleted the ref directory, repeat index and map - no problem.

                              Thanks again

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              12 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...