Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Introducing BBMerge: A paired-end read merger

    I'd like to introduce a new read-pairing tool, BBMerge, and present a comparison with some similar tools, FLASH, PEAR, COPE, and fastq-join. BBMerge has actually existed since Aug. 2012 but I only recently got around to benchmarking it.

    First, what does BBMerge do? Illumina paired reads can be intentionally generated with an insert size (length of template molecule) that is shorter than the sum of the lengths of read 1 and read 2. For example, at JGI, the bulk of our data is 2x150bp fragment libraries designed to have an average insert size of 270bp. Because they overlap, the two reads to be merged into a single, longer read. Single reads are easier to assemble, and longer reads give a better assembly; furthermore, the merging process can detect or correct errors where the reads overlap, yielding a lower error rate than the raw reads. But if the reads are merged incorrectly – yielding a single read with a different length than the actual insert size of the original molecule – errors will be added. So for this process to be useful, it is crucial that reads be merged conservatively enough that the advantages outweigh the disadvantages from incorrect merges.

    On to the comparison. I generated 4 million synthetic pairs of 2x150bp reads from E.coli, with errors mimicking the Illumina quality profile. The insert size mean was 270bp and the standard deviation was 30bp, with the min and max insert sizes capped at 3 standard deviations. The exact command was:

    randomreads.sh ref=ecoli_K12.fa reads=4000000 paired interleaved out=reads.fq.gz zl=6 len=150 mininsert=180 maxinsert=360 bell maxq=38 midq=28 minq=18 qvariance=16 ow nrate=0.05 maxns=2 maxnlen=1

    Then I renamed these reads with names indicating their insert size, e.g. “insert=191” for a pair with insert size 191bp:

    bbrename.sh in=reads.fq.gz int renamebyinsert out1=r1.fq out2=r2.fq

    Next I merged the reads with each program, varying the stringency along one parameter.

    The FLASH command varied the –x parameter from 0.01 (the min allowed) through 0.25 (the default):

    /path/FLASH -M 150 -x 0.25 r1.fq r2.fq 1>FLASH.log

    The BBMerge command used 5 different flags in terms of decreasing stringency, “strict”, “fast”, default, “loose”, “vloose”:

    java -da -Xmx400m -cp /path/ jgi.BBMerge in1=r1.fq in2=r2.fq out=x.fq minoi=150

    The PEAR command varied the p-value from 0.0001 to 1 (note that only 5 values were allowed by the program):

    pear -n 150 -p 1 -f r1.fq -r r2.fq -o outpear

    The COPE command varied the -c (minimum match ratio parameter) from 1.0 (no mismatches) to 0.75 (the default):

    ./cope -a /dev/shm/r1.fq -b /dev/shm/r2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/r1_unconnected.fq -3 /dev/shm/r2_unconnected.fq -m 0 -s 33 -c 1.0 -u 150 -N 0 1>cope.o 2>&1

    The fastq-join (from ea-utils) command line swept -p from 0 to 25 and -m from 4 to 12 parameters; only the -p sweep is shown, as it gave better results. Sample command:

    fastq-join r1.fq r2.fq -o fqj -p 0

    All programs (except fastq-merge, which does not support the option) were capped at not allowing insert sizes shorter than 150bp. This is because FLASH (which appears to be currently the most popular merger) does not seem to be capable of handling reads with insert sizes shorter than read length. If FLASH is forced to allow such overlaps (using the –O flag) then the results it produces from such reads are 100% wrong, resulting in performance that seems bad to the point of being hard to believe, which would undermine this analysis, so I limited the range of insert sizes to at minimum 180, which FLASH can handle. The output was graded by my grading script to determine the percent correct and incorrect:

    grademerge.sh in=x.fq

    Then the results were plotted in a ROC curve, attached. The X-axis is logarithmic in one figure and linear in the other.




    FLASH achieves a slightly high merging rate at its default sensitivity, but does so at the expense of an extremely high false positive rate. BBMerge has a systematically lower false positive rate than FLASH at any sensitivity. fastq-join yields the highest merging rate, at the cost of the highest false-positive rate, second only to PEAR. COPE is very competitive with FLASH and strictly exceeds it with "N=0"; unfortunately it is unstable at that setting. COPE is stable at "N=1" (the default) but is generally inferior to FLASH at that setting. PEAR appears to be inferior to all other options across the board.

    Next, I tested the speed. Since BBMerge, FLASH, and PEAR are multithreaded, I tested with various numbers of threads to determine peak performance. The tests were run on a 4x8-core Sandy Bridge E node with no other processes running; these had hyperthreading enabled, for 32 cores and 64 threads. Input came from a ramdisk and was sent to a ramdisk, so I/O bandwidth was not limiting. The results attached to this post indicate what happens when input was from local disk and output went to a networked filesystem, but that was I/O limited; the image embedded below reflects the ramdisk run.



    FLASH has a slight advantage with a low number of threads, but BBMerge ultimately scales better to reach a higher peak of 235 megabasepairs/s (Mbp/s) versus FLASH’s 160 Mbp/s. BBMerge’s peak came at 20 threads and FLASH’s at 36 threads. 235 Mbp/s in this case equates to 499 megabytes per second and BBMerge was probably input-limited (by CPU usage of the input thread) at that point. PEAR was very slow so I did not run the full suite of speed testing on it, but it achieved 11.8 Mbp/s at 16 threads (on slightly different hardware). The single-threaded programs (COPE and fastq-join) beat PEAR up to around 16 threads, but never touched BBMerge or FLASH. Interestingly, PEAR is the only program that scales well with hyperthreading; I speculate that the program does a lot of floating-point operations.

    In conclusion, for this dataset, BBMerge had substantially better specificity at similar sensitivity to all competitors. fastq-join had the highest merge rate, by a slight margin, but at the cost of a false positive rate that I would consider unusable in a production environment. The speeds of BBMerge and FLASH are similar per thread, with FLASH being around 10% higher; but BBMerge appears to be more scalable, and as such exceeded 150% of FLASH's peak throughput. So, for this dataset, unless false positives are irrelevant, BBMerge wins.

    It appears imprudent to use FLASH (or most other mergers) with the default sensitivity, due to the huge increase in false positives for virtually no gain in true positives. On the ROC curve, FLASH's rightmost point is the default “-x 0.25”, while the third from the right is “-x 0.15”, which gives a much better signal-to-noise ratio of 33.3dB and 71.1% merged compared to the default 28.8dB and 73.8% merged. If you desire the highest merge rate regardless of false-positives, fastq-join appears optimal. For libraries that have all N-containing reads removed, it may be that COPE is slightly better than FLASH, though I have not verified this. If false-positives are detrimental to your project, I recommend BBMerge on default sensitivity, which gave a 42.7dB SNR and 55.8% merge rate for this data. The merge rate can be increased using the "loose" or "vloose" flags, at the expense of more false-positives, but I believe that for most projects, minimizing noise is more important than maximizing signal.

    BBMerge is available with the BBTools package here:

    Download BBMap for free. BBMap short read aligner, and other bioinformatic tools. This package includes BBMap, a short read aligner, as well as various other bioinformatic tools. It is written in pure Java, can run on any platform, and has no dependencies other than Java being installed (compiled for Java 6 and higher).


    On Linux systems you can simply run the shellscript, like this:

    bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq

    BBMerge supports interleaved reads, ASCII-33 or 64, gzip, fasta, scarf, and various other file formats. It also makes handy insert-size distribution histograms. Everything in the BBTools package is compatible with any OS that supports Java - Linux, Windows, Mac, even cell phones. The shellscripts may only work in Linux, but the actual programs will run anywhere. For more details, run the shellscript with no arguments, or post a question here.

    EDIT: Results updated March 25, 2015, on a poster:
    Attached Files
    Last edited by Brian Bushnell; 03-25-2015, 03:38 PM. Reason: Updated embedded images

  • #2
    Brian: You have put a single help document together for all the programs in BBMap. I am not sure you have done that.

    I am losing track of all the programs that are now part of BBMap suite

    Comment


    • #3
      I have updated the speed results by moving the input and output files to a ramdisk to eliminate I/O bottlenecks, and finishing all datapoints for PEAR. Results attached. Also, if anyone has a different program they use for read merging that they want added to the roundup, please post it here before my test data is purged from /scratch in ~6 weeks.

      Originally posted by GenoMax View Post
      Brian: You have put a single help document together for all the programs in BBMap. I am not sure you have done that.

      I am losing track of all the programs that are now part of BBMap suite
      I am too... I definitely need to put together such a document! Hopefully soon. I have an internal document describing them but it's not totally relevant outside our cluster.
      Attached Files

      Comment


      • #4
        Ive been using it over the last couple of weeks and it has been really quick. Im actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding in some options for adapter matching would be handy (although I do know that you already have a trimming script in the bbtools suite)

        Comment


        • #5
          You may want to add COPE to your tests since it is published. Also, there are a number of other unpublished (I think) programs on this blog post, but I don't think it's necessary to test every script out there. I mention COPE because it works with Fasta, whereas the other tools require Fastq, so that is a major convenience that this tool provides.

          There are a couple of caveats with COPE that I can share. 1) The paired reads have to be in the same order in the forward and reverse files. The program will run just fine if they are not, but the number of reads merged will be very low (this would be a nice thing to check for). 2) There is an option for setting the overlap mode and this has to be given or the program will fail silently ("-m 0" is the simple overlap mode).

          Comment


          • #6
            Originally posted by jimmybee View Post
            Ive been using it over the last couple of weeks and it has been really quick. Im actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding in some options for adapter matching would be handy (although I do know that you already have a trimming script in the bbtools suite)
            jimmybee,

            FYI, the latest version of BBMerge by default rejects insert sizes shorter than 35 and does not look for insert sizes shorter than 25, so if you have super-short insert libraries, you might want to reduce those with these flags:
            mininsert=12 minoi=12

            "mininsert" will reject anything shorter than that value as "too short", while "minoi" (minoverlapinsert) will tell the program to not look for inserts shorter than that. The difference is subtle. Sorry, the default values listed for them in the shellscript are wrong; I will fix that with the next release. Of course for most purposes you don't ever see insert sizes that short but you can in certain experiments involving tiny RNAs.

            I have no plans to explicitly add adapter matching to BBMerge, but it sounds like it may be useful to add a mode that trims reads with insert sizes shorter than read length rather than merging them. Also, it would be fairly easy to add overlap-detection to BBDuk for greater adapter-trimming sensitivity. Hmmm, I'll probably do both.

            @SES,

            For some reason I can't get COPE to work:

            /dev/shm$ ./cope -a /dev/shm/sr1.fq -b /dev/shm/sr2.fq -o /dev/shm/cmerged.fq -m 0 -s 33
            Program start..
            Program: COPE (Connecting Overlapped Pair-End reads)
            Version: v1.1.3
            Author: BGI-ShenZhen
            CompileDate: Jun 6 2014 time: 10:30:24
            Current time: Fri Jun 6 10:46:56 2014
            Command line: ./cope -a /dev/shm/sr1.fq -b /dev/shm/sr2.fq -o /dev/shm/cmerged.fq -m 0 -s 33
            Error: input file is empty, please check the set of input file!
            Other programs can see the files fine. I moved them to various places and tried both absolute and relative paths. Any ideas?

            Comment


            • #7
              Originally posted by Brian Bushnell View Post

              @SES,

              For some reason I can't get COPE to work:


              Other programs can see the files fine. I moved them to various places and tried both absolute and relative paths. Any ideas?
              Try specifying the files for unconnected pairs. I would try:

              Code:
              ./cope -a /dev/shm/sr1.fq -b /dev/shm/sr2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/sr1_unconnected.fq -3 /dev/shm/sr2_unconnected.fq -m 0 -s 33
              I know that solution doesn't logicially follow from the error message, but from what I remember, you will get either no messages on failure with this program or nondescript messages. If that doesn't work, I'd check the input again, and then try without the "-s 33" as a last suggestion. It could be the input, but I'm guessing it's just the combination of options, or lack of, that is causing the issue.

              Comment


              • #8
                Thanks SES, that solved it. I guess all outputs must be specified. I've attached the new graphs and updated the inline images in the first post.

                COPE performed much better when I set the flag "-N 0" because 5% of my synthetic reads contain an N, so I used that flag, but it was unstable and seg-faulted at and value of -c under 0.92. I included both settings, "-N 1" (default) and "-N 0". Overall it was very close to FLASH when I swept through the -c parameter (match ratio cutoff) from 1.0 to 0.75 (the default) - slightly better with -N 1 (but unstable) and slightly worse with -N 0 (but stable). There are also -d and -B parameters that probably impact the merge rate, which I left at the default, so as with most of the other programs, it's still possible that better results could be achieved. Sample command:

                ./cope -a /dev/shm/r1.fq -b /dev/shm/r2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/r1_unconnected.fq -3 /dev/shm/r2_unconnected.fq -m 0 -s 33 -c 1.0 -u 150 -N 0 1>cope.o 2>&1

                I also added fastq-join (from ea-utils), which was generally outperformed by FLASH. I swept through the -p (from 0 to 25) and -m (from 4 to 12) parameters; only the -p sweep is shown, as it gave better results. Sample command:

                fastq-join r1.fq r2.fq -o fqj -p 0

                These were similar in speed - both were much faster than PEAR and slightly slower than BBMerge or FLASH, when those three were limited to a single thread. fastq-join is slightly faster than COPE.
                Attached Files
                Last edited by Brian Bushnell; 06-06-2014, 04:26 PM.

                Comment


                • #9
                  Is there a list of parameters that I can use? For example, insert size and cpu threads. I tried to assembled a set of ~24 million pair-end reads and found that only 44.59% were assembled. Is there a way to increase (e.g. set loose)? Also, the input reads were 100bp in length and 10th percentile showed size 74. Why is that? By the way, how can the insert range as high as 188?

                  Reads: 24316810
                  Joined: 10843877 44.594%
                  Ambiguous: 62051 0.255%
                  No Solution: 13397906 55.097%
                  Too Short: 12976 0.053%
                  Avg Insert: 132.7

                  Insert range: 26 - 188
                  90th percentile: 178
                  50th percentile: 140
                  10th percentile: 74

                  Bests,
                  Woody
                  Last edited by woodydon; 06-17-2014, 12:24 AM.

                  Comment


                  • #10
                    Woody,

                    If you run the shellscript bbmerge.sh with no parameters, it will display the help information. To answer your specific questions:

                    By default it uses all available threads, but you can override that with the "t=" flag. You can get a higher merge rate at the expense of more false positives with the "loose" or "vloose" (very loose) flag. It's also possible to more finely tune it with other flags like "minoverlap", "margin", "entropy", and "mismatches", but those are more complicated to tweak. You can use the "mininsert=" flag to ban merges with an insert size shorter than that value.

                    2x100bp reads can have a 74bp insert or a 188bp insert size. 74bp insert means that the molecule being sequenced was shorter than read length, and as a result the data collection continued off the end of the genomic sequence and into the adapter. So, before merging, the reads each contained 26bp of junk. And a 188bp insert size means that the reads overlapped by 12 base pairs. BBMerge does not look for overlaps shorter than 12bp in the default mode; the shorter the overlap, the more likely that it's a false positive.

                    The higher the error rate of reads, the fewer successful joins there will be. If you have sufficient coverage (at least 30x average) you can try error-correcting the reads first with my error correction tool; that should increase the merge rate.

                    ecc.sh in1=r1.fq in2=r2.fq out1=corrected1.fq out2=corrected2.fq

                    -Brian

                    Comment


                    • #11
                      Originally posted by jimmybee View Post
                      Ive been using it over the last couple of weeks and it has been really quick. Im actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding in some options for adapter matching would be handy (although I do know that you already have a trimming script in the bbtools suite)
                      jimmybee,

                      I added some relevant flags and modes, in version 32.32. Thanks for the suggestions!

                      First, BBMerge now has a new flag, "tbo" (trimbyoverlap). Rather than joining reads, it will just trim the adapter sequence off the ends of reads that have an apparent insert size shorter than read length.

                      Second, I added the ability to overlap to BBDuk also, with the same flag, "tbo". So if you are trimming adapters fragment libraries with the normal read orientation, you can trim both by specifying an adapter sequence file and by overlap at the same time, which drastically reduces the false negative rate (reads that did not have adapters completely removed) by around a factor of 12. I highly recommend using both for fragment libraries, as the combination outperforms either one alone:

                      bbduk.sh in=reads.fq out=trimmed.fq tbo ktrim=r ref=truseq.fa k=25 mink=1 hdist=1

                      The 'tbo' flag is NOT recommended for long mate pair libraries which may have the adapters in totally different places and are not expected to overlap, which is why it is disabled by default.

                      -Brian

                      Comment


                      • #12
                        Originally posted by Brian Bushnell View Post
                        Woody,

                        2x100bp reads can have a 74bp insert or a 188bp insert size. 74bp insert means that the molecule being sequenced was shorter than read length, and as a result the data collection continued off the end of the genomic sequence and into the adapter. So, before merging, the reads each contained 26bp of junk. And a 188bp insert size means that the reads overlapped by 12 base pairs. BBMerge does not look for overlaps shorter than 12bp in the default mode; the shorter the overlap, the more likely that it's a false positive.

                        -Brian
                        Thanks for the reply. It turns out that I confused your "insert size" with "inner distance" of pair-end reads. Since the ~25 million reads were exome-seq data, 44.594% assemble rate is a bit lower than what I expected. If the assemble is perfect, it means 50% of the fragments were actually longer than 188. I hope you could improve those fragments whose "insert size" is longer than 188 in the future.

                        Comment


                        • #13
                          Originally posted by woodydon View Post
                          Thanks for the reply. It turns out that I confused your "insert size" with "inner distance" of pair-end reads. Since the ~25 million reads were exome-seq data, 44.594% assemble rate is a bit lower than what I expected. If the assemble is perfect, it means 50% of the fragments were actually longer than 188. I hope you could improve those fragments whose "insert size" is longer than 188 in the future.
                          Woody,

                          Did you try error-correcting and using the 'vloose' setting? Both of those can substantially improve the merge rate.

                          If you want to see what the actual, unbiased insert size distribution of your data, then you can generate a histogram by mapping:

                          bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fa ihist=ihist.txt nodisk

                          Then plot a graph of 'ihist.txt', so you can see how many reads should be expected to assemble.

                          Comment


                          • #14
                            Originally posted by Brian Bushnell View Post
                            Woody,

                            Did you try error-correcting and using the 'vloose' setting? Both of those can substantially improve the merge rate.

                            If you want to see what the actual, unbiased insert size distribution of your data, then you can generate a histogram by mapping:

                            bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fa ihist=ihist.txt nodisk

                            Then plot a graph of 'ihist.txt', so you can see how many reads should be expected to assemble.
                            Great. That's an neat option.

                            Comment


                            • #15
                              Hi Brain,

                              I am using a computing cluster to run BBMerge and found the following error:

                              java -ea -Xmx48g -cp /home/xxx/tools/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx48g in1=/home/xxx/reads/n_R1.fastq in2=/home/xxx/reads/n_R2.fastq ref=/home/xxx/ref/hg19c/genome_hg19.fa ihist=/home/xxx/bbmap/n_ihist.txt out=/home/xxx/bbmap/n_merged.fq
                              /home/xxx/tools/bbmap/bbmap.sh: line 176: java: command not found
                              my command was:
                              /home/xxx/tools/bbmap/bbmap.sh -Xmx48g threads=10 in1=/home/xxx/reads/n_R1.fastq in2=/home/xxx/reads/n_R2.fastq ref=/home/xxx/ref/hg19c/genome_hg19.fa ihist=/home/xxx/bbmap/n_ihist.txt out=/home/xxx/bbmap/n_merged.fq

                              Thanks,
                              Woody
                              Last edited by woodydon; 06-22-2014, 09:36 PM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              71 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              80 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X