Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Hi Brian,

    My apologies is I missed this, does this tool allow for stitching PE reads without an insert?

    Comment


    • #32
      I don't understand what you mean by "without an insert". BBMerge will stitch together PE reads that overlap only. However, there's another program in BBTools with the shellscript "bbmergegapped.sh" that can join reads that are nonoverlapping, if you have sufficient coverage, and if the insert size is not too big. It uses more memory, though. For example:

      bbmergegapped.sh -Xmx31g in=reads.fq out=merged.fq gap=50

      This will join reads that are overlapping, and nonoverlapping with a gap in the middle of up to a little under 50bp. So, for example, with 2x150bp reads, bbmerge with the default settings will join reads with an insert size of up to 287bp (13bp overlap), while bbmergegapped with "gap=50" will join reads with an insert size of up to 341bp (41bp gap). But the nonoverlapping reads will only be joined if there is sufficient coverage (at least 10x or so) in your input reads.

      It's also possible for BBMerge to join reads based on coordinates from mapping, though that's a little more complicated.

      Comment


      • #33
        JNI error in loose/vloose mode

        Hi Brian,

        I'm testing BBmerge for our pipelines and am scanning the arguments a bit. I've compiled the JNI C libraries and everything runs fine with the normal settings, however when I try the "loose" or "vloose" modes I get UnsatisfiedLink errors:

        Code:
        Exception in thread "Thread-8" java.lang.UnsatisfiedLinkError: jgi.BBMergeOverlapper.mateByOverlapJNI([B[B[B[B[IIIIIIII)I
                at jgi.BBMergeOverlapper.mateByOverlapJNI(Native Method)
        The same command line works fine when I skip "usejni=t", or with the strict, fast and normal modes (these work with/without JNI, loose/vloose only work without JNI). Did I overlook something in the documentation, or is this a real bug?

        Thanks,
        Samuel

        Comment


        • #34
          The "margin" cmd line argument

          I'm scanning the command line arguments for optimal merging of the reads I have. They are amplicons (16S) with a rather well-defined fragment size (440-470), and the libraries I merge are known to be very clean (few PCR by-products), so they should merge very well (> 95 % have the correct size, but sequencing quality on R2 varies according to the MiSeq V3 chemistry lot number). I notice a considerable difference when I modify the "margin" parameter (i.e. merge rate goes from 52 % to 98 % when I lower margin to zero).

          Could you give a clear description on the heuristics controlled by this parameter? The documentation give no real clue, unfortunately...

          I'm not afraid about ambiguous merges (length-wise) - these can be pretty well controlled by filtering on the final length. However, I'm more concerned about correctly eliminating sequencing errors from R2 (and I'd like to find parameters rather robust to varying quality from R2). I would prefer not to run any k-mer error correction beforehand, to avoid skewing the true sequence diversity on the lower k-mer spectrum (these are not single species shotgun sequences). In my test runs, the loose/vloose and trimming parameters have low impact on the merge rates (up to 10 %). The margin parameter, however, has an extreme impact.
          Last edited by sarvidsson; 01-07-2015, 05:15 AM. Reason: clarification on ambiguous merges

          Comment


          • #35
            Originally posted by sarvidsson View Post
            Hi Brian,

            I'm testing BBmerge for our pipelines and am scanning the arguments a bit. I've compiled the JNI C libraries and everything runs fine with the normal settings, however when I try the "loose" or "vloose" modes I get UnsatisfiedLink errors:

            Code:
            Exception in thread "Thread-8" java.lang.UnsatisfiedLinkError: jgi.BBMergeOverlapper.mateByOverlapJNI([B[B[B[B[IIIIIIII)I
                    at jgi.BBMergeOverlapper.mateByOverlapJNI(Native Method)
            The same command line works fine when I skip "usejni=t", or with the strict, fast and normal modes (these work with/without JNI, loose/vloose only work without JNI). Did I overlook something in the documentation, or is this a real bug?

            Thanks,
            Samuel
            Hi Samuel,

            That was indeed a bug; thanks for finding it! It's fixed now in v34.25, which I've uploaded to SourceForge.

            I'm scanning the command line arguments for optimal merging of the reads I have. They are amplicons (16S) with a rather well-defined fragment size (440-470), and the libraries I merge are known to be very clean (few PCR by-products), so they should merge very well (> 95 % have the correct size, but sequencing quality on R2 varies according to the MiSeq V3 chemistry lot number). I notice a considerable difference when I modify the "margin" parameter (i.e. merge rate goes from 52 % to 98 % when I lower margin to zero).

            Could you give a clear description on the heuristics controlled by this parameter? The documentation give no real clue, unfortunately...

            I'm not afraid about ambiguous merges (length-wise) - these can be pretty well controlled by filtering on the final length. However, I'm more concerned about correctly eliminating sequencing errors from R2 (and I'd like to find parameters rather robust to varying quality from R2). I would prefer not to run any k-mer error correction beforehand, to avoid skewing the true sequence diversity on the lower k-mer spectrum (these are not single species shotgun sequences). In my test runs, the loose/vloose and trimming parameters have low impact on the merge rates (up to 10 %). The margin parameter, however, has an extreme impact.
            The way BBMerge works is to look at all possible overlaps (within the bounds specified by minoverlap0 [the number of overlapping bases for the longest possible insert size] and minoverlapinsert [the number of overlapping bases for the shortest possible insert size]). Those are confusingly named, so let me clarify - if minoverlap0=10 and minoverlapinsert=20, then the insert size range examined, for 100bp reads, will be from 190 (10bp overlap) down to 20 (20bp overlap). Within those possible overlaps, only the ones between minoverlap and mininsert will actually be considered as valid. So in this example, if minoverlap=15 and mininsert=25, then insert sizes from 190 to 20 will be tested, but only insert sizes from 185 to 25 will be potentially kept.

            So, an insert size of 192 would never even be considered; an insert size of 187 would be considered, but if it was determined to be the best overlap, it would be rejected as being too short for a high confidence merge; and an insert size of 182 would be considered, and if it was the best, accepted.

            Therefore, if you know your expected insert size range to be well-defined, you can increase the merge rate by restricting these values. For example, if you have 2x100bp reads and expect all of the insert sizes to range from 140 to 160, you could set "minoverlap0=20 minoverlap=30 minoverlapinsert=130 mininsert=120". That would restrict the insert range that is examined to 120-180, and the range that is potentially accepted to 130-170. The more you can restrict the range, the less likely that you will get spurious overlaps, and the higher the merge rate because there is less chance of ambiguity.

            However, note that the parameters specifically describe overlap lengths rather than insert sizes. That's because the due to trimming, the insert size and overlap length may no longer be correlated; and if overlap detection fails due to too many mismatches, the reads will be automatically quality-trimmed and overlapping will be re-attempted. You can disable that with the "trimonfailure" flag. So in other words, these parameters are only strictly correlated to insert size if all reads are the same length and all trimming is disabled. Still, if your pairs do not have really short overlaps (insert sizes near double read length) , you can increase both the merge rate and accuracy by increasing minoverlap and minoverlap0.

            Ambiguous overlaps - those in which the best and second-best overlap have a similar number of mismatches - are rejected. Whether or not an overlap is ambiguous is determined by the "margin" setting. If you set margin to 1, and the best overlap has to have at least 1 fewer mismatches than the second best. If you set margin to 0, then two overlaps can have the same number of mismatches, and the longer overlap will be considered best, but this will generate more false positives.

            You can empirically test the true and false positive merge rate using synthetic data, which is how I determined the default parameters; the actual optimal parameters, of course, depend on the insert size distribution, quality, read lengths, and sequence complexity of your input data.

            Comment


            • #36
              Brian, thanks for the detailed description, that helps.

              Originally posted by Brian Bushnell View Post
              Ambiguous overlaps - those in which the best and second-best overlap have a similar number of mismatches - are rejected. Whether or not an overlap is ambiguous is determined by the "margin" setting. If you set margin to 1, and the best overlap has to have at least 1 fewer mismatches than the second best. If you set margin to 0, then two overlaps can have the same number of mismatches, and the longer overlap will be considered best, but this will generate more false positives.
              Alright, so if I get that correctly, a read pair with two possible perfect match overlaps (both at least minoverlap0, and both giving fragments of at least minoverlapinsert), like the following:

              Code:
              1.
              ----------------->
                          ||||||
                          <-----------------------
              
              2.
              ----------------->
                 |||||||||||||||
                 <-----------------------
              Then, if margin=0, 2. would be selected, since this is "the longer overlap", right? 1. would give "the longest final fragment size" (I prefer to use the term "fragment", as "insert" is often used ambiguously). For my use, I would then try to control overlap with the minoverlap0 and minoverlapinsert parameters and leave the margin at 1 or 2. I can pre-trim the reads and discard too short ones quite easily. If some reads by chance end up merged as 1. above, these can be well handled downstream in the analysis pipeline.

              Originally posted by Brian Bushnell View Post
              You can empirically test the true and false positive merge rate using synthetic data, which is how I determined the default parameters; the actual optimal parameters, of course, depend on the insert size distribution, quality, read lengths, and sequence complexity of your input data.
              The problem with synthethic data is that it isn't easy to predict sequence complexity. I will try to use some data for which I have predictable outcome... The nice thing with amplicon data is that fragment size and read length are well defined parameters.

              Previously we've been using FLASh to merge read pairs, which seems to work fine for most of our applications, especially with HiSeq data. With long-overlap MiSeq data (2x300 bp reads of 150-500 bp fragments) we sometime have to use some irritating workarounds and filters to cope with some pecularities of FLASh (e.g. no dovetailing allowed) - and we've seen a bit high sequence complexity coming out of the amplicons, leading us to think that mismatching base pairs aren't handled correctly or not robustly. I'll run side-by-side comparisons with the BBmerge data to see whether it makes more sense.
              Last edited by sarvidsson; 01-08-2015, 01:35 AM. Reason: Corrected the example overlaps

              Comment


              • #37
                Another thing: would it be possible to "debug" ambiguous cases - e.g. getting the sequence identifiers and all suggested insert sizes would be of great use to find the best parameters. If there are some "hooks" i the arguments or source code, let me know...

                Comment


                • #38
                  Originally posted by sarvidsson View Post
                  Alright, so if I get that correctly, a read pair with two possible perfect match overlaps (both at least minoverlap0, and both giving fragments of at least minoverlapinsert), like the following:

                  Code:
                  1.
                  ----------------->
                              ||||||
                              <-----------------------
                  
                  2.
                  ----------------->
                     |||||||||||||||
                     <-----------------------
                  Then, if margin=0, 2. would be selected, since this is "the longer overlap", right? 1. would give "the longest final fragment size" (I prefer to use the term "fragment", as "insert" is often used ambiguously). For my use, I would then try to control overlap with the minoverlap0 and minoverlapinsert parameters and leave the margin at 1 or 2. I can pre-trim the reads and discard too short ones quite easily. If some reads by chance end up merged as 1. above, these can be well handled downstream in the analysis pipeline.
                  Actually, I was verifying the exact behavior when "margin=0", and found a bug that favored shorter overlaps. I fixed that now in the latest release (34.27). So yes, now it behaves as in your example. Additionally, there is an entropy test which you can disable (by setting "entropy=f") which will increase the minoverlap value for low-complexity reads. It's enabled by default because it increases accuracy, but it interfered with my testing of reads that had multiple valid overlaps because those were, of course, low complexity reads.

                  The problem with synthethic data is that it isn't easy to predict sequence complexity. I will try to use some data for which I have predictable outcome... The nice thing with amplicon data is that fragment size and read length are well defined parameters.

                  Previously we've been using FLASh to merge read pairs, which seems to work fine for most of our applications, especially with HiSeq data. With long-overlap MiSeq data (2x300 bp reads of 150-500 bp fragments) we sometime have to use some irritating workarounds and filters to cope with some pecularities of FLASh (e.g. no dovetailing allowed) - and we've seen a bit high sequence complexity coming out of the amplicons, leading us to think that mismatching base pairs aren't handled correctly or not robustly. I'll run side-by-side comparisons with the BBmerge data to see whether it makes more sense.
                  I would love to see the results of an independent comparison, if you do one!

                  Another thing: would it be possible to "debug" ambiguous cases - e.g. getting the sequence identifiers and all suggested insert sizes would be of great use to find the best parameters. If there are some "hooks" i the arguments or source code, let me know...
                  In v34.27 I also enabled the "verbose" flag. Just add it to the command line, along with "t=1" (to restrict threads to 1 so that the verbose output will be in order) and don't run it in jni mode. This will show every overlap and the scores that are generated for that overlap - it may help, especially if you look at the source code - bbmap/current/jgi/BBMergeOverlapper.java function mateByOverlapJava_unrolled().

                  Oh - I also documented the "mismatches" setting. That controls the maximum number of mismatches that are allowed in an overlap. Default is 3, but like "margin" and many other parameters, it gets changed if you set the "fast", "loose", etc flags. Setting any parameter explicitly will override the settings incurred by fast/loose/etc.
                  Last edited by Brian Bushnell; 01-09-2015, 11:37 AM.

                  Comment


                  • #39
                    Hi Brian,

                    I used BBmerge to merge my 2x300bp reads last month, but now I need to determine the size distribution of my merged reads. Is there a way to do this with BBmerge? I know how to determine the insert size of the PE reads before merging, but not after merging.

                    Thanks,
                    Marisa

                    Comment


                    • #40
                      Hi Marisa,

                      BBMerge will only calculate the insert size distribution for paired reads, but the insert size of the merged reads will be equal to their actual length. So you can calculate the distribution after merging like this:

                      readlength.sh in=merged.fastq out=hist.txt

                      -Brian

                      Comment


                      • #41
                        Originally posted by Brian Bushnell View Post
                        Hi Marisa,

                        BBMerge will only calculate the insert size distribution for paired reads, but the insert size of the merged reads will be equal to their actual length. So you can calculate the distribution after merging like this:

                        readlength.sh in=merged.fastq out=hist.txt

                        -Brian

                        Thanks Brian!

                        Comment


                        • #42
                          A colleague is attempting to detect low-frequency (<0.1%) mutations in plasmid amplicons by PE sequencing and error correction in the overlap. I thought BBMerge might be suitable for this application, and have a couple of questions:

                          1) how does BBMerge treat base-call discrepancies in the overlap?
                          2) for discrepancies, is there a method to impose the reference base on the merge, assuming one of the two reads matches the reference? I suppose this could be handled at the alignment or consensus-building step, if the discrepancies are flagged in some way.

                          Thanks Brian,
                          Harold

                          Comment


                          • #43
                            Hi Harold,

                            BBMerge will reject overlaps that have too many mismatches; by default that number is 3, though you can override it. When there is a mismatch, the base chosen is the one with the higher quality value, or N if they are equal. The resulting quality value of overlapping bases b1 and b2 with qualities q1 and q2:

                            if b1 equals b2:
                            q=max(q1, q2)+min(q1, q2)/4
                            (capped at 41 by default, though this can be overridden)

                            if b1 does not equal b2:
                            q=max(q1, q2)-min(q1, q2)

                            As a result, the quality value will be very low if the bases did not match, so any resulting erroneous variant calls would also be very low quality, and thus should be easy to filter.

                            There is currently no provision for using a reference base where the reads don't match, as BBMerge does not take the reference into account.

                            Comment


                            • #44
                              Originally posted by Brian Bushnell View Post
                              Hi Harold,
                              When there is a mismatch, the base chosen is the one with the higher quality value, or N if they are equal. The resulting quality value of overlapping bases b1 and b2 with qualities q1 and q2:

                              if b1 equals b2:
                              q=max(q1, q2)+min(q1, q2)/4
                              (capped at 41 by default, though this can be overridden)
                              Hi Brian,
                              That is interesting. Any reason you chose to add only 1/4th the quality value of one of the pair? The old phrap method was to just add the quality values together if bases from different strands agreed.

                              That seemed reasonable enough -- if chance of one basecall being wrong was 1 in 100 (QV 20) but the basecall from the other strand agreed and also was QV 20, then the chance they are both wrong seems like it should be 1 in 10,000 (QV 40). Right?

                              --
                              Phillip

                              Comment


                              • #45
                                Hi Brian,

                                Thanks for the prompt response. Given the lower q scores of disconcordant basecalls, I can easily filter those from consideration.

                                Take care,
                                Harold

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                71 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                80 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X