  • tshalev
    Junior Member
    • Dec 2016
    • 4

    #91
    @Brian Bushnell

    Ah OK, I see. So it "expects" to not find the adapter sequence there, since it has hopefully been removed by BBDuk. Slightly unrelated, I am using RNA-Seq data for a coniferous tree species, and am assembling using conventional assemblers such as Trinity, Velvet-Oases, etc. BBMerge is appropriate for this purpose, right? I keep noticing a lot of threads talking about 16S data, or amplicon data, and I haven't even heard of the assemblers that you mentioned.

    Thanks!

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #92
      The primary reason people use read merging is for 16S or other amplicon analyses, I believe. But I don't personally work with 16S very often, so BBMerge is designed and optimized for improving assemblies. Of course, it works on 16S as well, but I use it to optimize assembly pipelines. That said, I have never used Trinity, so I don't know how it would affect a Trinity assembly. As long as you assemble with both the merged and unmerged reads, most assemblers benefit from BBMerge (some quite a lot) so I would expect it to improve a Trinity assembly, but I'd be interested to hear what you experience, if you have the time and interest to run Trinity both ways. Of course RNA-seq assembly quality is especially hard to measure, but metrics like mapping rate, N50, and size are at least somewhat useful.

      What kind of assembly are you doing, how long are your reads, and what organism? Is it just RNA-seq?

      • tshalev
        Junior Member
        • Dec 2016
        • 4

        #93
        @Brian Bushnell

        I am working with foliage tissue from a species of coniferous tree. I'm using 100bp reads from an Illumina HiSeq 4000. I did actually do the comparison tests about a year ago for merging vs. not merging, on some different data that I had. For those, though, I trimmed first using Trimmomatic and did not use kmer information or adapter recognition (I'm not sure whether these were implemented in BBMerge back then).

        My overall consensus was that merging and then assembling with both merged and unmerged reads produced better assemblies than not merging, over four different assemblers (Trinity, Velvet+Oases, SOAPdenovoTrans and transABySS). This was gauged using the optimized assembly score from Transrate, as well as by assembly completeness as measured by BUSCO and contiguity as measured by Conditional Reciprocal Best BLAST (from Transrate) against gene sets of other conifer species. In all cases the gains were enough to warrant the use of merging.

        I'm interested to see now how using some of these new features will affect my assembly. I already see an increase in merging rate from ~57% to ~83% using the verystrict parameter, although I won't know whether this includes false positives until I assemble. Regarding adapters expected versus adapters found, I'm seeing ~430000 adapters expected versus ~6000 adapters found in ~91.5 million reads after adapter trimming, so I guess this is good?

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #94
          Originally posted by tshalev:
          "My overall consensus was that merging and then assembling with both merged and unmerged reads produced better assemblies than not merging, over four different assemblers (Trinity, Velvet+Oases, SOAPdenovoTrans and transABySS). This was gauged using the optimized assembly score from Transrate, as well as by assembly completeness as measured by BUSCO and contiguity as measured by Conditional Reciprocal Best BLAST (from Transrate) against gene sets of other conifer species. In all cases the gains were enough to warrant the use of merging."
          Great, thanks for that info!

          "I'm interested to see now how using some of these new features will affect my assembly. I already see an increase in merging rate from ~57% to ~83% using the verystrict parameter, although I won't know whether this includes false positives until I assemble."
          OK, please let me know the results - it's useful for giving people guidance on when to use the rem flag. I've never tried it in conjunction with RNA-seq data, just isolates, metagenomes, and single-cell, though it improved all of those cases.

          "Regarding adapters expected versus adapters found, I'm seeing ~430000 adapters expected versus ~6000 adapters found in ~91.5 million reads after adapter trimming, so I guess this is good?"
          That indicates the adapter trimming was fairly complete. What version of BBMap are you using, by the way?
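
As a rough check on the numbers quoted above, the ratio of adapters found to adapters expected can be worked out directly. This is only back-of-the-envelope arithmetic on the approximate figures from the post; the variable names are made up for illustration and are not BBMerge output fields.

```python
# Approximate figures quoted in the post (each was given with a "~").
adapters_expected = 430_000
adapters_found = 6_000
total_reads = 91_500_000

# Share of the expected adapters that were still detected after trimming.
found_fraction = adapters_found / adapters_expected

# Share of all reads in which an adapter was still detected.
found_per_read = adapters_found / total_reads

print(f"found / expected: {found_fraction:.1%}")
print(f"found / all reads: {found_per_read:.4%}")
```

A ratio on the order of one percent is consistent with the reply that trimming was fairly complete.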

          • tshalev
            Junior Member
            • Dec 2016
            • 4

            #95
            The latest version, release 36_62. I'll keep you posted.

            • j.m.c
              Junior Member
              • Dec 2016
              • 2

              #96
              Thank you for your reply.

              Yes, my reads were 87 bp after trimming with Trimmomatic. I had also removed adapter sequences with Trimmomatic, and now I think I see the issue, if I understood correctly what you said:

              "The 35bp reads you ended up with are because of the short insert. When you have 2x87bp reads with a 35bp insert, you get 35bp of overlap on the 3' end and then 52bp of the 5' end overhanging on each side; that's adapter sequence. BBMerge trims that off so you are left with only the 35bp of genomic sequence."
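
The arithmetic in the quoted explanation can be sketched as follows. This is a minimal illustration assuming both reads have the same length; the function names are hypothetical and are not part of BBMerge.

```python
def overlap_len(read_len: int, insert_len: int) -> int:
    """Bases covered by both reads of a pair (0 if the reads never meet)."""
    if insert_len <= read_len:
        # Short insert: each read covers the whole insert.
        return insert_len
    return max(0, 2 * read_len - insert_len)


def readthrough_len(read_len: int, insert_len: int) -> int:
    """Bases per read that extend past the insert (adapter sequence)."""
    return max(0, read_len - insert_len)


# The case from the quote: 2x87bp reads with a 35bp insert.
print(overlap_len(87, 35))      # 35
print(readthrough_len(87, 35))  # 52
```

When the insert is longer than the read, `readthrough_len` is 0 and nothing should be removed as adapter.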

              That means the overhangs are removed since BBMerge thinks they are adapter sequences. My reads are from RNA-seq data (not genomic data, I am sorry I didn't specify earlier) and since I removed adapter sequences with Trimmomatic, I am actually losing data if the 5' overhangs were trimmed off...

              Is there any way to prevent that with BBMerge?

              Otherwise I will try BBMerge with my raw reads without removing adapters.

              Thanks!

              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

                #97
                If you know your adapter sequences (or have a list of typical adapter sequences, or actually, you can just say "adapter=default"), you can do this:

                Code:
                bbmerge.sh in=reads.fq adapter=adapter.fa out=merged.fq outu=unmerged.fq
                If BBMerge thinks that you still have untrimmed adapters in those cases... I am quite confident it is correct. Adapter-trimming programs are not perfect (nor is BBMerge or BBDuk). I recommend BBDuk for adapter-trimming because it uses both adapter sequences and overlap information (very conservatively), but you will still end up with some untrimmed reads that actually had adapters. The problem is that Illumina sequence quality declines with each cycle, so by the end of the read (the part that typically overlaps, or has adapter sequence) the error rate can be pretty high. If you use an adapter-trimming program that solely uses sequence-matching to a list of provided adapter sequences, then the high mismatch rate will yield poor adapter-trimming for low-quality reads. BBDuk with the "tbo" flag uses both adapter sequences and overlap information, which for short-insert reads, gives added weight to the high-quality initial bases in a read pair.

                So - it's not surprising that Trimmomatic did not do complete trimming. I recommend you use BBDuk instead. It still won't give perfect adapter-trimming, but it will be much better than Trimmomatic.

                • peerah
                  Junior Member
                  • Jan 2015
                  • 6

                  #98
                  Hi Brian! I have a question: I am working on a fungal ITS metagenomic amplicon library with a pretty wide variation in sizes (200-500 bp). We are doing 2x300, and my second reads are a little lower in quality than the first reads. Is there any BBMerge setting that I should modify in order to get the most out of the data? I'm pretty new to the field, so please let me know if you need more information! Thank you.

                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    #99
                    Hi! With that range you should have at worst a 100bp overlap (2x300bp reads spanning a 500bp amplicon), which is plenty. But 2x300 MiSeq runs have had major quality problems in the past, so it's possible that trimming would help. I'd suggest adding the flags "qtrim2 trimq=10,15". This will first try to merge the reads, and if unsuccessful (because the quality was too low so there were too many mismatches) quality-trim to Q10 on the right side and retry; then if still unsuccessful do the same at Q15. This isn't necessary unless the data is quite bad, but it will generally increase your merge rate, and is better than simply quality-trimming all reads prior to merging.
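
The merge, trim, retry order described for "qtrim2 trimq=10,15" can be sketched in Python. This is only an illustration of the control flow, not BBMerge's code: the trimming rule is simplified to dropping trailing low-quality bases, reads are reduced to their quality scores, and `toy_try_merge` is a made-up stand-in for the real overlap search.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[List[int], List[int]]  # toy pair: quality scores of read 1 and read 2


def trim_right(quals: List[int], threshold: int) -> List[int]:
    """Drop trailing bases whose quality is below the threshold (simplified)."""
    keep = len(quals)
    while keep > 0 and quals[keep - 1] < threshold:
        keep -= 1
    return quals[:keep]


def merge_with_retrim(
    pair: Pair,
    try_merge: Callable[[Pair], Optional[str]],
    thresholds: Sequence[int] = (10, 15),
) -> Tuple[Optional[str], Optional[int]]:
    """Try the untrimmed pair first; on failure, trim at each threshold and retry.

    Returns (result, threshold_used). threshold_used is None when the first
    attempt succeeded or when every attempt failed.
    """
    result = try_merge(pair)
    if result is not None:
        return result, None
    for q in thresholds:
        trimmed = (trim_right(pair[0], q), trim_right(pair[1], q))
        result = try_merge(trimmed)
        if result is not None:
            return result, q
    return None, None


def toy_try_merge(pair: Pair) -> Optional[str]:
    """Stand-in for an overlap search: succeed only if no base is below Q12."""
    if any(len(read) == 0 for read in pair):
        return None
    if all(q >= 12 for read in pair for q in read):
        return "merged"
    return None


good = ([30, 30, 30], [30, 30, 25])
poor = ([30, 30, 11], [30, 28, 8])
print(merge_with_retrim(good, toy_try_merge))  # ('merged', None)
print(merge_with_retrim(poor, toy_try_merge))  # ('merged', 15)
```

Pairs that merge on the first attempt are never trimmed, which is the advantage over quality-trimming every read before merging.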

                    • mdavrandi
                      Junior Member
                      • Apr 2017
                      • 1

                      Hi Peerah,

                      We are having the same problem in our lab with 2x300 MiSeq runs (very poor Read 2 >Q30 scores), and I was wondering if Brian's recommendation improved the number of paired sequences you obtained from that run.

                      Cheers

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        In case you missed it, this post has the first explanation for the poor read 2 scores.

                        • ashuchawla
                          Member
                          • Jan 2012
                          • 38

                          Confusion regarding read merging

                          Dear Brian, or anybody else who could help me,

                          I used the following command for BBMerge:
                          bbmerge.sh in=reads.fq out=merged.fq pfilter=1

                          I got these stats:
                          Pairs: 2545201
                          Joined: 1491688 58.61%
                          Ambiguous: 439613 17.27%
                          No Solution: 613393 24.10%
                          Too Short: 0 0.00%
                          Avg Insert: 322.6

                          My questions:
                          1. What happens to the bases during read merging if there is a mismatch outside of the 12 bases this command considers? As I understand it, the minimum number of overlapping bases to allow merging is 12. In other words, could you please explain exactly how the merge happens between two paired-end reads when I use the above command for a perfect overlap?

                          2. Could you please explain what "Ambiguous" and "No solution" mean?

                          Thank you so much,
                          Ashu

                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            Hi Ashu,

                            "Ambiguous" means there are multiple possible overlaps. For example, if read 1 and read 2 both end with "ACACACACACACACACACACAC", there are lots of possible overlap frames, none of which is particularly better than another. So, that would be ambiguous.

                            "No solution" means there is no overlap satisfying BBMerge's fairly strict criteria for the number of matching and mismatching bases in the best possible overlap frame.

                            If there is no frame in which the length, entropy (this determines the minimum necessary length), number of matching bases, and number of mismatching bases satisfy the cutoffs, the pair will not be merged and it will be declared "No solution". If there are multiple frames satisfying those cutoffs, and the second-best frame is sufficiently close to the best frame that it's really hard to tell which one is correct, the pair will not be merged and it will be declared "Ambiguous".

                            The pair will only be merged if there seems to be an unambiguously good solution.
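
The three outcomes can be illustrated with a toy classifier. This is a deliberately simplified sketch: it accepts only exact-match overlaps and calls any pair with more than one candidate frame ambiguous, whereas the description above also involves entropy, mismatch counts, and how close the second-best frame is to the best. The function names and sequences are made up, and read 2 is assumed to be already reverse-complemented onto read 1's strand.

```python
from typing import List


def perfect_overlap_frames(r1: str, r2_fwd: str, min_overlap: int = 12) -> List[int]:
    """Overlap lengths at which the end of r1 exactly matches the start of r2_fwd."""
    longest = min(len(r1), len(r2_fwd))
    return [n for n in range(min_overlap, longest + 1) if r1[-n:] == r2_fwd[:n]]


def classify(frames: List[int]) -> str:
    """Map the list of candidate frames to one of the three reported outcomes."""
    if not frames:
        return "No solution"
    if len(frames) > 1:
        return "Ambiguous"
    return "Joined"


# Repetitive ends: many frames fit equally well.
repeat_frames = perfect_overlap_frames("GGATTC" + "AC" * 11, "AC" * 11 + "TTGCA")
print(repeat_frames, classify(repeat_frames))

# A single 12bp overlap.
unique_frames = perfect_overlap_frames("TTGACCGTAGGCTAACGT", "GTAGGCTAACGTCCATG")
print(unique_frames, classify(unique_frames))

# Nothing in common.
none_frames = perfect_overlap_frames("TTGACCGTAGGCTAACGT", "C" * 16)
print(none_frames, classify(none_frames))
```

In the repetitive case every even overlap length from 12 to 22 matches perfectly, so there is no basis for preferring one frame over another.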

                            "minoverlap=12" means that reads will never be merged if the best overlap is shorter than 12 bp. pfilter=1 will prevent reads from merging if there are any mismatches (I don't particularly recommend this, but it might be useful in some situations...). pfilter means probability filter, and considers the base qualities, so a read with a mismatch on a Q2 base might pass while an otherwise identical read with a mismatch in a Q40 base might fail. BBMerge will still look for all possible overlaps, and if, say, you have a 30bp overlap with 1 mismatch and a 20bp overlap with 0 mismatches, that would still be declared ambiguous.
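
The reason a mismatch on a Q2 base is easier to tolerate than one on a Q40 base follows from the standard Phred definition, where a quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows only that relationship; how pfilter combines these probabilities is not spelled out in this thread, so no attempt is made to reproduce it.

```python
def phred_error_probability(q: int) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


for q in (2, 40):
    p = phred_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}")
```

A Q2 base is more likely wrong than right, so a mismatch there says little about whether the overlap is real, while a mismatch on a Q40 base is strong evidence against it.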

                            Incidentally! The BBMerge paper was accepted by PLOS ONE and will be published soon, so you can read all the algorithmic details there =) But I don't actually know the date it will be published, so feel free to ask me more questions in the meantime if I have not sufficiently clarified things.

                            • ashuchawla
                              Member
                              • Jan 2012
                              • 38

                              Thank you, Brian, for your reply. I have to merge paired-end reads from a MiSeq run (I quality trimmed them at Q30). The overlap is around 100bp according to the experimentalist. What options would you recommend to merge these reads? Once I have the merged reads, I will use dedup to get all unique merged reads and run further analysis on them.

                              Ashu

                              • GenoMax
                                Senior Member
                                • Feb 2008
                                • 7142

                                "I quality trimmed them at Q30"
                                That is overly strict. What type of dataset is this and do you have a reference genome available?
