**tshalev** · 12-15-2016, 02:12 PM

@Brian Bushnell

Ah OK, I see. So it "expects" to not find the adapter sequence there, since it has hopefully been removed by BBDuk. Slightly unrelated, I am using RNA-Seq data for a coniferous tree species, and am assembling using conventional assemblers such as Trinity, Velvet-Oases, etc. BBMerge is appropriate for this purpose, right? I keep noticing a lot of threads talking about 16S data, or amplicon data, and I haven't even heard of the assemblers that you mentioned.

Thanks!

**Brian Bushnell** · 12-15-2016, 02:26 PM

The primary reason people use read merging is for 16S or other amplicon analyses, I believe. But I don't personally work with 16S very often, so BBMerge is designed and optimized for improving assemblies. Of course, it works on 16S as well, but I use it to optimize assembly pipelines. That said, I have never used Trinity, so I don't know how it would affect a Trinity assembly. As long as you assemble with both the merged and unmerged reads, most assemblers benefit from BBMerge (some quite a lot) so I would expect it to improve a Trinity assembly, but I'd be interested to hear what you experience, if you have the time and interest to run Trinity both ways. Of course RNA-seq assembly quality is especially hard to measure, but metrics like mapping rate, N50, and size are at least somewhat useful.

What kind of assembly are you doing, how long are your reads, and what organism? Is it just RNA-seq?

**tshalev** · 12-15-2016, 02:47 PM

@Brian Bushnell

I am working with foliage tissue from a species of coniferous tree. I'm using 100bp reads, on Illumina HiSeq 4000. I did actually do the comparison tests about a year ago for using merging vs. not merging, on some different data that I had. For these I trimmed first using Trimmomatic though and did not use kmer information or adapter recognition (not sure if these were implemented in BBMerge back then).

My overall consensus was that merging and then assembling with both merged and unmerged reads produced better assemblies than not merging, over four different assemblers (Trinity, Velvet+Oases, SOAPdenovoTrans and transABySS). This was gauged using the optimized assembly score from Transrate, as well as by assembly completeness as measured by BUSCO and contiguity as measured by Conditional Reciprocal Best BLAST (from Transrate) against gene sets of other conifer species. In all cases the gains were enough to warrant the use of merging.

I'm interested to see now how using some of these new features will affect my assembly. I already see an increase in merging rate from ~57% to ~83% using the verystrict parameter, although I won't know whether this includes false positives until I assemble. Regarding adapters expected versus adapters found, I'm seeing ~430,000 adapters expected versus ~6,000 adapters found in ~91.5 million reads after adapter trimming, so I guess this is good?

**Brian Bushnell** · 12-15-2016, 03:20 PM

Originally posted by tshalev:

> My overall consensus was that merging and then assembling with both merged and unmerged reads produced better assemblies than not merging, over four different assemblers (Trinity, Velvet+Oases, SOAPdenovoTrans and transABySS). This was gauged using the optimized assembly score from Transrate, as well as by assembly completeness as measured by BUSCO and contiguity as measured by Conditional Reciprocal Best BLAST (from Transrate) against gene sets of other conifer species. In all cases the gains were enough to warrant the use of merging.

Great, thanks for that info!

> I'm interested to see now how using some of these new features will affect my assembly. I already see an increase in merging rate from ~57% to ~83% using the verystrict parameter, although I won't know whether this includes false positives until I assemble.

OK, please let me know the results - it's useful for giving people guidance on when to use the rem flag. I've never tried it in conjunction with RNA-seq data, just isolates, metagenomes, and single-cell, though it improved all of those cases.

> Regarding adapters expected versus adapters found, I'm seeing ~430,000 adapters expected versus ~6,000 adapters found in ~91.5 million reads after adapter trimming, so I guess this is good?

That indicates the adapter trimming was fairly complete. What version of BBMap are you using, by the way?

**tshalev** · 12-15-2016, 04:40 PM

The latest version, release 36_62. I'll keep you posted.

**j.m.c** · 12-15-2016, 05:48 PM

Thank you for your reply.

Yes, my reads were 87 bp after trimming with Trimmomatic. I had also removed adapter sequences with Trimmomatic, and now I think I see the issue, if I understood correctly what you said:

"The 35bp reads you ended up with are because of the short insert. When you have 2x87bp reads with a 35bp insert, you get 35bp of overlap on the 3' end and then 52bp of the 5' end overhanging on each side; that's adapter sequence. BBMerge trims that off so you are left with only the 35bp of genomic sequence. "

That means the overhangs are removed since BBMerge thinks they are adapter sequences. My reads are from RNA-seq data (not genomic data, I am sorry I didn't specify earlier) and since I removed adapter sequences with Trimmomatic, I am actually losing data if the 5' overhangs were trimmed off...

Is there any way to prevent that with BBMerge?

Otherwise I will try BBMerge with my raw reads without removing adapters.

Thanks!
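The arithmetic in the quoted explanation can be checked with a small sketch. It assumes the simplest case, in which every base sequenced past the end of the insert is adapter; the function name `split_read` is made up for illustration and is not part of BBTools.

```python
def split_read(read_len, insert_len):
    """Return (genomic_bases, adapter_bases) for one read of a pair.

    Assumption: once the sequencer runs off the end of the insert,
    everything after that point is adapter sequence.
    """
    genomic = min(read_len, insert_len)
    adapter = max(0, read_len - insert_len)
    return genomic, adapter


# 2x87bp reads with a 35bp insert, as in the quoted example.
print(split_read(87, 35))
```

For the 2x87bp pair this gives 35 genomic bases and 52 adapter bases per read, which matches the numbers in the quote. When the insert is longer than the read, the adapter count is zero.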

**Brian Bushnell** · 12-15-2016, 07:36 PM

If you know your adapter sequences (or have a list of typical adapter sequences, or actually, you can just say "adapter=default"), you can do this:

Code:

bbmerge.sh in=reads.fq adapter=adapter.fa out=merged.fq outu=unmerged.fq

If BBMerge thinks that you still have untrimmed adapters in those cases... I am quite confident it is correct. Adapter-trimming programs are not perfect (nor is BBMerge or BBDuk). I recommend BBDuk for adapter-trimming because it uses both adapter sequences and overlap information (very conservatively), but you will still end up with some untrimmed reads that actually had adapters. The problem is that Illumina sequence quality declines with each cycle, so by the end of the read (the part that typically overlaps, or has adapter sequence) the error rate can be pretty high. If you use an adapter-trimming program that solely uses sequence-matching to a list of provided adapter sequences, then the high mismatch rate will yield poor adapter-trimming for low-quality reads. BBDuk with the "tbo" flag uses both adapter sequences and overlap information, which for short-insert reads, gives added weight to the high-quality initial bases in a read pair.

So - it's not surprising that Trimmomatic did not do complete trimming. I recommend you use BBDuk instead. It still won't give perfect adapter-trimming, but it will be much better than Trimmomatic.
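To illustrate why overlap information helps with short inserts, here is a toy sketch of overlap-based adapter detection. It is not BBDuk's implementation of "tbo"; the function names, the exact-match default, and the example sequences are all assumptions made for illustration. The idea is that when the insert is shorter than the read, the first bases of read 1 and the reverse complement of the first bases of read 2 cover the same fragment, so the insert length can be found without knowing the adapter sequence.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_short_insert(r1, r2, min_len=12, max_mismatches=0):
    """Return the insert length if the pair looks like a short insert.

    Tries the longest candidate first. For a candidate length i, the first
    i bases of r1 should equal the reverse complement of the first i bases
    of r2. Returns None when no candidate fits.
    """
    longest = min(len(r1), len(r2))
    for insert_len in range(longest, min_len - 1, -1):
        a = r1[:insert_len]
        b = revcomp(r2[:insert_len])
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatches:
            return insert_len
    return None


# Made-up 14bp fragment followed by made-up adapter-like tails.
fragment = "ACGTTGCAAGGCTA"
r1 = fragment + "AGATCGGAAG"
r2 = revcomp(fragment) + "CTGTCTCTTA"

insert_len = find_short_insert(r1, r2)
trimmed_r1 = r1[:insert_len] if insert_len is not None else r1
print(insert_len, trimmed_r1)
```

Because the comparison starts at the beginning of each read, where quality is usually highest, this kind of check leans on the reliable bases rather than on the error-prone tail where the adapter sits.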

**peerah** · 02-15-2017, 04:38 PM

Hi Brian! I have a question: I am working on a fungal ITS metagenomic amplicon library with a pretty wide variation in sizes (200-500 bp). We are doing 2x300, and my second reads are a little bit lower in quality compared to the first reads. Is there any setting in BBMerge that I should modify in order to get the most out of the data? I'm pretty new to the field, so please let me know if you need more information! Thank you.

**Brian Bushnell** · 02-15-2017, 04:52 PM

Hi! With that range you should have at worst a 100bp overlap, which is plenty. But 2x300 MiSeq runs have had major quality problems in the past, so it's possible that trimming would help. I'd suggest adding the flags "qtrim2 trimq=10,15". This will first try to merge the reads, and if unsuccessful (because the quality was too low so there were too many mismatches) quality-trim to Q10 on the right side and retry; then if still unsuccessful do the same at Q15. This isn't necessary unless the data is quite bad, but it will generally increase your merge rate, and is better than simply quality-trimming all reads prior to merging.
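The merge, trim, retry sequence described above can be sketched roughly as follows. This is only an illustration of the control flow, not BBMerge's code: `try_merge` is a hypothetical caller-supplied function, and the right-side trimming here is a deliberately simple stand-in that may differ from the trimming BBMerge actually performs.

```python
def trim_right(seq, quals, threshold):
    """Drop trailing bases whose quality is below threshold (simplified)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def merge_with_retries(pair, try_merge, thresholds=(10, 15)):
    """Try to merge; on failure, right-trim both reads and retry.

    pair is ((seq1, quals1), (seq2, quals2)). try_merge is a hypothetical
    stand-in that returns a merged read, or None if merging failed.
    Returns (merged, threshold_used); threshold_used is None when the
    untrimmed pair merged or when no attempt succeeded.
    """
    merged = try_merge(pair)
    if merged is not None:
        return merged, None
    for q in thresholds:
        # Each retry trims the original reads, so reads that merge
        # without trimming are never shortened.
        trimmed = tuple(trim_right(seq, quals, q) for seq, quals in pair)
        merged = try_merge(trimmed)
        if merged is not None:
            return merged, q
    return None, None
```

The design point is the one made in the post: only pairs that fail to merge get trimmed, so good pairs keep their full length.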

**mdavrandi** · 04-11-2017, 03:37 AM

Hi Peerah,

We are having the same problem in our lab with 2x300 MiSeq runs (very poor Read 2 >Q30 scores), and I was wondering if Brian's recommendation improved the number of paired sequences you obtained from that run.

Cheers

**GenoMax** · 04-11-2017, 04:02 AM

In case you missed it, this post has an explanation for the poor read 2 scores.

**ashuchawla** · 10-10-2017, 04:11 PM

Confusion regarding read merging

Dear Brian, or anybody else who could help me,

I used the following command for BBMerge:
bbmerge.sh in=reads.fq out=merged.fq pfilter=1

I got these stats:
Pairs: 2545201
Joined: 1491688 58.61%
Ambiguous: 439613 17.27%
No Solution: 613393 24.10%
Too Short: 0 0.00%
Avg Insert: 322.6

My questions:
1. What happens to the bases during read merging if there is a mismatch outside of the 12 bases this command considers? As I understand it, the minimum number of overlapping bases to allow merging is 12. In other words, could you please explain exactly how the merge happens between two paired-end reads when I use the above command for a perfect overlap?

2. Could you please explain what "Ambiguous" and "No Solution" mean?

Thank you so much,
Ashu

**Brian Bushnell** · 10-11-2017, 12:48 PM

Hi Ashu,

"Ambiguous" means there are multiple possible overlaps. For example, if read 1 and read 2 both end with "ACACACACACACACACACACAC", there are lots of possible overlap frames, none of which is particularly better than another. So, that would be ambiguous.

"No solution" means there is no overlap satisfying BBMerge's fairly strict criteria for the number of matching and mismatching bases in the best possible overlap frame.

If there is no frame in which the length, entropy (this determines the minimum necessary length), number of matching bases, and number of mismatching bases satisfy the cutoffs, the pair will not be merged and it will be declared "No solution". If there are multiple frames satisfying those cutoffs, and the second-best frame is sufficiently close to the best frame that it's really hard to tell which one is correct, the pair will not be merged and it will be declared "Ambiguous".

The pair will only be merged if there seems to be an unambiguously good solution.
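A toy version of the three outcomes can make the distinction concrete. This sketch is heavily simplified and is not BBMerge's algorithm: it only accepts exact matches, it ignores entropy and base quality, and it calls a pair "Ambiguous" whenever more than one frame matches, whereas the description above compares the best and second-best frames. Read 2 is assumed to be already reverse-complemented, and all names are made up for illustration.

```python
def perfect_overlap_frames(read1, read2_rc, min_overlap=12):
    """List overlap lengths where the end of read1 exactly matches
    the start of read2_rc (read 2 already reverse-complemented)."""
    longest = min(len(read1), len(read2_rc))
    return [n for n in range(min_overlap, longest + 1)
            if read1[-n:] == read2_rc[:n]]


def classify_pair(read1, read2_rc, min_overlap=12):
    """Toy classification into the three categories from the stats."""
    frames = perfect_overlap_frames(read1, read2_rc, min_overlap)
    if not frames:
        return "No Solution"
    if len(frames) > 1:
        return "Ambiguous"
    return "Joined"


# Low-complexity pair: both reads meet in an ACAC... repeat.
repeat = "AC" * 11
print(classify_pair("GATTACAG" + repeat, repeat + "TTGACCA"))

# Pair sharing one unique 14bp stretch.
shared = "GCTTAACGGTCATG"
print(classify_pair("GATTACAG" + shared, shared + "TTCCAGA"))
```

In the repeat example, overlaps of 12, 14, 16, 18, 20 and 22 bases all match equally well, so the toy classifier reports "Ambiguous"; the pair with a single unique shared stretch is reported as "Joined".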

"minoverlap=12" means that reads will never be merged if the best overlap is shorter than 12 bp. pfilter=1 will prevent reads from merging if there are any mismatches (I don't particularly recommend this, but it might be useful in some situations...). pfilter means probability filter, and considers the base qualities, so a read with a mismatch on a Q2 base might pass while an otherwise identical read with a mismatch in a Q40 base might fail. BBMerge will still look for all possible overlaps, and if, say, you have a 30bp overlap with 1 mismatch and a 20bp overlap with 0 mismatches, that would still be declared ambiguous.

Incidentally! The BBMerge paper was accepted by PLOS ONE and will be published soon, so you can read all the algorithmic details there =) But I don't actually know the date it will be published, so feel free to ask me more questions in the meantime if I have not sufficiently clarified things.
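On the remark that a mismatch at a Q2 base might pass while one at a Q40 base might fail: the standard Phred relation shows why those two cases carry such different weight. The sketch below only converts quality scores to error probabilities; how pfilter combines them is not described in this thread, so nothing here should be read as BBMerge's formula.

```python
def phred_to_error_prob(q):
    """Standard Phred definition: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (2, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

A Q2 base is wrong about 63% of the time, so a mismatch there says little about whether the overlap is real. A Q40 base is wrong about once in 10,000 calls, so a mismatch there is strong evidence against the overlap.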

**ashuchawla** · 10-11-2017, 01:17 PM

Thank you Brian for your reply. I have to merge paired-end reads from a MiSeq run (I quality-trimmed them at Q30). The overlap is around 100bp according to the experimentalist. What options would you recommend to merge these reads? Once I have the merged reads, I will use dedup to get all unique merged reads and run further analysis on them.

Ashu

**GenoMax** · 10-11-2017, 03:52 PM

> I quality trimmed them at Q30

That is overly strict. What type of dataset is this and do you have a reference genome available?
