SEQanswers

Old 06-23-2014, 07:15 PM   #21
woodydon

Quote:
Originally Posted by Brian Bushnell View Post
Woody,

bbmerge does not do error-correction; for that, you need to use the script "ecc.sh", e.g.

ecc.sh in=reads.fq out=corrected.fq

And as Genomax implied, all of my programs are sensitive to file name extensions. If you say "out=reads.fasta" you'll get fasta; if you say "out=reads.sam.gz", you'll get gzipped sam, and so forth.
I see~ thanks for GenoMax's and your explanations!

Woody
Old 06-24-2014, 04:06 PM   #22
woodydon

Quote:
Originally Posted by Brian Bushnell View Post
Woody,

bbmerge does not do error-correction; for that, you need to use the script "ecc.sh", e.g.

ecc.sh in=reads.fq out=corrected.fq

And as Genomax implied, all of my programs are sensitive to file name extensions. If you say "out=reads.fasta" you'll get fasta; if you say "out=reads.sam.gz", you'll get gzipped sam, and so forth.
Hi Brian,

I used ecc.sh to correct my raw .fq files and then ran bbmerge.sh again. I found the join rate is much lower:

Code:
Reads:          136095860
Joined:         13084199        9.614%
Ambiguous:      31319           0.023%
No Solution:    122968112       90.354%
Too Short:      12230           0.009%
Avg Insert:                     120.9
Standard Deviation:             12.9

Insert range:           26 - 138
90th percentile:        134
50th percentile:        123
10th percentile:        107
Without ecc.sh, the joined reads were about 30-40%. Now it is only 9%...

Woody
Old 06-24-2014, 04:20 PM   #23
Brian Bushnell

Woody,

That's very strange. The only explanation I can come up with is that the order of the reads is getting changed so read pairs are no longer always together.

If you ran ecc.sh twice on different files, that would make sense; unless you add the "ordered" flag, the reads will not necessarily come out in the same order. But it's best to correct all the reads together so that there is more coverage. So, assuming your reads are in two files (rather than interleaved), you should run it like this:

ecc.sh in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

And if your reads are interleaved in 1 file, you can add the "int=t" flag to ensure the reads are treated as interleaved. This is autodetected but the autodetection will fail if the read names are different than expected.

ecc.sh in=reads.fq out=corrected.fq int=t

If you do not think that was the problem, then please post the command line and standard out / standard error output of ecc. Thanks!
Old 06-24-2014, 06:41 PM   #24
woodydon

My command was:

Code:
~/bbmap/ecc.sh threads=8 -Xmx14g in=~/reads/n_R1.fastq out=~/reads/n_eccR1.fastq
Ecc.sh's log:

Code:
Settings:
threads:                8
k:                      31
deterministic:          false
toss error reads:       false
passes:                 1
bits per cell:          16
cells:                  3147.19M
hashes:                 3
prefilter bits:         2
prefilter cells:        13.56B
prefilter hashes:       2
base min quality:       5
kmer min prob:          0.5

target depth:           40
min depth:              6
max depth:              40
min good kmers:         15
depth percentile:       54.0
ignore dupe kmers:      true
fix spikes:             false

Changed from ASCII-33 to ASCII-64 on input quality 94 while prescanning.
Made prefilter:         hashes = 2       mem = 3.15 GB          cells = 13.54B          used = 3.739%
Made hash table:        hashes = 3       mem = 5.86 GB          cells = 3144.33M        used = 6.197%

Estimated kmers of depth 1-3:   203133322
Estimated kmers of depth 4+ :   54945575
Estimated unique kmers:         258078898

Table creation time:            1887.618 seconds.
Started output threads.
Table read time:                1510.451 seconds.       6757.71 kb/sec
Total reads in:                 136095860       100.000% Kept
Total bases in:                 10207189500     100.000% Kept
Error reads in:                 12059846        8.861%
Error type 1:                   3939140         2.894%
Error type 2:                   6350473         4.666%
Error type 3:                   9031666         6.636%

During Error Correction:
Errors Suspected:               31378616
Errors Corrected:               15385594
Errors Marked:                  0

Total kmers counted:            6085312047
Total unique kmer count:        408064601

Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.

Percent unique:                  6.71%
Depth average:                  14.91   (unique kmers)
Depth median:                   1       (unique kmers)
Depth standard deviation:       535.10  (unique kmers)

Depth average:                  6532.20 (all kmers)
Depth median:                   2141    (all kmers)
Depth standard deviation:       26317.70        (all kmers)

Approx. read depth median:      3568.33

Total time:                     3399.627 seconds.       3002.44 kb/sec
Hope it helps.

Woody
Old 06-25-2014, 10:05 AM   #25
Brian Bushnell

Woody,

Yep, the problem was error-correcting the two files independently, which will reorder the reads on a multithreaded machine. So, this command will solve it:

ecc.sh threads=8 -Xmx14g in=~/reads/n_R1.fastq in2=~/reads/n_R2.fastq out=~/reads/n_eccR1.fastq out2=~/reads/n_eccR2.fastq
Old 07-09-2014, 12:48 PM   #26
salamay

Hi Brian,

Could you try benchmarking against SeqPrep (https://github.com/jstjohn/SeqPrep)?
Old 07-09-2014, 06:15 PM   #27
Brian Bushnell

salamay,

I will certainly add it to my list of things to do.
Old 07-15-2014, 04:24 PM   #28
Elsie

This is a very nice, very useful thread, thank you, and thank you Brian for the details on BBmerge. I have just started using PEAR so it was good to read these comments.
Old 10-06-2014, 05:53 AM   #29
zhangjiajie

Hi,
I am one of the authors of PEAR. I must say this is a very unfair benchmark for PEAR. If you really want to show your program is better, please run PEAR with the default parameters like this:
pear -f r1.fq -r r2.fq -o outpear

The statistical test is an integral part of PEAR; it makes no sense to switch it off. Also, you should not limit the minimal merge length in PEAR.
Old 10-06-2014, 11:40 AM   #30
Brian Bushnell

I tried to make the comparison as fair as possible, but I will retest this with your suggested parameters and post the results sometime later this week.
Old 11-17-2014, 08:42 AM   #31
ikauser

Hi Brian,

My apologies if I missed this, but does this tool allow for stitching PE reads without an insert?
Old 11-17-2014, 10:13 AM   #32
Brian Bushnell

I don't understand what you mean by "without an insert". BBMerge will only stitch together PE reads that overlap. However, there's another program in BBTools with the shellscript "bbmergegapped.sh" that can join reads that are nonoverlapping, if you have sufficient coverage and the insert size is not too big. It uses more memory, though. For example:

bbmergegapped.sh -Xmx31g in=reads.fq out=merged.fq gap=50

This will join reads that are overlapping, as well as nonoverlapping reads with a gap in the middle of up to a little under 50bp. So, for example, with 2x150bp reads, bbmerge with the default settings will join reads with an insert size of up to 287bp (13bp overlap), while bbmergegapped with "gap=50" will join reads with an insert size of up to 341bp (41bp gap). But the nonoverlapping reads will only be joined if there is sufficient coverage (at least 10x or so) in your input reads.

It's also possible for BBMerge to join reads based on coordinates from mapping, though that's a little more complicated.
Old 01-07-2015, 04:10 AM   #33
sarvidsson
JNI error in loose/vloose mode

Hi Brian,

I'm testing BBmerge for our pipelines and am scanning the arguments a bit. I've compiled the JNI C libraries and everything runs fine with the normal settings; however, when I try the "loose" or "vloose" modes I get UnsatisfiedLinkError exceptions:

Code:
Exception in thread "Thread-8" java.lang.UnsatisfiedLinkError: jgi.BBMergeOverlapper.mateByOverlapJNI([B[B[B[B[IIIIIIII)I
        at jgi.BBMergeOverlapper.mateByOverlapJNI(Native Method)
The same command line works fine when I skip "usejni=t", or with the strict, fast and normal modes (these work with/without JNI, loose/vloose only work without JNI). Did I overlook something in the documentation, or is this a real bug?

Thanks,
Samuel
Old 01-07-2015, 05:14 AM   #34
sarvidsson
The "margin" cmd line argument

I'm scanning the command line arguments for optimal merging of the reads I have. They are amplicons (16S) with a rather well-defined fragment size (440-470), and the libraries I merge are known to be very clean (few PCR by-products), so they should merge very well (> 95 % have the correct size, but sequencing quality on R2 varies according to the MiSeq V3 chemistry lot number). I notice a considerable difference when I modify the "margin" parameter (i.e. merge rate goes from 52 % to 98 % when I lower margin to zero).

Could you give a clear description of the heuristics controlled by this parameter? The documentation gives no real clue, unfortunately...

I'm not worried about ambiguous merges (length-wise) - these can be controlled pretty well by filtering on the final length. However, I'm more concerned about correctly eliminating sequencing errors from R2 (and I'd like to find parameters that are reasonably robust to varying quality from R2). I would prefer not to run any k-mer error correction beforehand, to avoid skewing the true sequence diversity on the lower k-mer spectrum (these are not single-species shotgun sequences). In my test runs, the loose/vloose and trimming parameters have a low impact on the merge rates (up to 10 %). The margin parameter, however, has an extreme impact.

Last edited by sarvidsson; 01-07-2015 at 05:15 AM. Reason: clarification on ambiguous merges
Old 01-07-2015, 12:10 PM   #35
Brian Bushnell

Quote:
Originally Posted by sarvidsson View Post
Hi Brian,

I'm testing BBmerge for our pipelines and am scanning the arguments a bit. I've compiled the JNI C libraries and everything runs fine with the normal settings; however, when I try the "loose" or "vloose" modes I get UnsatisfiedLinkError exceptions:

Code:
Exception in thread "Thread-8" java.lang.UnsatisfiedLinkError: jgi.BBMergeOverlapper.mateByOverlapJNI([B[B[B[B[IIIIIIII)I
        at jgi.BBMergeOverlapper.mateByOverlapJNI(Native Method)
The same command line works fine when I skip "usejni=t", or with the strict, fast and normal modes (these work with/without JNI, loose/vloose only work without JNI). Did I overlook something in the documentation, or is this a real bug?

Thanks,
Samuel
Hi Samuel,

That was indeed a bug; thanks for finding it! It's fixed now in v34.25, which I've uploaded to SourceForge.

Quote:
I'm scanning the command line arguments for optimal merging of the reads I have. They are amplicons (16S) with a rather well-defined fragment size (440-470), and the libraries I merge are known to be very clean (few PCR by-products), so they should merge very well (> 95 % have the correct size, but sequencing quality on R2 varies according to the MiSeq V3 chemistry lot number). I notice a considerable difference when I modify the "margin" parameter (i.e. merge rate goes from 52 % to 98 % when I lower margin to zero).

Could you give a clear description of the heuristics controlled by this parameter? The documentation gives no real clue, unfortunately...

I'm not worried about ambiguous merges (length-wise) - these can be controlled pretty well by filtering on the final length. However, I'm more concerned about correctly eliminating sequencing errors from R2 (and I'd like to find parameters that are reasonably robust to varying quality from R2). I would prefer not to run any k-mer error correction beforehand, to avoid skewing the true sequence diversity on the lower k-mer spectrum (these are not single-species shotgun sequences). In my test runs, the loose/vloose and trimming parameters have a low impact on the merge rates (up to 10 %). The margin parameter, however, has an extreme impact.
The way BBMerge works is to look at all possible overlaps (within the bounds specified by minoverlap0 [the number of overlapping bases for the longest possible insert size] and minoverlapinsert [the number of overlapping bases for the shortest possible insert size]). Those are confusingly named, so let me clarify - if minoverlap0=10 and minoverlapinsert=20, then the insert size range examined, for 100bp reads, will be from 190 (10bp overlap) down to 20 (20bp overlap). Within those possible overlaps, only the ones between minoverlap and mininsert will actually be considered as valid. So in this example, if minoverlap=15 and mininsert=25, then insert sizes from 190 to 20 will be tested, but only insert sizes from 185 to 25 will be potentially kept.

So, an insert size of 192 would never even be considered; an insert size of 187 would be considered, but if it was determined to be the best overlap, it would be rejected as being too short for a high confidence merge; and an insert size of 182 would be considered, and if it was the best, accepted.

Therefore, if you know your expected insert size range to be well-defined, you can increase the merge rate by restricting these values. For example, if you have 2x100bp reads and expect all of the insert sizes to range from 140 to 160, you could set "minoverlap0=20 minoverlap=30 minoverlapinsert=130 mininsert=120". That would restrict the insert range that is examined to 120-180, and the range that is potentially accepted to 130-170. The more you can restrict the range, the less likely that you will get spurious overlaps, and the higher the merge rate because there is less chance of ambiguity.

However, note that the parameters specifically describe overlap lengths rather than insert sizes. That's because, due to trimming, the insert size and overlap length may no longer be correlated; and if overlap detection fails due to too many mismatches, the reads will be automatically quality-trimmed and overlapping will be re-attempted. You can disable that with the "trimonfailure" flag. So in other words, these parameters are only strictly correlated to insert size if all reads are the same length and all trimming is disabled. Still, if your pairs do not have really short overlaps (insert sizes near double read length), you can increase both the merge rate and accuracy by increasing minoverlap and minoverlap0.

Ambiguous overlaps - those in which the best and second-best overlap have a similar number of mismatches - are rejected. Whether or not an overlap is ambiguous is determined by the "margin" setting. If you set margin to 1, the best overlap has to have at least 1 fewer mismatch than the second best. If you set margin to 0, then two overlaps can have the same number of mismatches, and the longer overlap will be considered best, but this will generate more false positives.

You can empirically test the true and false positive merge rate using synthetic data, which is how I determined the default parameters; the actual optimal parameters, of course, depend on the insert size distribution, quality, read lengths, and sequence complexity of your input data.
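To put the 2x100bp example above into a concrete command, something like this would restrict the examined and accepted insert ranges as described (just a sketch; the file names are placeholders and the numbers are the illustrative values from above, not recommendations):

bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq minoverlap0=20 minoverlap=30 minoverlapinsert=130 mininsert=120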
Old 01-08-2015, 01:26 AM   #36
sarvidsson

Brian, thanks for the detailed description, that helps.

Quote:
Originally Posted by Brian Bushnell View Post
Ambiguous overlaps - those in which the best and second-best overlap have a similar number of mismatches - are rejected. Whether or not an overlap is ambiguous is determined by the "margin" setting. If you set margin to 1, the best overlap has to have at least 1 fewer mismatch than the second best. If you set margin to 0, then two overlaps can have the same number of mismatches, and the longer overlap will be considered best, but this will generate more false positives.
Alright, so if I get that correctly, a read pair with two possible perfect match overlaps (both at least minoverlap0, and both giving fragments of at least minoverlapinsert), like the following:

Code:
1.
----------------->
            ||||||
            <-----------------------

2.
----------------->
   |||||||||||||||
   <-----------------------
Then, if margin=0, 2. would be selected, since this is "the longer overlap", right? 1. would give "the longest final fragment size" (I prefer to use the term "fragment", as "insert" is often used ambiguously). For my use, I would then try to control overlap with the minoverlap0 and minoverlapinsert parameters and leave the margin at 1 or 2. I can pre-trim the reads and discard too short ones quite easily. If some reads by chance end up merged as 1. above, these can be well handled downstream in the analysis pipeline.

Quote:
Originally Posted by Brian Bushnell View Post
You can empirically test the true and false positive merge rate using synthetic data, which is how I determined the default parameters; the actual optimal parameters, of course, depend on the insert size distribution, quality, read lengths, and sequence complexity of your input data.
The problem with synthetic data is that it isn't easy to predict sequence complexity. I will try to use some data for which I have a predictable outcome... The nice thing with amplicon data is that fragment size and read length are well-defined parameters.

Previously we've been using FLASh to merge read pairs, which seems to work fine for most of our applications, especially with HiSeq data. With long-overlap MiSeq data (2x300 bp reads of 150-500 bp fragments) we sometimes have to use irritating workarounds and filters to cope with some peculiarities of FLASh (e.g. no dovetailing allowed), and we've seen rather high sequence complexity coming out of the amplicons, leading us to think that mismatched base pairs aren't handled correctly or robustly. I'll run side-by-side comparisons with the BBmerge data to see whether it makes more sense.

Last edited by sarvidsson; 01-08-2015 at 01:35 AM. Reason: Corrected the example overlaps
Old 01-08-2015, 01:31 AM   #37
sarvidsson

Another thing: would it be possible to "debug" ambiguous cases? E.g. getting the sequence identifiers and all suggested insert sizes would be of great use for finding the best parameters. If there are some "hooks" in the arguments or source code, let me know...
Old 01-09-2015, 11:34 AM   #38
Brian Bushnell

Quote:
Originally Posted by sarvidsson View Post
Alright, so if I get that correctly, a read pair with two possible perfect match overlaps (both at least minoverlap0, and both giving fragments of at least minoverlapinsert), like the following:

Code:
1.
----------------->
            ||||||
            <-----------------------

2.
----------------->
   |||||||||||||||
   <-----------------------
Then, if margin=0, 2. would be selected, since this is "the longer overlap", right? 1. would give "the longest final fragment size" (I prefer to use the term "fragment", as "insert" is often used ambiguously). For my use, I would then try to control overlap with the minoverlap0 and minoverlapinsert parameters and leave the margin at 1 or 2. I can pre-trim the reads and discard too short ones quite easily. If some reads by chance end up merged as 1. above, these can be well handled downstream in the analysis pipeline.
Actually, I was verifying the exact behavior when "margin=0", and found a bug that favored shorter overlaps. I fixed that now in the latest release (34.27). So yes, now it behaves as in your example. Additionally, there is an entropy test which you can disable (by setting "entropy=f") which will increase the minoverlap value for low-complexity reads. It's enabled by default because it increases accuracy, but it interfered with my testing of reads that had multiple valid overlaps because those were, of course, low complexity reads.
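For instance, a run along these lines (a sketch; the file names are placeholders) would disable the entropy test and, with margin=0, let ties resolve to the longer overlap:

bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq entropy=f margin=0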

Quote:
The problem with synthetic data is that it isn't easy to predict sequence complexity. I will try to use some data for which I have a predictable outcome... The nice thing with amplicon data is that fragment size and read length are well-defined parameters.

Previously we've been using FLASh to merge read pairs, which seems to work fine for most of our applications, especially with HiSeq data. With long-overlap MiSeq data (2x300 bp reads of 150-500 bp fragments) we sometimes have to use irritating workarounds and filters to cope with some peculiarities of FLASh (e.g. no dovetailing allowed), and we've seen rather high sequence complexity coming out of the amplicons, leading us to think that mismatched base pairs aren't handled correctly or robustly. I'll run side-by-side comparisons with the BBmerge data to see whether it makes more sense.
I would love to see the results of an independent comparison, if you do one!

Quote:
Another thing: would it be possible to "debug" ambiguous cases? E.g. getting the sequence identifiers and all suggested insert sizes would be of great use for finding the best parameters. If there are some "hooks" in the arguments or source code, let me know...
In v34.27 I also enabled the "verbose" flag. Just add it to the command line, along with "t=1" (to restrict threads to 1 so that the verbose output will be in order) and don't run it in jni mode. This will show every overlap and the scores that are generated for that overlap - it may help, especially if you look at the source code - bbmap/current/jgi/BBMergeOverlapper.java function mateByOverlapJava_unrolled().

Oh - I also documented the "mismatches" setting. That controls the maximum number of mismatches that are allowed in an overlap. Default is 3, but like "margin" and many other parameters, it gets changed if you set the "fast", "loose", etc flags. Setting any parameter explicitly will override the settings incurred by fast/loose/etc.
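So a debugging run could look roughly like this (a sketch; the file names are placeholders, "usejni=t" is simply left off so the pure-Java overlapper is used, and mismatches=3 just restates the default):

bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq verbose=t t=1 mismatches=3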

Last edited by Brian Bushnell; 01-09-2015 at 11:37 AM.
Old 01-20-2015, 02:58 PM   #39
Marisa_Miller

Hi Brian,

I used BBmerge to merge my 2x300bp reads last month, but now I need to determine the size distribution of my merged reads. Is there a way to do this with BBmerge? I know how to determine the insert size of the PE reads before merging, but not after merging.

Thanks,
Marisa
Old 01-20-2015, 03:57 PM   #40
Brian Bushnell

Hi Marisa,

BBMerge will only calculate the insert size distribution for paired reads, but the insert size of the merged reads will be equal to their actual length. So you can calculate the distribution after merging like this:

readlength.sh in=merged.fastq out=hist.txt

-Brian