Old 06-04-2014, 03:03 PM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Introducing BBMerge: A paired-end read merger

I'd like to introduce a new paired-read merging tool, BBMerge, and present a comparison with some similar tools: FLASH, PEAR, COPE, and fastq-join. BBMerge has actually existed since Aug. 2012, but I only recently got around to benchmarking it.

First, what does BBMerge do? Illumina paired reads can be intentionally generated with an insert size (the length of the template molecule) that is shorter than the sum of the lengths of read 1 and read 2. For example, at JGI, the bulk of our data is 2x150bp fragment libraries designed to have an average insert size of 270bp. Because they overlap, the two reads can be merged into a single, longer read. Single reads are easier to assemble, and longer reads give a better assembly; furthermore, the merging process can detect and correct errors where the reads overlap, yielding a lower error rate than the raw reads. But if the reads are merged incorrectly – yielding a single read with a length different from the actual insert size of the original molecule – errors will be introduced. So for this process to be useful, it is crucial that reads be merged conservatively enough that the advantages outweigh the disadvantages of incorrect merges.
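To make the overlap idea concrete, here is a minimal sketch of overlap-based merging in Python. This is an illustration only, not BBMerge's actual algorithm (which also weighs base qualities and rejects ambiguous overlaps):

Code:
# Illustration only: a naive overlap-based merger.
def revcomp(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def naive_merge(read1, read2, min_overlap=12, max_mismatches=2):
    """Return the merged sequence, or None if no acceptable overlap is found."""
    r2 = revcomp(read2)  # read 2 is sequenced from the opposite strand
    for overlap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], r2[:overlap]
        if sum(a != b for a, b in zip(tail, head)) <= max_mismatches:
            return read1 + r2[overlap:]  # merged length = insert size
    return None

# Tiny demo: a 20bp "fragment" sequenced as 2x16bp reads (12bp overlap).
fragment = "AAAACCCCGGGGTTTTACGT"
read1, read2 = fragment[:16], revcomp(fragment[-16:])
assert naive_merge(read1, read2) == fragment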

On to the comparison. I generated 4 million synthetic pairs of 2x150bp reads from E.coli, with errors mimicking the Illumina quality profile. The insert size mean was 270bp and the standard deviation was 30bp, with the min and max insert sizes capped at 3 standard deviations. The exact command was:

randomreads.sh ref=ecoli_K12.fa reads=4000000 paired interleaved out=reads.fq.gz zl=6 len=150 mininsert=180 maxinsert=360 bell maxq=38 midq=28 minq=18 qvariance=16 ow nrate=0.05 maxns=2 maxnlen=1

Then I renamed these reads with names indicating their insert size, e.g. “insert=191” for a pair with insert size 191bp:

bbrename.sh in=reads.fq.gz int renamebyinsert out1=r1.fq out2=r2.fq

Next I merged the reads with each program, varying the stringency along one parameter.

The FLASH command varied the -x parameter from 0.01 (the min allowed) through 0.25 (the default):

/path/FLASH -M 150 -x 0.25 r1.fq r2.fq 1>FLASH.log

The BBMerge command used 5 different settings, in order of decreasing stringency: “strict”, “fast”, default, “loose”, “vloose”:

java -da -Xmx400m -cp /path/ jgi.BBMerge in1=r1.fq in2=r2.fq out=x.fq minoi=150

The PEAR command varied the p-value from 0.0001 to 1 (note that only 5 values were allowed by the program):

pear -n 150 -p 1 -f r1.fq -r r2.fq -o outpear

The COPE command varied the -c (minimum match ratio parameter) from 1.0 (no mismatches) to 0.75 (the default):

./cope -a /dev/shm/r1.fq -b /dev/shm/r2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/r1_unconnected.fq -3 /dev/shm/r2_unconnected.fq -m 0 -s 33 -c 1.0 -u 150 -N 0 1>cope.o 2>&1

The fastq-join (from ea-utils) command swept the -p parameter from 0 to 25 and -m from 4 to 12; only the -p sweep is shown, as it gave better results. Sample command:

fastq-join r1.fq r2.fq -o fqj -p 0

All programs (except fastq-join, which does not support the option) were set to disallow insert sizes shorter than 150bp. This is because FLASH (which appears to be the most popular merger at present) does not seem to be capable of handling reads with insert sizes shorter than the read length. If FLASH is forced to allow such overlaps (using the -O flag), the results it produces from those reads are 100% wrong, yielding performance so bad as to be hard to believe, which would undermine this analysis; so I limited the generated insert sizes to a minimum of 180, which FLASH can handle. The output was graded by my grading script to determine the percent correct and incorrect:

grademerge.sh in=x.fq
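(For anyone curious what "correct" means here: since the pairs were renamed with their true insert size, grading reduces to comparing each merged read's length against the "insert=" value in its header. Below is a rough sketch of that idea in Python, not the actual grademerge.sh code:)

Code:
# Sketch of the grading idea (not the actual grademerge.sh code):
# the true insert size is encoded in each read name by the bbrename.sh step
# above (e.g. "insert=191"), so a merged read is correct if its length matches.
import re

def grade(merged_fastq):
    correct = incorrect = 0
    with open(merged_fastq) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # FASTQ header line
                true_insert = int(re.search(r"insert=(\d+)", line).group(1))
            elif i % 4 == 1:  # sequence line
                if len(line.rstrip()) == true_insert:
                    correct += 1
                else:
                    incorrect += 1
    return correct, incorrect

print(grade("x.fq"))  # x.fq is the merged output from the commands above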

Then the results were plotted in a ROC curve, attached. The X-axis is logarithmic in one figure and linear in the other.




FLASH achieves a slightly higher merging rate at its default sensitivity, but does so at the expense of an extremely high false-positive rate. BBMerge has a systematically lower false-positive rate than FLASH at any sensitivity. fastq-join yields the highest merging rate, at the cost of a false-positive rate second only to PEAR's. COPE is very competitive with FLASH and strictly exceeds it with "-N 0"; unfortunately, it is unstable at that setting. COPE is stable at "-N 1" (the default) but is generally inferior to FLASH at that setting. PEAR appears to be inferior to all the other options across the board.

Next, I tested the speed. Since BBMerge, FLASH, and PEAR are multithreaded, I tested with various numbers of threads to determine peak performance. The tests were run on a 4x8-core Sandy Bridge E node with no other processes running; it had hyperthreading enabled, for 32 cores and 64 hardware threads. Input came from a ramdisk and output went to a ramdisk, so I/O bandwidth was not limiting. The results attached to this post show an earlier run where input came from local disk and output went to a networked filesystem, but that run was I/O-limited; the image embedded below reflects the ramdisk run.



FLASH has a slight advantage with a low number of threads, but BBMerge ultimately scales better to reach a higher peak of 235 megabasepairs/s (Mbp/s) versus FLASH’s 160 Mbp/s. BBMerge’s peak came at 20 threads and FLASH’s at 36 threads. 235 Mbp/s in this case equates to 499 megabytes per second and BBMerge was probably input-limited (by CPU usage of the input thread) at that point. PEAR was very slow so I did not run the full suite of speed testing on it, but it achieved 11.8 Mbp/s at 16 threads (on slightly different hardware). The single-threaded programs (COPE and fastq-join) beat PEAR up to around 16 threads, but never touched BBMerge or FLASH. Interestingly, PEAR is the only program that scales well with hyperthreading; I speculate that the program does a lot of floating-point operations.
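(As a rough sanity check on that Mbp/s-to-MB/s conversion; the per-record byte count below is an estimate, since real FASTQ header lengths vary:)

Code:
# Rough check that 235 Mbp/s of 150bp FASTQ reads is about 500 MB/s.
# A 150bp FASTQ record is ~150 sequence bytes + ~150 quality bytes, plus a
# header, a '+' line, and newlines: call it ~320 bytes (header length assumed).
bytes_per_record = 320.0
bases_per_record = 150.0
print(235 * bytes_per_record / bases_per_record)  # ~501 MB/s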

In conclusion, for this dataset, BBMerge had substantially better specificity at similar sensitivity to all competitors. fastq-join had the highest merge rate, by a slight margin, but at the cost of a false-positive rate that I would consider unusable in a production environment. The speeds of BBMerge and FLASH are similar per thread, with FLASH being around 10% higher; but BBMerge appears to be more scalable, and as such reached nearly 150% of FLASH's peak throughput. So, for this dataset, unless false positives are irrelevant, BBMerge wins.

It appears imprudent to use FLASH (or most other mergers) with the default sensitivity, due to the huge increase in false positives for virtually no gain in true positives. On the ROC curve, FLASH's rightmost point is the default “-x 0.25”, while the third from the right is “-x 0.15”, which gives a much better signal-to-noise ratio of 33.3dB and 71.1% merged compared to the default 28.8dB and 73.8% merged. If you desire the highest merge rate regardless of false-positives, fastq-join appears optimal. For libraries that have all N-containing reads removed, it may be that COPE is slightly better than FLASH, though I have not verified this. If false-positives are detrimental to your project, I recommend BBMerge on default sensitivity, which gave a 42.7dB SNR and 55.8% merge rate for this data. The merge rate can be increased using the "loose" or "vloose" flags, at the expense of more false-positives, but I believe that for most projects, minimizing noise is more important than maximizing signal.
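(A note on the dB figures: if SNR here means 10*log10(percent correct / percent incorrect), which is an assumption rather than something stated above, then the quoted values imply roughly the following incorrect-merge rates:)

Code:
# Assumption (not stated above): SNR = 10 * log10(percent correct / percent incorrect).
# Under that reading, the quoted figures imply roughly these incorrect-merge rates.
def implied_incorrect_pct(merged_pct, snr_db):
    return merged_pct / 10 ** (snr_db / 10)

print(implied_incorrect_pct(73.8, 28.8))  # FLASH default (-x 0.25): ~0.10%
print(implied_incorrect_pct(71.1, 33.3))  # FLASH -x 0.15:           ~0.03%
print(implied_incorrect_pct(55.8, 42.7))  # BBMerge default:         ~0.003%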

BBMerge is available with the BBTools package here:

https://sourceforge.net/projects/bbmap/

On Linux systems you can simply run the shellscript, like this:

bbmerge.sh in1=read1.fq in2=read2.fq out=merged.fq

BBMerge supports interleaved reads, ASCII-33 or 64, gzip, fasta, scarf, and various other file formats. It also makes handy insert-size distribution histograms. Everything in the BBTools package is compatible with any OS that supports Java - Linux, Windows, Mac, even cell phones. The shellscripts may only work in Linux, but the actual programs will run anywhere. For more details, run the shellscript with no arguments, or post a question here.

EDIT: Results updated March 25, 2015, on a poster:
https://drive.google.com/a/lbl.gov/f...ew?usp=sharing
Attached Images
File Type: png BBMerge_ROC.png (33.9 KB, 120 views)
File Type: png BBMerge_Speed.png (22.1 KB, 33 views)

Last edited by Brian Bushnell; 03-25-2015 at 03:38 PM. Reason: Updated embedded images
Old 06-05-2014, 04:30 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,489

Brian: Have you put a single help document together for all the programs in BBMap? I am not sure if you have done that.

I am losing track of all the programs that are now part of the BBMap suite.
Old 06-05-2014, 04:02 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

I have updated the speed results by moving the input and output files to a ramdisk to eliminate I/O bottlenecks, and by finishing all data points for PEAR. Results are attached. Also, if anyone has a different program they use for read merging that they want added to the roundup, please post it here before my test data is purged from /scratch in ~6 weeks.

Quote:
Originally Posted by GenoMax View Post
Brian: Have you put a single help document together for all the programs in BBMap? I am not sure if you have done that.

I am losing track of all the programs that are now part of the BBMap suite.
I am too... I definitely need to put together such a document! Hopefully soon. I have an internal document describing them but it's not totally relevant outside our cluster.
Attached Images
File Type: png BBMerge_Speed2.png (21.0 KB, 68 views)
Old 06-06-2014, 02:00 AM   #4
jimmybee
Senior Member
 
Location: Adelaide, Australia

Join Date: Sep 2010
Posts: 119

I've been using it over the last couple of weeks and it has been really quick. I'm actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding some options for adapter matching would be handy (although I do know that you already have a trimming script in the BBTools suite).
Old 06-06-2014, 05:43 AM   #5
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 274

You may want to add COPE to your tests since it is published. Also, there are a number of other unpublished (I think) programs on this blog post, but I don't think it's necessary to test every script out there. I mention COPE because it works with Fasta, whereas the other tools require Fastq, so that is a major convenience that this tool provides.

There are a couple of caveats with COPE that I can share. 1) The paired reads have to be in the same order in the forward and reverse files. The program will run just fine if they are not, but the number of reads merged will be very low (this would be a nice thing to check for). 2) There is an option for setting the overlap mode and this has to be given or the program will fail silently ("-m 0" is the simple overlap mode).
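(A quick way to check caveat 1 before running a merger is to compare read names pairwise. A minimal Python sketch, assuming names differ only by a /1 vs /2 suffix or by the space-separated comment field:)

Code:
# Pairing sanity check: confirm the Nth read name in r1.fq matches the Nth in r2.fq.
import re
from itertools import islice, zip_longest

def read_names(path):
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:                         # FASTQ header line
                name = line[1:].split()[0]         # drop the comment field
                yield re.sub(r"/[12]$", "", name)  # drop a trailing /1 or /2

def check_pairing(r1, r2, n=10000):
    """Compare the first n read names of the two files pairwise."""
    pairs = zip_longest(read_names(r1), read_names(r2))
    return all(a == b for a, b in islice(pairs, n))

print(check_pairing("r1.fq", "r2.fq"))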
Old 06-06-2014, 10:31 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by jimmybee View Post
I've been using it over the last couple of weeks and it has been really quick. I'm actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding some options for adapter matching would be handy (although I do know that you already have a trimming script in the BBTools suite).
jimmybee,

FYI, the latest version of BBMerge by default rejects insert sizes shorter than 35 and does not look for insert sizes shorter than 25, so if you have super-short insert libraries, you might want to reduce those with these flags:
mininsert=12 minoi=12

"mininsert" will reject anything shorter than that value as "too short", while "minoi" (minoverlapinsert) will tell the program to not look for inserts shorter than that. The difference is subtle. Sorry, the default values listed for them in the shellscript are wrong; I will fix that with the next release. Of course for most purposes you don't ever see insert sizes that short but you can in certain experiments involving tiny RNAs.

I have no plans to explicitly add adapter matching to BBMerge, but it sounds like it may be useful to add a mode that trims reads with insert sizes shorter than read length rather than merging them. Also, it would be fairly easy to add overlap-detection to BBDuk for greater adapter-trimming sensitivity. Hmmm, I'll probably do both.

@SES,

For some reason I can't get COPE to work:

Quote:
/dev/shm$ ./cope -a /dev/shm/sr1.fq -b /dev/shm/sr2.fq -o /dev/shm/cmerged.fq -m 0 -s 33
Program start..
Program: COPE (Connecting Overlapped Pair-End reads)
Version: v1.1.3
Author: BGI-ShenZhen
CompileDate: Jun 6 2014 time: 10:30:24
Current time: Fri Jun 6 10:46:56 2014
Command line: ./cope -a /dev/shm/sr1.fq -b /dev/shm/sr2.fq -o /dev/shm/cmerged.fq -m 0 -s 33
Error: input file is empty, please check the set of input file!
Other programs can see the files fine. I moved them to various places and tried both absolute and relative paths. Any ideas?
Old 06-06-2014, 01:25 PM   #7
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 274

Quote:
Originally Posted by Brian Bushnell View Post

@SES,

For some reason I can't get COPE to work:


Other programs can see the files fine. I moved them to various places and tried both absolute and relative paths. Any ideas?
Try specifying the files for unconnected pairs. I would try:

Code:
./cope -a /dev/shm/sr1.fq -b /dev/shm/sr2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/sr1_unconnected.fq -3 /dev/shm/sr2_unconnected.fq -m 0 -s 33
I know that solution doesn't logically follow from the error message, but from what I remember, this program either gives no messages or nondescript messages on failure. If that doesn't work, I'd check the input again, and then try without the "-s 33" as a last suggestion. It could be the input, but I'm guessing it's the combination of options, or lack thereof, that is causing the issue.
Old 06-06-2014, 04:23 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Thanks SES, that solved it. I guess all outputs must be specified. I've attached the new graphs and updated the inline images in the first post.

COPE performed much better when I set the flag "-N 0", because 5% of my synthetic reads contain an N, so I used that flag; but it was unstable and seg-faulted at any value of -c under 0.92. I included both settings, "-N 1" (default) and "-N 0". Overall it was very close to FLASH when I swept the -c parameter (match ratio cutoff) from 1.0 to 0.75 (the default) - slightly better with "-N 0" (but unstable) and slightly worse with "-N 1" (but stable). There are also -d and -B parameters that probably impact the merge rate, which I left at their defaults, so as with most of the other programs, it's still possible that better results could be achieved. Sample command:

./cope -a /dev/shm/r1.fq -b /dev/shm/r2.fq -o /dev/shm/cmerged.fq -2 /dev/shm/r1_unconnected.fq -3 /dev/shm/r2_unconnected.fq -m 0 -s 33 -c 1.0 -u 150 -N 0 1>cope.o 2>&1

I also added fastq-join (from ea-utils), which was generally outperformed by FLASH. I swept through the -p (from 0 to 25) and -m (from 4 to 12) parameters; only the -p sweep is shown, as it gave better results. Sample command:

fastq-join r1.fq r2.fq -o fqj -p 0

These were similar in speed - both were much faster than PEAR and slightly slower than BBMerge or FLASH, when those three were limited to a single thread. fastq-join is slightly faster than COPE.
Attached Images
File Type: png BBMerge_ROC_log.png (30.1 KB, 995 views)
File Type: png BBMerge_Speed3.png (23.2 KB, 981 views)
File Type: png BBMerge_ROC_linear.png (27.5 KB, 1003 views)

Last edited by Brian Bushnell; 06-06-2014 at 04:26 PM.
Old 06-17-2014, 12:10 AM   #9
woodydon
Member
 
Location: https://www.researchgate.net/profile/Woody_Lin

Join Date: Jan 2010
Posts: 52

Is there a list of parameters that I can use? For example, insert size and CPU threads. I tried to assemble a set of ~24 million paired-end reads and found that only 44.59% were assembled. Is there a way to increase that (e.g. the "loose" setting)? Also, the input reads were 100bp in length, yet the 10th percentile shows an insert size of 74. Why is that? And how can the insert range go as high as 188?

Quote:
Reads: 24316810
Joined: 10843877 44.594%
Ambiguous: 62051 0.255%
No Solution: 13397906 55.097%
Too Short: 12976 0.053%
Avg Insert: 132.7

Insert range: 26 - 188
90th percentile: 178
50th percentile: 140
10th percentile: 74

Bests,
Woody

Last edited by woodydon; 06-17-2014 at 12:24 AM.
Old 06-17-2014, 09:02 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Woody,

If you run the shellscript bbmerge.sh with no parameters, it will display the help information. To answer your specific questions:

By default it uses all available threads, but you can override that with the "t=" flag. You can get a higher merge rate at the expense of more false positives with the "loose" or "vloose" (very loose) flag. It's also possible to more finely tune it with other flags like "minoverlap", "margin", "entropy", and "mismatches", but those are more complicated to tweak. You can use the "mininsert=" flag to ban merges with an insert size shorter than that value.

2x100bp reads can have a 74bp insert or a 188bp insert size. 74bp insert means that the molecule being sequenced was shorter than read length, and as a result the data collection continued off the end of the genomic sequence and into the adapter. So, before merging, the reads each contained 26bp of junk. And a 188bp insert size means that the reads overlapped by 12 base pairs. BBMerge does not look for overlaps shorter than 12bp in the default mode; the shorter the overlap, the more likely that it's a false positive.
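(The same arithmetic in code form, just to illustrate the numbers above; nothing BBMerge-specific:)

Code:
# For 2x100bp reads: how much the reads overlap, and how much adapter each
# read picks up, for a given insert size.
def geometry(read_len, insert):
    if insert >= read_len:
        overlap = max(0, 2 * read_len - insert)  # genomic bases shared by the two reads
        adapter = 0
    else:
        overlap = insert                         # the whole molecule is covered by both reads
        adapter = read_len - insert              # adapter bases at the end of each read
    return overlap, adapter

print(geometry(100, 74))   # -> (74, 26): each read ends with 26bp of adapter
print(geometry(100, 188))  # -> (12, 0): the reads overlap by 12bp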

The higher the error rate of reads, the fewer successful joins there will be. If you have sufficient coverage (at least 30x average) you can try error-correcting the reads first with my error correction tool; that should increase the merge rate.

ecc.sh in1=r1.fq in2=r2.fq out1=corrected1.fq out2=corrected2.fq

-Brian
Old 06-17-2014, 04:37 PM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by jimmybee View Post
I've been using it over the last couple of weeks and it has been really quick. I'm actually using it to run adapter trimming (with small inserts, all of your reads should overlap), so adding some options for adapter matching would be handy (although I do know that you already have a trimming script in the BBTools suite).
jimmybee,

I added some relevant flags and modes, in version 32.32. Thanks for the suggestions!

First, BBMerge now has a new flag, "tbo" (trimbyoverlap). Rather than joining reads, it will just trim the adapter sequence off the ends of reads that have an apparent insert size shorter than read length.

Second, I added overlap detection to BBDuk as well, with the same flag, "tbo". So if you are trimming adapters from fragment libraries with the normal read orientation, you can trim both by adapter sequence (specifying a reference file) and by overlap at the same time, which drastically reduces the false-negative rate (reads that did not have adapters completely removed), by around a factor of 12. I highly recommend using both for fragment libraries, as the combination outperforms either one alone:

bbduk.sh in=reads.fq out=trimmed.fq tbo ktrim=r ref=truseq.fa k=25 mink=1 hdist=1

The 'tbo' flag is NOT recommended for long mate pair libraries which may have the adapters in totally different places and are not expected to overlap, which is why it is disabled by default.

-Brian
Old 06-17-2014, 10:48 PM   #12
woodydon
Member
 
Location: https://www.researchgate.net/profile/Woody_Lin

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by Brian Bushnell View Post
Woody,

2x100bp reads can have a 74bp insert or a 188bp insert size. 74bp insert means that the molecule being sequenced was shorter than read length, and as a result the data collection continued off the end of the genomic sequence and into the adapter. So, before merging, the reads each contained 26bp of junk. And a 188bp insert size means that the reads overlapped by 12 base pairs. BBMerge does not look for overlaps shorter than 12bp in the default mode; the shorter the overlap, the more likely that it's a false positive.

-Brian
Thanks for the reply. It turns out that I confused your "insert size" with the "inner distance" of paired-end reads. Since the ~25 million reads were exome-seq data, a 44.594% assembly rate is a bit lower than what I expected. If the assembly is perfect, it means 50% of the fragments were actually longer than 188. I hope you can improve handling of fragments whose "insert size" is longer than 188 in the future.
Old 06-18-2014, 08:24 AM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by woodydon View Post
Thanks for the reply. It turns out that I confused your "insert size" with the "inner distance" of paired-end reads. Since the ~25 million reads were exome-seq data, a 44.594% assembly rate is a bit lower than what I expected. If the assembly is perfect, it means 50% of the fragments were actually longer than 188. I hope you can improve handling of fragments whose "insert size" is longer than 188 in the future.
Woody,

Did you try error-correcting and using the 'vloose' setting? Both of those can substantially improve the merge rate.

If you want to see what the actual, unbiased insert-size distribution of your data looks like, you can generate a histogram by mapping:

bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fa ihist=ihist.txt nodisk

Then plot a graph of 'ihist.txt', so you can see how many reads should be expected to assemble.
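(A minimal way to plot that histogram with Python and matplotlib; this assumes ihist.txt has '#'-prefixed header lines followed by two tab-separated columns, insert size and count, so check your file first:)

Code:
# Minimal plot of a BBMap insert-size histogram (assumes '#'-prefixed header
# lines followed by two tab-separated columns: insert size, count).
import matplotlib.pyplot as plt

sizes, counts = [], []
with open("ihist.txt") as fh:
    for line in fh:
        if line.startswith("#") or not line.strip():
            continue
        size, count = line.split()[:2]
        sizes.append(int(size))
        counts.append(int(count))

plt.plot(sizes, counts)
plt.xlabel("Insert size (bp)")
plt.ylabel("Read pairs")
plt.savefig("ihist.png")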
Old 06-18-2014, 06:10 PM   #14
woodydon
Member
 
Location: https://www.researchgate.net/profile/Woody_Lin

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by Brian Bushnell View Post
Woody,

Did you try error-correcting and using the 'vloose' setting? Both of those can substantially improve the merge rate.

If you want to see what the actual, unbiased insert-size distribution of your data looks like, you can generate a histogram by mapping:

bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fa ihist=ihist.txt nodisk

Then plot a graph of 'ihist.txt', so you can see how many reads should be expected to assemble.
Great. That's a neat option.
Old 06-22-2014, 09:31 PM   #15
woodydon
Member
 
Location: https://www.researchgate.net/profile/Woody_Lin

Join Date: Jan 2010
Posts: 52

Hi Brian,

I am using a computing cluster to run BBMerge and found the following error:

Quote:
java -ea -Xmx48g -cp /home/xxx/tools/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx48g in1=/home/xxx/reads/n_R1.fastq in2=/home/xxx/reads/n_R2.fastq ref=/home/xxx/ref/hg19c/genome_hg19.fa ihist=/home/xxx/bbmap/n_ihist.txt out=/home/xxx/bbmap/n_merged.fq
/home/xxx/tools/bbmap/bbmap.sh: line 176: java: command not found
my command was:
Quote:
/home/xxx/tools/bbmap/bbmap.sh -Xmx48g threads=10 in1=/home/xxx/reads/n_R1.fastq in2=/home/xxx/reads/n_R2.fastq ref=/home/xxx/ref/hg19c/genome_hg19.fa ihist=/home/xxx/bbmap/n_ihist.txt out=/home/xxx/bbmap/n_merged.fq

Thanks,
Woody

Last edited by woodydon; 06-22-2014 at 09:36 PM.
Old 06-22-2014, 10:00 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Woody,

It sounds like Java is not installed on your cluster. Try running this command:

java -version

It should print some kind of message, like this:

java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
Java HotSpot(TM) Client VM (build 10.0-b23, mixed mode)


If it just says "command not found" then java is not installed. You should contact the system administrator, or download it from here:

http://www.oracle.com/technetwork/ja...s-2133155.html

...and put 'java' in your path. Or explicitly change the path of 'java' in my shellscript to wherever you install it.
Old 06-22-2014, 10:16 PM   #17
woodydon
Member
 
Location: https://www.researchgate.net/profile/Woody_Lin

Join Date: Jan 2010
Posts: 52

Hi Brian,

Thank you for the quick reply. It was indeed a problem of java not being found. I have consulted the administrator about this, since I have used java before without a problem.

Sincerely,
Woody
Old 06-22-2014, 11:29 PM   #18
woodydon
Member
 
Location: https://www.researchgate.net/profile/Woody_Lin

Join Date: Jan 2010
Posts: 52

Quote:
Originally Posted by Brian Bushnell View Post
Woody,

Did you try error-correcting and using the 'vloose' setting? Both of those can substantially improve the merge rate.

If you want to see what the actual, unbiased insert-size distribution of your data looks like, you can generate a histogram by mapping:

bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fa ihist=ihist.txt nodisk

Then plot a graph of 'ihist.txt', so you can see how many reads should be expected to assemble.
I can't find an error-correcting option in bbmerge.sh. Also, in bbmap.sh, "out=" seems to output reads instead of SAM files. Could you please give me a hint? Thanks!

Last edited by woodydon; 06-22-2014 at 11:33 PM.
Old 06-23-2014, 03:41 AM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,489

Quote:
Originally Posted by woodydon View Post
I can't find an error-correcting option in bbmerge.sh. Also, in bbmap.sh, "out=" seems to output reads instead of SAM files. Could you please give me a hint? Thanks!
With bbmap.sh, out=filename.sam should produce a SAM file (http://seqanswers.com/forums/showpos...69&postcount=1).
Old 06-23-2014, 08:18 AM   #20
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695

Quote:
Originally Posted by woodydon View Post
I can't find an error-correcting option in bbmerge.sh. Also, in bbmap.sh, "out=" seems to output reads instead of SAM files. Could you please give me a hint? Thanks!
Woody,

bbmerge does not do error-correction; for that, you need to use the script "ecc.sh", e.g.

ecc.sh in=reads.fq out=corrected.fq

And as Genomax implied, all of my programs are sensitive to file name extensions. If you say "out=reads.fasta" you'll get fasta; if you say "out=reads.sam.gz", you'll get gzipped sam, and so forth.
Tags
bbmap, bbmerge, bbtools, flash, pair end
