SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
meta-velvet returns nodes instead of contigs in assembly? deprekate Bioinformatics 2 10-25-2012 01:53 PM
PubMed: Knime4Bio: a set of custom nodes for the interpretation of next-generation se Newsbot! Literature Watch 0 03-14-2012 02:20 AM
v3: Effect of high cluster densities on cluster PF and %Q30 pmiguel Illumina/Solexa 3 10-05-2011 05:36 AM
Are there any alignment programs that take SNPs into account during alignment? sdarko Bioinformatics 2 06-04-2011 05:09 AM
gap alignment and local alignment? mingkunli Illumina/Solexa 3 02-19-2009 11:13 AM

Reply
 
Thread Tools
Old 06-21-2013, 03:48 AM   #1
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default Bison: BISlfite alignment On Nodes of a cluster

Greetings all,

I'd like to announce the general availability of a program that I've recently written called Bison (BISulfite alignment On Nodes of a cluster), which is intended for those who need to align bisulfite-converted reads and have access to a computer cluster. Bison is quite similar to Bismark (I'm a former Bismark user and wrote Bison to get my alignments sooner), with major differences as follows:
  • Bison is often 5-10x faster, due largely to the fact that it allocates individual cluster nodes to aligning reads to each strand. It combines information from the “aligner nodes” on a separate “master node”.
  • Bison uses the samtools C API to output alignments directly to BAM format, thereby saving space and disk I/O.
  • Bison is written purely in C, results in a bit more of a speed gain.
  • Bison also decides upon the correct alignment in a slightly different way than Bismark, resulting in fewer misaligned reads (0.02-0.03% versus ~0.6% for Bismark).
  • Bison requires only enough RAM for a single instance of bowtie2, as opposed to enough for 2-4 instances.
Otherwise, Bison will be quite familiar to those of you already accustomed to using Bismark. Both directional and non-directional libraries are supported. Bowtie2 is used for alignment on each of the nodes. Both paired-end and single-end libraries are supported. A methylation extractor is included that outputs into bedGraph format (if people would like a different format or different information, just ask). For those doing RRBS, I should note that the methylation extractor can be told that you are doing RRBS (currently MspI and TaqI digested libraries are supported) and it will then ignore methylation calls of bases added experimentally during fragment end-repair (this avoids needing to trim them off prior to alignment).

I should note that Bison does not currently support color-space reads, as I've never actually had any. Further, it is generally less flexible than Bismark, so I encourage users interested in Bison to try some test data to see if Bison meets their needs.

Bison source code and directions for compilation and usage are available via sourceforge. Samtools and Bowtie2 are required for installation. Likewise, MPI is required for compilation and usage, though you can run Bison on a single computer if desired. If people run into installation or usage problems, please feel free to post in this thread or submit a ticket on sourceforge.

Devon Ryan
dpryan is offline   Reply With Quote
Old 10-17-2013, 07:03 AM   #2
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default version 0.1.1 is now available

I've finally gotten around to updating the version of bison on sourceforge (now 0.1.1) so it's current with my local version. Changes are as follows:
  • Fixed a number of minor bugs.
  • Added support for uncompressed fastq files, as well as bzipped files (previously, only gzipped fastq files worked properly).
  • --score-min is now parsed by bison prior to being sent to bowtie2, read MAPQ scores are recalculated accordingly by the same algorithm used by bowtie2 (N.B., this only bears a vague correspondence to -10*log10(probability the mapping position is wrong)!).
  • Added a bison_mbias function, to process the aligned BAM file and create a text file containing the percentage of methylated C's as a function of read position. For the utility of this, see: Hansen KD, Langmead B and Irizarry RA, BSmooth: from whole genome bisulfite sequencing to differentially methylated regions. Genome Biol 2012; 13(10):R83.
  • The methylation extractor now accepts the -q options, which sets the MAPQ threshold for a read to be included in the methylation results. The default is a minimum MAPQ of 20, which seems to be a reasonable threshold from a few simulations.
  • In DEBUG mode, the output BAM files used to have fixed names. This was a problem in cases where debugging was being performed on multiple input files. Now, the OT/OB/CTOT/CTOB.bam filename is prepended with an appropriate prefix (extracted from the input file name). In addition, the output directory is now respected in DEBUG mode.
  • Included an "auxiliary" directory, that includes functions for making an RRBS genome and other possibly useful functions.

Unless some bugs crop up, I expect the next release will support a semi-arbitrary number of nodes. As is, only 3 or 5 nodes can be used at a time. This is fine for me, since I'm usually processing a number of samples in parallel and our cluster is relatively small. I can easily envision others finding more nodes useful. I have a few implementation ideas for this.
dpryan is offline   Reply With Quote
Old 12-05-2013, 04:01 AM   #3
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default Version 0.2.0 now available

I've just posted version 0.2.0 to sourceforge. This big change in this release is the inclusion of bison_herd, which can use a semi-arbitrary number of nodes (e.g., I'm using 17 nodes to simultaneously align some samples at the moment). bison_herd can also accept a list of input files and will write each of their alignments to separate files. This is useful when you have a number of samples and want to skip the overhead of loading the bowtie2 index and the genome into memory more than once. bison_herd also skips writing in-silico converted reads to a file, further increasing performance. Other changes:
  • Added a note to the methylation summary statistics output at the end of a run that the numbers will include double counting of any site covered by both mates in a pair. These metrics are only meant for general information and not further analysis, so I don't consider that a bug (it's actually a design decision for the sake of performance).
  • --ignore-quals is no longer passed to bowtie2 by default. Specifying this will marginally decrease both correct and incorrect alignments. It will also generally decrease the alignment rate.
  • Fixed --unmapped, which are now written to the directory specified by -o
  • --maxins was already 500 by default, so it is no longer set by default.
  • The methylation extractor now has a -phred option, to exclude methylation calls from low confidence base-calls. The default threshold is 20.
  • Added a script to convert bedGraph files to a format suitable for BSseq.
  • Fixed a bug in bison_merge_CpGs

The only thing left on my "To Do" list is to add support for filename globbing (e.g. sample_*_1.fastq.gz and sample_*_2.fastq.gz) to make feeding bison_herd (and the auxiliary scripts) with multiple files easier.
dpryan is offline   Reply With Quote
Old 12-05-2013, 10:26 AM   #4
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 68
Default

Hi Devon, this seems very interesting. I have been having trouble with the amount of plain-text output from Bismark.

this seems to require exactly mpich, openmpi doesn't seem to work, is that correct? Once I get past that, I get:

Code:
mpicc -c -Wall -O3   main.c -o main.o
mpicc -Wall -O3  aux.o fastq.o genome.o slurp.o master.o common.o MPI_packing.o worker.o main.o -o bison -L/home/brentp/src/samtools  -lm -lpthread -lmpich -lmpl -lbam -lz
aux.o: In function `quit':
aux.c:(.text+0xcf2): undefined reference to `ompi_mpi_comm_world'
aux.o: In function `effective_nodes':
aux.c:(.text+0xef5): undefined reference to `ompi_mpi_comm_world'
slurp.o: In function `slurp':
slurp.c:(.text+0x1f9): undefined reference to `ompi_mpi_comm_world'
slurp.c:(.text+0x1fe): undefined reference to `ompi_mpi_int'
slurp.c:(.text+0x23d): undefined reference to `ompi_mpi_comm_world'
slurp.c:(.text+0x24d): undefined reference to `ompi_mpi_int'
slurp.c:(.text+0x277): undefined reference to `ompi_mpi_comm_world'
slurp.c:(.text+0x287): undefined reference to `ompi_mpi_byte'
slurp.c:(.text+0x347): undefined reference to `ompi_mpi_byte'
slurp.c:(.text+0x34d): undefined reference to `ompi_mpi_comm_world'
slurp.c:(.text+0x3a6): undefined reference to `ompi_mpi_comm_world'
slurp.c:(.text+0x3c3): undefined reference to `ompi_mpi_byte'
slurp.c:(.text+0x3ff): undefined reference to `ompi_mpi_byte'
slurp.c:(.text+0x405): undefined reference to `ompi_mpi_comm_world'
worker.o: In function `worker_node':
worker.c:(.text+0x1c2): undefined reference to `ompi_mpi_comm_world'
worker.c:(.text+0x1cc): undefined reference to `ompi_mpi_int'
worker.c:(.text+0x390): undefined reference to `ompi_mpi_comm_world'
worker.c:(.text+0x39b): undefined reference to `ompi_mpi_byte'
worker.c:(.text+0x3dd): undefined reference to `ompi_mpi_comm_world'
worker.c:(.text+0x3ea): undefined reference to `ompi_mpi_byte'
worker.c:(.text+0x453): undefined reference to `ompi_mpi_comm_world'
worker.c:(.text+0x45e): undefined reference to `ompi_mpi_byte'
worker.c:(.text+0x5be): undefined reference to `ompi_mpi_comm_world'
worker.c:(.text+0x5c9): undefined reference to `ompi_mpi_int'
worker.c:(.text+0x5e3): undefined reference to `ompi_mpi_comm_world'
worker.c:(.text+0x5ee): undefined reference to `ompi_mpi_byte'
collect2: ld returned 1 exit status
make: *** [align] Error 1
How can I get past that error?
brentp is offline   Reply With Quote
Old 12-05-2013, 11:21 AM   #5
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

Hi brentp, both should work fine. Try removing "-lmpich -lmpl" and replacing that with "-lmpi". I just replaced mpich2 with open-mpi and that then compiled (i'll add this to my list of things to fix). Of course it now segfaults for me . I'll have to play around with openMPI since I guess I've never used that. Let me know if this works for you, since that'll just mean that openMPI isn't completely installed on my system.

BTW, I think the most recent version of bismark has an option to output directly to BAM (it just pipes to samtools).
dpryan is offline   Reply With Quote
Old 12-05-2013, 11:43 AM   #6
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

Ah, the ensuing segfault was just due to some mpich2 headers still being found by mpicc (mixing mpich2 and openmpi doesn't work well). Deleting those such that only openmpi headers were being used solved that. So, just changing the -l option should solve the problem for you. Let me know if you run into any other issues. I would like to get as many of the bugs ironed out as possible.
dpryan is offline   Reply With Quote
Old 12-05-2013, 12:47 PM   #7
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 68
Default

I had to add -lpthread to bias and methylation extractor but it installed fine with openmpi after that.

Thanks.
brentp is offline   Reply With Quote
Old 01-02-2014, 01:52 AM   #8
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

I've just posted version 0.2.1, which contains a number of bug fixes and a few feature enhancements, to to sourceforge. The changes were as follows:
  • Added support for file globbing in bison_herd. You may now input multiple files using a combination of wild-cards (*, ?, etc.) and commas. Remember to put these in quotes (e.g., "foo/*1.fq.gz","bar/*1.fq.gz") so the shell doesn't perform the expansion!). As before, specifying multiple inputs with the same file name (e.g., sample1/reads.fq,sample2/reads.fq) will cause the output from the first reads.fq alignment to be over-written by the second.
  • Fixed the text output, since "unique alignments" isn't really correct, given that alignments with scores of 0 or 1 can be output but aren't unique.
  • Added information in the Makefile and above about compiling with openmpi.
  • Fixed a bug in bison_herd wherein the -upto option wasn't being handled properly. -upto now accepts an unsigned long in bison_herd.
  • Fixed a bug in bison_herd when paired-end reads were used. This was due to how bowtie2 reads from FIFOs. Changing how things were written to the FIFOs on the worker nodes resolved the problem.
  • The bison_mbias program has been heavily revamped. It still outputs the number of methylated or unmethylated CpG calls per position, but now keeps the metrics for each strand (and read, when paired-end reads are used) separate. If R and the ggplot2 library are installed, the program can also run the bison_mbias2pdf program (see below).
  • Created an bison_mbias2pdf Rscript that will read in the output of bison_mbias and plot the results, indicating the region of each read that should be included in methylation extraction. This script also print these suggestions in the format used by bison_methylation_extractor, for convenience.
  • The methylation extractor can now be told to only include certain regions of each read in the output methylation metrics. This is needed when there is apparent bias in the methylation at one or both ends of a read.
  • Previously, the recalculated MAPQ was incorrect when only 1 read in a pair had a valid secondary alignment. This has been fixed.
  • Fixed another MAPQ recalculation bug, affecting reads with MAPQ 2 that have MAPQ=6.
  • Fixed a bug in writing unmapped reads.
  • Fixed a bug in bison_herd that allowed early termination without warning.

For those curious, I'm attaching a couple example M-bias plots generated by bison_mbias2pdf. The experiment was RRBS, so you can see the bias in the first and last 2 bases in many of the reads that need to be trimmed (this was a relatively early experiment, so the reads were generally not the best quality).



dpryan is offline   Reply With Quote
Old 01-08-2014, 01:36 AM   #9
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

I've posted a quick update, version 0.2.2
  • Properly fixed some wording on the textual output (i.e., removed the word "unique").
  • Lowered the default MAPQ and Phred thresholds used by the methylation extractor to 10 each. That the MAPQ threshold was originally 20 was an error on my part.
dpryan is offline   Reply With Quote
Old 01-16-2014, 03:03 AM   #10
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

I've upload version 0.2.3, which is mainly geared toward getting local alignment working properly (it worked before, but the methylation calls were completely off). My thanks to mvijayen in this thread for providing the impetus and some good example data to get this done.
  • Fix how hard and soft-clipped bases are dealt with (previously, soft-clipped bases resulted in an error and hard-clipped bases in incorrect position assignments!).
  • Multiple bug fixes related to local alignment, which previously didn't work correctly. These issues seem to generally now be resolved. May thanks to user mvijayen on seqanswers for providing a perfect usage example for testing (see thread http://seqanswers.com/forums/showthread.php?t=39914).
  • The maximum length of a single contig is now (2^64)-1 (instead of the previous 2^64). I don't think bowtie2 would even support something that long, but if it did then bison wouldn't (internally, a position of 2^64 means a base is inserted, soft, or hard-clipped).
  • A previously missing "*" caused Bison to use the entirety of the description line in the fasta file as the chromosome name. This caused errors since bowtie2 only uses every before the first space (the proper method). Bison now does the same.
  • A note about creating methylation-bias metrics with locally aligned reads is in order. If a read is soft-clipped, that portion is still included in the M-bias metrics. Likewise, if you pass -OT X,X,X,X or similar parameters to the methylation extractor, the soft-clipped area is also included in there.
  • Another note regarding local alignments is that the XX auxiliary tag (effectively the more verbose version of the MD tag) contains soft-clipped sequences. I could probably have these removed if someone would like.
dpryan is offline   Reply With Quote
Old 02-17-2014, 05:59 AM   #11
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

I just posted version 0.2.4, which fixes a silly error on my part and adds a simple markduplicates program:
  • Fixed an off-by-one error in bison_mbias.
  • Added bison_markduplicates, which, as the name implies, marks apparent PCR duplicates. The methylation extractor and m-bias calculator have also been updated to ignore marked duplicates.

The bison_markduplicates program uses the chromosome and both 5' and 3' bounds of both mates (if there are paired-end reads) as well as the strand to determine PCR duplicates.

I just found and fixed a few other bugs in bison_mbias (at some point it started swapping the methylated and unmethylated metrics). The bug wouldn't have caused a big issue previously, since the only purpose was for determining trimming bounds, but it was wrong none-the-less. Another bug was in bison_CpG_coverage, which wasn't handling unmerged bedGraph files properly before (merged files were fine). Sorry about those!

Last edited by dpryan; 02-17-2014 at 08:33 AM. Reason: More changes
dpryan is offline   Reply With Quote
Old 02-26-2014, 10:04 AM   #12
blam
Junior Member
 
Location: North Carolina

Join Date: Dec 2013
Posts: 2
Default Help with bison_index

Hi,

I'm interested in using Bison instead of Bismark for my Bis-seq analysis. I think I have everything installed correctly, but I'm having trouble with indexing the reference genome.

bison_index will not accept a directory, but will accept a .fa file.

If I try to index multiple .fa files using /directory_of_reference/*.fa it seems to accept the first .fa as input and the second .fa file to create an outputfile.

I've looked at bison_index -h which suggests comma separated .fa files, but still no luck.

Any suggestions about what I am doing incorrect? I'm using human assembly GRCh37 as my reference.
blam is offline   Reply With Quote
Old 02-26-2014, 10:43 AM   #13
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

Hi blam,

I have to admit that how I have bison_index do this is kind of silly. What would make more sense is, as you suggest, to just tell it what directory the fasta files are in and have it go from there, particularly since that's how all of the other bison tools work! I'll actually try to edit things to work that way tonight.

In the interim, using a comma-separated list should work (I just tested that and it at least works on my computer), keeping in mind that that means not including a space after the comma (this is also how bowtie2-build works and is done for purely logistic reasons). So, something like the following should work:

Code:
bison_index chr1.fa,chr10.fa,chr11.fa,chr12.fa
Yes, that's annoying and I will change it. Another possibility is to just:
Code:
cat *.fa > genome.fa
bison_index genome.fa
rm genome.fa
That's also not ideal, but should suffice while I make a better version. Let me know if you run into any other issues and I'll get them fixed.
dpryan is offline   Reply With Quote
Old 02-26-2014, 11:02 AM   #14
blam
Junior Member
 
Location: North Carolina

Join Date: Dec 2013
Posts: 2
Default bison_index THANKS

Thanks! The readme file made me think that bison_index took a directory. I am now indexing my reference with your suggestions above.
blam is offline   Reply With Quote
Old 02-26-2014, 11:04 AM   #15
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

Looks like I'll be fixing the README file as well then :P
dpryan is offline   Reply With Quote
Old 02-27-2014, 03:25 AM   #16
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default

@blam: A not so small correction. The only solution that is actually correct was to concatenate files together, but then only keep the resulting multi-fasta file (i.e. "cat *.fa > genome.fa" and then either move genome.fa to its own directory or delete the other .fa files). The other solution is not guaranteed to always work correctly. I'm finishing a new release that will fix this. In the new version, bison_index will in fact accept a directory of fasta files, rather than needing one to specify files individually (in fact, I will explicitly remove the ability for it to handle that since it can have unintended consequences).

The previous implementation could work incorrectly in cases where the input file list didn't match the order in which the files appeared in the directory entry, which can actually change over time. What that would mean is that the files could have been indexed in one order (e.g., chr1, then chr2, then chr3, ...) but then later read into memory in a different order (e.g., chr3, then chr1, then chr2, ...), which could cause all sorts of problems. This could only occur if you passed bison_index a list of files, rather than a single multi-fasta file. While I don't expect people to get bitten by this bug, it's very much possible and I consider it a major issue. I'm testing a fix and will upload a new version within the next couple hours.

For anyone who stores the genome in a single file, this won't be an issue for you. If, however, you store chromosomes/contigs in individual files, then I recommend deleting the current indices (just "rm -rf bisulfite_genome" in the directory with the fasta files) and reindexing. The version I'm testing will always process files in the same order, regardless of their order in the dirent structure on disk, so this problem will be resolved.

Last edited by dpryan; 02-27-2014 at 03:28 AM.
dpryan is offline   Reply With Quote
Old 02-27-2014, 04:27 AM   #17
dpryan
Devon Ryan
 
Location: Bonn, Germany

Join Date: Jul 2011
Posts: 2,018
Default v0.3.0

I've just release version 0.3.0, which should address the problem I mentioned in my last post as well as a few other small bugs. I should note that you can now track the development version(s) of bison on github. I have a few branches (some not yet on github), implementing discordant/mixed alignments and using the development version of samtools/htslib.
  • Note: The indices produced by previous versions are not guaranteed to be compatible unless you used a multi-fasta file. There was a serious implementation problem with how bison_index worked when given multiple files as input and how multiple files were read into memory in previous versions. If you used a multi-fasta file, then everything will continue to work correctly. However, if you used multiple fasta files in a list then I strongly encourage you to delete the previous indices (just remove the bisulfite_genome directory) and reindex. The technical reasons for this issue are that when the bison tools previously read multiple fasta files into memory, they would do so in whatever order they appeared in the directory structure, which can change over time and isn't guaranteed to match the order of files someone specified during indexing. While the alignments wouldn't be affected by this, the methylation calls could have been seriously compromised. In this version, bison_index will only accept a directory, not a list of files, and it will always alphasort() the list of files in that directory prior to processing. This should eliminate this problem. My apologies to anyone affected by this.
  • Added --genome-size option to a number of the tools. Many of the bison programs need to read the genome into memory. By default, 3 gigabases worth of memory are allocated for that and the size increased as needed. For smaller genomes, this wasted space. For larger genomes, the constant reallocation of space could seriously slow things down. Consequently, this option was added to any tool that reads the genome into memory. It's convenient to overestimate this slightly, so if your genome is 3.8 gigabases, then just use 4000000000 as the genome size.
  • bison_merge_CpGs can now take multiple input files at once.
  • A number of small bug fixes, such as when "genome_dir" doesn't end in a /.
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:41 PM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.