ABySS

Application data

Created by: Simpson J, Jackman S, Birol I
Biological application domain(s): De-novo assembly
Principal bioinformatics method(s): Assembly, De Bruijn graph
Technology: Illumina, 454, ABI SOLiD, Sanger
Created at: Canada's Michael Smith Genome Sciences Centre
Maintained: Yes
Input format(s): FASTA, FASTQ, QSEQ, SAM, BAM
Output format(s): FASTA, Graphviz dot
Software features: MPI, OpenMP
Programming language(s): C++
Software libraries: Boost
Licence: Commercial, Freeware
Operating system(s): POSIX, Linux, Mac OS X

Summary: ABySS is a de novo sequence assembler designed for short reads and large genomes.

The single-processor version is useful for assembling genomes up to 100 Mbp in size. The parallel version is implemented using MPI and is capable of assembling mammalian-sized genomes.

The output of ABySS is a set of contigs assembled from the input short reads. The FASTA header of each contig has the following format:

>n iii jjjj

Where n is the numeric contig ID, iii is the contig length in nucleotides, and jjjj is the absolute k-mer coverage. A fourth column will be present for paired-end contigs (scaffolds) and is the list of single-end contig IDs that compose that paired-end contig (scaffold). If you suspect a misassembly, it can be informative to look at the single-end contigs.
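
As a minimal sketch of working with these headers, the following shell function lists the ID, length and k-mer coverage of each contig, optionally keeping only contigs above a minimum length. The function name contig_table and its arguments are illustrative, not part of ABySS; it assumes the space-separated header layout described above and ignores any fourth column.

```shell
# Print "ID<TAB>length<TAB>k-mer coverage" for every contig that is at
# least min_len nucleotides long. Assumes ">n iii jjjj" headers.
contig_table() {
    # $1: contigs FASTA file, $2: minimum contig length (default 0)
    awk -v min_len="${2:-0}" '
        /^>/ {
            id = substr($1, 2)              # strip the leading ">"
            if ($2 + 0 >= min_len + 0)      # compare lengths numerically
                printf "%s\t%s\t%s\n", id, $2, $3
        }' "$1"
}
```

For example, contig_table contigs.fa 100 would list only the contigs of 100 nucleotides or more.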

ABySS README

ABySS - assemble short reads into contigs

Compiling ABySS

Compiling ABySS should be as easy as

./configure && make

To install ABySS in a specified directory

./configure --prefix=/opt/ABySS && make && sudo make install

If you wish to build the parallel assembler with MPI support, MPI should be found in /usr/include and /usr/lib or its location specified to configure:

./configure --with-mpi=/usr/lib/openmpi && make

ABySS should be built using Google sparsehash to reduce memory usage, although it will build without. Google sparsehash should be found in /usr/include or its location specified to configure:

./configure CPPFLAGS=-I/usr/local/include

The default maximum k-mer size is 64. This limit is set at compile time, and may be decreased to reduce memory usage or increased to allow larger values of k:

./configure --enable-maxk=96 && make

To run ABySS, its binaries should be found in your PATH.


Single-end assembly

Assemble short reads in a file named reads.fa into contigs in a file named contigs.fa with the following command:

ABYSS -k25 reads.fa -o contigs.fa

where -k is an appropriate k-mer length. The only method to find the optimal value of k is to run multiple trials and inspect the results. The following shell snippet will assemble for every value of k from 20 to 40.

for k in {20..40}; do
    ABYSS -k$k reads.fa -o contigs-k$k.fa
done
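
One way to compare the trial assemblies is by their N50. The sketch below computes it from the contig lengths in the FASTA headers; the function name n50 is illustrative and not an ABySS tool, and N50 is only one of several measures worth inspecting.

```shell
# Print the N50 of an assembly: the length of the contig at which the
# running total of lengths, longest first, reaches half the assembly size.
n50() {
    # $1: contigs FASTA file with the contig length in header field 2
    awk '/^>/ { print $2 }' "$1" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run * 2 >= total) { print len[i]; exit }
            }
        }'
}
```

Calling n50 contigs-k$k.fa for each value of k in the loop above gives one number per trial to compare.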

The maximum value for k is 64. This limit may be changed at compile time using the --enable-maxk option of configure. It may be decreased to 32 to decrease memory usage, which is particularly useful for large parallel jobs, or increased to 96.


Paired-end assembly

To assemble paired short reads in two files named reads1.fa and reads2.fa into contigs in a file named ecoli-contigs.fa, run the command:

abyss-pe k=25 n=10 in='reads1.fa reads2.fa' name=ecoli

where k is the k-mer length as before. n is the minimum number of pairs needed to consider joining two contigs. The optimal value for n must be found by trial. in specifies the input files to read, which may be in FASTA, FASTQ, qseq or export format and compressed with gz, bz2 or xz. The assembled contigs will be stored in ${name}-contigs.fa.

The suffix of the read identifier for a pair of reads must be one of '1' and '2', or 'A' and 'B', or 'F' and 'R', or 'F3' and 'R3', or 'forward' and 'reverse'. The reads may be interleaved in the same file or found in different files; however, interleaved mates will use less memory.
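
If the mates are in two separate files, they can be interleaved before assembly. The following is a minimal sketch, assuming each read occupies exactly two lines (a header line and a single sequence line) and that both files list the reads in the same order; the function name interleave_fasta is illustrative and not part of ABySS.

```shell
# Interleave two FASTA files of mates, two lines per read, writing
# read 1 of each pair followed by read 2 to standard output.
interleave_fasta() {
    # $1: file of first mates, $2: file of second mates
    awk -v mates="$2" '
        {
            head1 = $0                                   # header of read 1
            if ((getline seq1) <= 0) exit 1              # sequence of read 1
            if ((getline head2 < mates) <= 0) exit 1     # header of read 2
            if ((getline seq2 < mates) <= 0) exit 1      # sequence of read 2
            print head1; print seq1; print head2; print seq2
        }' "$1"
}
```

For example, interleave_fasta reads1.fa reads2.fa > reads.fa produces a single file that can be given to the in parameter.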

abyss-pe is a driver script implemented as a Makefile and runs a single-end assembly, as described above, and the following commands, which must be found in your PATH:

  • ABYSS - the single-end assembler
  • AdjList - finds overlaps of length k-1 between contigs
  • KAligner** - aligns reads to contigs
  • ParseAligns** - finds pairs of reads in alignments
  • DistanceEst** - estimates distances between contigs
  • Overlap - finds overlaps between blunt contigs
  • SimpleGraph - finds paths between pairs of contigs
  • MergePaths - merges consistent paths
  • Consensus - for a colour-space assembly, convert the colour-space contigs to nucleotide contigs

** These steps can be run in parallel (see below)
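
Because abyss-pe expects all of the commands listed above to be in your PATH, it can help to check for them before starting a long run. This is a small sketch; the function name missing_tools and the variable abyss_pe_tools are illustrative, and the list simply repeats the commands named above.

```shell
# Print the name of each command that cannot be found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The commands that abyss-pe runs, as listed above.
abyss_pe_tools="ABYSS AdjList KAligner ParseAligns DistanceEst Overlap SimpleGraph MergePaths Consensus"
```

Running missing_tools $abyss_pe_tools prints nothing when every command is available.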

Paired-end assembly of multiple fragment libraries

The distribution of fragment sizes of each library is calculated empirically by aligning paired reads to the contigs produced by the single-end assembler, and the distribution is stored in a file with the extension .hist, such as ecoli-4.hist. The N50 of the single-end assembly must be well above the fragment size to obtain an accurate empirical distribution.

Here's an example scenario of assembling a data set with two different fragment libraries and single-end reads:

Library lib1 has reads in two files, lib1_1.fa and lib1_2.fa. Library lib2 has reads in two files, lib2_1.fa and lib2_2.fa. Single-end reads are stored in two files se1.fa and se2.fa.

The command line to assemble this example data set is:

abyss-pe -j2 k=25 n=10 name=ecoli lib='lib1 lib2' \
    lib1='lib1_1.fa lib1_2.fa' lib2='lib2_1.fa lib2_2.fa' \
    se='se1.fa se2.fa'

The paired-end assembly of lib1 and lib2 may be run in parallel by specifying the -j option of make to abyss-pe, which is implemented as a Makefile script. The -j option should be set to the number of libraries, but setting it higher will not cause any trouble.
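
With many libraries the command line becomes long, so it can be generated instead of typed. The sketch below only prints the command for review and does not run it. It assumes, purely for illustration, that each library's reads are in files named ${lib}_1.fa and ${lib}_2.fa as in the example above; the function name abyss_pe_command is hypothetical.

```shell
# Print an abyss-pe command line for the given libraries, with -j set
# to the number of libraries. Nothing is executed.
abyss_pe_command() {
    # $1: k, $2: n, $3: name, remaining arguments: library names
    k=$1; n=$2; name=$3
    shift 3
    cmd="abyss-pe -j$# k=$k n=$n name=$name lib='$*'"
    for lib in "$@"; do
        # Assumed file naming: <lib>_1.fa and <lib>_2.fa
        cmd="$cmd $lib='${lib}_1.fa ${lib}_2.fa'"
    done
    printf '%s\n' "$cmd"
}
```

For example, abyss_pe_command 25 10 ecoli lib1 lib2 prints the paired-end part of the command shown above, to which the se parameter can be appended.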

The empirical distribution of fragment sizes will be stored in two files named lib1-3.hist and lib2-3.hist. These files may be plotted to check that the empirical distribution agrees with the expected distribution. The assembled contigs will be stored in ${name}-contigs.fa.
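
Short of plotting, a quick check is to compare the mean of the empirical distribution with the expected fragment size. The sketch below assumes that each line of a .hist file holds a fragment size and a count separated by whitespace; that layout is an assumption to verify against your own files, and the function name hist_mean is illustrative.

```shell
# Print the count-weighted mean fragment size of a histogram file.
# Assumed line format: <fragment size> <count>
hist_mean() {
    awk '{ sum += $1 * $2; n += $2 }
         END { if (n > 0) printf "%.1f\n", sum / n }' "$1"
}
```

For example, hist_mean lib1-3.hist prints a single number to compare with the nominal fragment size of lib1.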

Reads without mates should be placed in a file specified by the `se' (single-end) parameter. Reads without mates in the paired-end files will slow down the paired-end assembler considerably during the ParseAligns stage.


Parallel assembly

The `np' option of abyss-pe specifies the number of processes to use for the ABYSS-P parallel MPI job. Without any MPI configuration, this will allow you to make use of multiple cores on a single machine. To use multiple machines for assembly, you must create a hostfile for mpirun, which is described in the mpirun man page.

The paired-end assembly runs on a single processor. For very large jobs, a good portion of the paired-end assembly (KAligner, ParseAligns, DistanceEst) may be run in parallel as separate processes, but this is not automated by the driver script abyss-pe.

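Open MPI integrates well with SGE (Sun Grid Engine). When run under SGE, abyss-pe takes the value of k from the array task ID and the number of processes from the number of allocated slots. For example, to submit an array of jobs to assemble every odd value of k between 51 and 63 using 64 processes for each job: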

qsub -pe openmpi 64 -t 51-63:2 -N testing abyss-pe in=reads.fa n=10

For more information on using SGE and qsub, please refer to the qsub manual page. Open MPI must have been compiled with support for SGE using the ./configure --with-sge option.

NOTE: Univa's fork of SGE/OGE requires the qsub option '-b' to be set to yes. This lets qsub know that the command (abyss-pe) is a binary file. Without this option, qsub tries to run abyss-pe as a script file.


References

  1. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. 2009. Genome Research
  2. Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJ. 2009. Bioinformatics

