Hi all,
If you work with large genomes and large sets of short reads, please take a look at Crossbow (http://bowtie-bio.sf.net/crossbow), an open-source pipeline that uses cloud computing for whole-genome SNP discovery from short reads. Crossbow combines Bowtie (for alignment) and SoapSNP (for SNP calling) under the umbrella of Hadoop. Hadoop handles all data movement and large distributed sorts (e.g. between alignment and SNP calling), and provides storage redundancy and fault tolerance. In our experiments, Crossbow aligns Illumina reads and calls accurate SNPs (99% concordance with a BeadChip assay) from over 35x coverage of a human genome in one day on a 10-node local cluster, or in 3 hours for about $100 on a 40-node, 320-core Hadoop cluster rented from Amazon's EC2 utility computing service.
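For readers new to the map/sort/reduce pattern, here is a toy, single-machine sketch of the flow described above: align reads, sort the aligned bases by position, then call SNPs per position. It is not Crossbow's code and does not use Bowtie, SoapSNP, or Hadoop. The reference string, function names, mismatch limit, and depth threshold are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy reference; a real run would use a whole genome.
REFERENCE = "ACGTACGTTAGC"


def align(read, reference, max_mismatches=1):
    """Stand-in for the aligner: best offset with few mismatches, or None."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return None if best is None else best[0]


def map_reads(reads, reference):
    """Map step: emit one (position, base) pair per aligned base."""
    for read in reads:
        offset = align(read, reference)
        if offset is None:
            continue  # unaligned reads are dropped in this toy version
        for i, base in enumerate(read):
            yield offset + i, base


def shuffle_sort(pairs):
    """Stand-in for the distributed sort: group bases by position, in order."""
    grouped = defaultdict(list)
    for position, base in pairs:
        grouped[position].append(base)
    return sorted(grouped.items())


def call_snps(grouped, reference, min_depth=2):
    """Reduce step: report positions whose majority base differs from the reference."""
    snps = []
    for position, bases in grouped:
        if len(bases) < min_depth:
            continue  # too little coverage to call anything
        consensus = max(sorted(set(bases)), key=bases.count)
        if consensus != reference[position]:
            snps.append((position, reference[position], consensus))
    return snps


def run_pipeline(reads, reference=REFERENCE):
    return call_snps(shuffle_sort(map_reads(reads, reference)), reference)


if __name__ == "__main__":
    # Two reads carry a T->C change at position 8; one read matches exactly.
    print(run_pipeline(["CGTCAG", "GTCAGC", "ACGTAC"]))
```

The point of the sort in the middle is that every base covering a given position ends up together, so the SNP-calling step can look at one position at a time without knowing anything about the reads it came from.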
Crossbow is distributed with driver scripts for running either on a local cluster or on a cluster rented through Amazon EC2. Crossbow also includes scripts that automatically preprocess and copy large datasets into Amazon S3. Both EC2 and S3 are accessible to anyone with an AWS account (and a credit card), giving the user full control over computers and storage rented over the Internet on a pay-as-you-go basis.
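For a rough sense of what pay-as-you-go means here, the short calculation below works backwards from the figures quoted in this post (40 nodes, 320 cores, 3 hours, about $100). The results are implied averages only, not Amazon's published prices, and "about $100" may cover more than compute time. The function name is made up for illustration.

```python
def implied_rates(nodes, cores, hours, total_cost_usd):
    """Back-of-the-envelope averages derived from a quoted cluster run."""
    node_hours = nodes * hours
    return {
        "cores_per_node": cores / nodes,
        "node_hours": node_hours,
        "usd_per_node_hour": total_cost_usd / node_hours,
        "usd_per_core_hour": total_cost_usd / (cores * hours),
    }


if __name__ == "__main__":
    # Figures taken from the post; the cost is approximate.
    for name, value in implied_rates(40, 320, 3, 100).items():
        print(f"{name}: {value:.2f}")
```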
As of this posting, Crossbow is preliminary software (witness: the version number starts with a 0), though we are actively maintaining and extending it.
If you're looking for how to get started, first read through the "Checklist for Preparing to Run on Amazon Web Services" in the MANUAL file, then read through the TUTORIAL (which currently just points to the C. elegans example).
Crossbow is written by me (Ben Langmead, Johns Hopkins University) and Michael C. Schatz (University of Maryland).
Thanks!
Ben and Mike