Hi,
we are planning to sequence a eukaryote without a reference genome, with a genome size of around 1 GB. We don't have the resources to do a deep sequencing of it, but are thinking about hitting something like 6x-20x coverage, depending on what we want to go for.
Our first problem is deciding which sequencing strategy we want. Basically, we're looking at either overlapping reads (sequence 100 nt from each end of a fragment of 180 bp to get an overlap of 20 bp, or 150 nt from each end of a fragment of 250 bp or thereabouts) or non-overlapping reads (sequence 100 nt or 150 nt from each end of a 300-500 bp fragment.)
The main argument for overlapping reads seems to be that we can use something like FLASH (http://bioinformatics.oxfordjournals...tr507.abstract, http://www.cbcb.umd.edu/software/flash/) to combine the paired read, and the basically get a "super"-read of 180-250 bp with higher per base quality than with to non-overlapping reads.
The main argument for non-overlapping reads seems to be that we can scaffold a bit, get a longer range contiguity than with overlapping reads and get a bit longer contigs too.
I've been testing a bit on the parrot dataset (http://assemblathon.org/pages/download-data) since that is one of the few datasets I've found that have both kinds of reads. I've simulated pseudo 100 nt reads by removing the first 20 bases, and the last 30 bases of the two datasets with 220 bp fragments and 500 bp fragments, resulting in 180 bp fragments with 20 bp overlap and 460 bp fragments without overlap. I've also done some testing with the original data (i.e. the 150 nt reads).
I've run CEGMA (http://nar.oxfordjournals.org/content/37/1/289.abstract, http://korflab.ucdavis.edu/Datasets/cegma/) to get a feel of how complete my assembly is in addition to the regular N50. I've mostly run SOAPdenovo since it is fast enough to run several different assemblies in a short time and I've run Celera on a some selected assemblies.
Some results:
(Complete CEGs are how many of the 248 genes CEGMA searches for that were found complete.)
Our main interest is the presence or absence of some specific genes. So the use of CEGMA might not give me the specific information we want, but I'm pretty sure that the more genes CEGMA finds, the better assembly I have and it's easier for us to we will find what we want. Just BLASTing the raw reads for some gene sequences would probably work at even lower coverage than 6x, but then we might not be able to reconstruct the whole gene structure (all exons and introns). If we will just BLAST, FLASHed reads might be better.
We will get 150 nt reads if it's not too much more expensive than 100 nt reads. That will in a way give us 50 % more coverage for "free".
So, what do you think? Is it better to go for overlapping reads or not? Do you have any thoughts on which assembler to use? (I've briefly looked at LOCAS, an assembler for low coverage sequencing, but it's not parallelized and might be a bit slow.)
Sincerely,
Ole
we are planning to sequence a eukaryote without a reference genome, with a genome size of around 1 GB. We don't have the resources to do a deep sequencing of it, but are thinking about hitting something like 6x-20x coverage, depending on what we want to go for.
Our first problem is deciding which sequencing strategy we want. Basically, we're looking at either overlapping reads (sequence 100 nt from each end of a fragment of 180 bp to get an overlap of 20 bp, or 150 nt from each end of a fragment of 250 bp or thereabouts) or non-overlapping reads (sequence 100 nt or 150 nt from each end of a 300-500 bp fragment.)
The main argument for overlapping reads seems to be that we can use something like FLASH (http://bioinformatics.oxfordjournals...tr507.abstract, http://www.cbcb.umd.edu/software/flash/) to combine the paired read, and the basically get a "super"-read of 180-250 bp with higher per base quality than with to non-overlapping reads.
The main argument for non-overlapping reads seems to be that we can scaffold a bit, get a longer range contiguity than with overlapping reads and get a bit longer contigs too.
I've been testing a bit on the parrot dataset (http://assemblathon.org/pages/download-data) since that is one of the few datasets I've found that have both kinds of reads. I've simulated pseudo 100 nt reads by removing the first 20 bases, and the last 30 bases of the two datasets with 220 bp fragments and 500 bp fragments, resulting in 180 bp fragments with 20 bp overlap and 460 bp fragments without overlap. I've also done some testing with the original data (i.e. the 150 nt reads).
I've run CEGMA (http://nar.oxfordjournals.org/content/37/1/289.abstract, http://korflab.ucdavis.edu/Datasets/cegma/) to get a feel of how complete my assembly is in addition to the regular N50. I've mostly run SOAPdenovo since it is fast enough to run several different assemblies in a short time and I've run Celera on a some selected assemblies.
Some results:
Code:
SOAPdenovo, 180 bp fragments, 10x coverage, k-mer of 31, 100 nt read length: Contig N50: 1618 bp, scaffold N50: 2001 bp, complete CEGs: 38, partial CEGs: 142 SOAPdenovo, 180 bp fragments, 8x coverage, k-mer of 31, 100 nt read length: Contig N50: 1618 bp, scaffold N50: 2001 bp, complete CEGs: 21, partial CEGs: 127 Celera, 180 bp fragments, 8x coverage, 100 nt read length: Contig N50: 2161 bp, scaffold N50: 2218 bp, complete CEGs: 28, partial CEGs: 114 SOAPdenovo, 180 bp fragments, 10x coverage, k-mer of 31, FLASHed 100 nt reads: Contig N50: 902 bp, scaffold N50: 903 bp, complete CEGs: 7, partial CEGs: 80 SOAPdenovo, 180 bp fragments, 8x coverage, k-mer of 31, FLASHed 100 nt reads: Contig N50: 965 bp, scaffold N50: 966 bp, complete CEGs: 11, partial CEGs: 88 Celera, 220 bp fragments, 10x coverage, FLASHed 150 nt reads: Contig N50: 4085 bp, scaffold N50: 4100 bp, complete CEGs: 66, partial CEGs: 177 SOAPdenovo, 460 bp fragments, 10x coverage, k-mer of 31, 100 nt read length: Contig N50: 1998 bp, scaffold N50: 12822 bp, complete CEGs: 92, partial CEGs: 189 Celera, 460 bp fragments, 10x coverage, 100 nt read length: Contig N50: 3853 bp, scaffold N50: 15641 bp, complete CEGs: 87, partial CEGs: 195 SOAPdenovo, 460 bp fragments, 8x coverage, k-mer of 31, 100 nt read length: Contig N50: 1548 bp, scaffold N50: 12832 bp, complete CEGs: 80, partial CEGs: 183
Our main interest is the presence or absence of some specific genes. So the use of CEGMA might not give me the specific information we want, but I'm pretty sure that the more genes CEGMA finds, the better assembly I have and it's easier for us to we will find what we want. Just BLASTing the raw reads for some gene sequences would probably work at even lower coverage than 6x, but then we might not be able to reconstruct the whole gene structure (all exons and introns). If we will just BLAST, FLASHed reads might be better.
We will get 150 nt reads if it's not too much more expensive than 100 nt reads. That will in a way give us 50 % more coverage for "free".
So, what do you think? Is it better to go for overlapping reads or not? Do you have any thoughts on which assembler to use? (I've briefly looked at LOCAS, an assembler for low coverage sequencing, but it's not parallelized and might be a bit slow.)
Sincerely,
Ole
Comment