Old 02-23-2012, 06:35 AM   #1
Ole
Member
 
Location: Oslo, Norway

Join Date: Oct 2011
Posts: 17
Low coverage sequencing, which strategy?

Hi,
we are planning to sequence a eukaryote without a reference genome, with a genome size of around 1 Gb. We don't have the resources for deep sequencing, but are thinking about something like 6x-20x coverage, depending on what we want to go for.

Our first problem is deciding which sequencing strategy we want. Basically, we're looking at either overlapping reads (sequence 100 nt from each end of a 180 bp fragment to get a 20 bp overlap, or 150 nt from each end of a fragment of around 250 bp) or non-overlapping reads (sequence 100 nt or 150 nt from each end of a 300-500 bp fragment).

The main argument for overlapping reads seems to be that we can use something like FLASH (http://bioinformatics.oxfordjournals...tr507.abstract, http://www.cbcb.umd.edu/software/flash/) to combine the paired reads, and then basically get a "super"-read of 180-250 bp with higher per-base quality than with non-overlapping reads.
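(As an aside, here's a minimal sketch of what that merging amounts to; FLASH itself does this properly -- it tolerates mismatches in the overlap and uses the base qualities -- so this toy version, which only handles the exact-match case, is purely for illustration.)

Code:
# Toy illustration of merging an overlapping read pair into one longer
# "super"-read. FLASH handles mismatches and base qualities; this sketch
# only covers a perfect-match overlap.
def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(r1, r2, min_overlap=10):
    r2rc = revcomp(r2)
    # Try the longest possible overlap first, shrink until a match is found.
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None  # no confident overlap

# e.g. two 100 nt reads from a 180 bp fragment should merge with a ~20 bp overlap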

The main argument for non-overlapping reads seems to be that we can scaffold a bit, get longer-range contiguity than with overlapping reads, and get somewhat longer contigs too.

I've been testing a bit on the parrot dataset (http://assemblathon.org/pages/download-data) since that is one of the few datasets I've found that has both kinds of reads. I've simulated pseudo 100 nt reads by removing the first 20 bases and the last 30 bases of the reads in the two datasets with 220 bp and 500 bp fragments, resulting in effective fragments of 180 bp with a 20 bp overlap and 460 bp without overlap. I've also done some testing with the original data (i.e. the 150 nt reads).
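(The trimming itself was just this, in case anyone wants to reproduce it; a plain-Python sketch, and the file names are placeholders.)

Code:
# Turn 150 nt reads into pseudo 100 nt reads by dropping the first 20
# and last 30 bases of every read (FASTQ = 4 lines per record).
def trim_fastq(in_path, out_path, head=20, tail=30):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            name, seq, plus, qual = (line.rstrip("\n") for line in record)
            fout.write("%s\n%s\n%s\n%s\n"
                       % (name, seq[head:-tail], plus, qual[head:-tail]))

trim_fastq("parrot_150nt_1.fastq", "parrot_pseudo100nt_1.fastq")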

I've run CEGMA (http://nar.oxfordjournals.org/content/37/1/289.abstract, http://korflab.ucdavis.edu/Datasets/cegma/) to get a feel for how complete my assembly is, in addition to the regular N50. I've mostly run SOAPdenovo since it is fast enough to run several different assemblies in a short time, and I've run Celera on some selected assemblies.
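(For reference, the N50 values below use the standard definition: sort the contigs by length and take the length at which the running sum reaches half the total assembly size. A quick sketch:)

Code:
# N50: sort contig lengths in descending order and walk down the list
# until the running sum reaches half the total assembly length.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

print(n50([2000, 1500, 1000, 500]))  # -> 1500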

Some results:

Code:
Assembler   Fragment  Cov.  Reads           k-mer  Contig N50  Scaffold N50  Complete CEGs  Partial CEGs
SOAPdenovo  180 bp    10x   100 nt          31     1618 bp     2001 bp       38             142
SOAPdenovo  180 bp    8x    100 nt          31     1618 bp     2001 bp       21             127
Celera      180 bp    8x    100 nt          -      2161 bp     2218 bp       28             114

SOAPdenovo  180 bp    10x   FLASHed 100 nt  31     902 bp      903 bp        7              80
SOAPdenovo  180 bp    8x    FLASHed 100 nt  31     965 bp      966 bp        11             88

Celera      220 bp    10x   FLASHed 150 nt  -      4085 bp     4100 bp       66             177

SOAPdenovo  460 bp    10x   100 nt          31     1998 bp     12822 bp      92             189
Celera      460 bp    10x   100 nt          -      3853 bp     15641 bp      87             195
SOAPdenovo  460 bp    8x    100 nt          31     1548 bp     12832 bp      80             183
(Complete CEGs: how many of the 248 core genes that CEGMA searches for were found complete.)

Our main interest is the presence or absence of some specific genes. So CEGMA might not give us the specific information we want, but I'm fairly sure that the more genes CEGMA finds, the better the assembly is and the easier it will be for us to find what we want. Just BLASTing the raw reads for some gene sequences would probably work at even lower coverage than 6x, but then we might not be able to reconstruct the whole gene structure (all exons and introns). If we only BLAST, FLASHed reads might be better.
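(Roughly what I have in mind for that read-level check is something like the following; it assumes NCBI BLAST+ is installed and the reads have already been converted to FASTA, and all file names are placeholders.)

Code:
# Presence/absence check directly on the raw reads: build a BLAST
# database from the (FASTA-converted) reads and search our genes of
# interest against it. Assumes NCBI BLAST+ is on the PATH.
import subprocess

subprocess.run(["makeblastdb", "-in", "reads.fasta", "-dbtype", "nucl"],
               check=True)
subprocess.run(["blastn",
                "-query", "genes_of_interest.fasta",
                "-db", "reads.fasta",
                "-outfmt", "6",        # tabular output
                "-evalue", "1e-10",
                "-out", "gene_hits.tsv"],
               check=True)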

We will get 150 nt reads if they're not too much more expensive than 100 nt reads. That would in a way give us 50% more coverage for "free".
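(The arithmetic behind that, with a made-up number of read pairs just for illustration:)

Code:
# coverage = read pairs * (2 * read length) / genome size
genome_size = 1e9        # ~1 Gb genome
read_pairs = 50e6        # hypothetical number of read pairs
for read_len in (100, 150):
    print("%d nt reads: %.1fx" % (read_len, read_pairs * 2 * read_len / genome_size))
# 100 nt reads: 10.0x
# 150 nt reads: 15.0x   i.e. 50% more coverage for the same number of pairs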

So, what do you think? Is it better to go for overlapping reads or not? Do you have any thoughts on which assembler to use? (I've briefly looked at LOCAS, an assembler for low coverage sequencing, but it's not parallelized and might be a bit slow.)

Sincerely,
Ole
Old 02-23-2012, 08:23 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Quote:
Is it better to go for overlapping reads or not?
You have done your homework quite thoroughly. I think you have the answer in your simulation: the 460 bp fragments (non-overlapping) are much better than the 180-220 bp fragments on the metrics that matter -- CEGs found and scaffold N50. The issue of using FLASHed reads in BLAST searches is, in my mind, a side issue.

A question: if you are interested in genes, then perhaps a transcriptome instead of a whole-genome project would be better? At least it would give deeper coverage.

As for the assembler, I prefer ABySS, but that is just because I am used to it. Your choices are good. A longer k-mer size might improve the overall assembly.

Once again, thumbs up (in the positive sense) for doing your background work before the actual sequencing.
Old 02-25-2012, 05:14 AM   #3
Ole
Member
 
Location: Oslo, Norway

Join Date: Oct 2011
Posts: 17

Quote:
You have done your homework quite thoroughly. I think you have the answer in your simulation: the 460 bp fragments (non-overlapping) are much better than the 180-220 bp fragments on the metrics that matter -- CEGs found and scaffold N50. The issue of using FLASHed reads in BLAST searches is, in my mind, a side issue.
I'm coming to the same conclusion, but I'm running some assemblies at a bit higher coverage (15x and 20x) to see what happens then, and changing the parameters a bit. At that coverage I can, for example, error correct with Quake.

Quote:
A question: if you are interested in genes, then perhaps a transcriptome instead of a whole-genome project would be better? At least it would give deeper coverage.
That's a good idea, but then you can't be sure about the presence/absence of genes, can you? You have to get the right tissue under the right conditions at the right time, and that can be hard.

Quote:
As for the assembler, I prefer ABySS, but that is just because I am used to it. Your choices are good. A longer k-mer size might improve the overall assembly.
I'll look into that. For now I'm running SOAPdenovo on all the different setups, and will select some promising setups to run with Celera. On Twitter last night we had some discussion about this, and SGA came up as a promising assembler too. OLC assemblers might be the best choice for this case.

Quote:
Once again, thumbs up (in the positive sense) for doing your background work before the actual sequencing.
Thank you; the better the groundwork, the easier the actual doing is.

I'll come back with some more thoughts and examples of setups and results when I have them.
Old 02-29-2012, 10:21 AM   #4
NextGenSeq
Senior Member
 
Location: USA

Join Date: Apr 2009
Posts: 482

You may want to investigate PASHA. De novo assembly of short reads needs a ton of RAM.

http://sites.google.com/site/yongchaosoftware/pasha

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167803/