  • Low coverage sequencing, which strategy?

    Hi,
    we are planning to sequence a eukaryote without a reference genome, with a genome size of around 1 Gb. We don't have the resources for deep sequencing, but are thinking of hitting something like 6x-20x coverage, depending on what we want to go for.

    Our first problem is deciding which sequencing strategy we want. Basically, we're looking at either overlapping reads (sequence 100 nt from each end of a fragment of 180 bp to get an overlap of 20 bp, or 150 nt from each end of a fragment of 250 bp or thereabouts) or non-overlapping reads (sequence 100 nt or 150 nt from each end of a 300-500 bp fragment.)
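    The arithmetic behind those numbers is simply twice the read length minus the fragment length; a quick sanity check in Python, using the fragment sizes from above:

```python
def expected_overlap(fragment_len, read_len):
    """Overlap (bp) between the two reads of a pair sequenced from both
    ends of a fragment. A negative result means an inner gap, not an overlap."""
    return 2 * read_len - fragment_len

print(expected_overlap(180, 100))  # 20 bp overlap
print(expected_overlap(250, 150))  # 50 bp overlap
print(expected_overlap(400, 100))  # -200, i.e. a 200 bp inner gap
```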

    The main argument for overlapping reads seems to be that we can use something like FLASH (http://bioinformatics.oxfordjournals...tr507.abstract, http://www.cbcb.umd.edu/software/flash/) to combine the paired reads, and basically get a "super"-read of 180-250 bp with higher per-base quality than with non-overlapping reads.
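    To make that concrete, here is a toy sketch of what such a merge does. This is not FLASH's actual algorithm (FLASH searches for the best-scoring overlap; here the overlap length is simply given): reverse-complement the reverse read, and in the overlap keep whichever base has the higher quality.

```python
def revcomp(seq):
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(r1, q1, r2, q2, overlap):
    """Toy FLASH-style merge of a read pair into one 'super'-read.
    r1/r2: forward and reverse reads; q1/q2: per-base quality scores.
    The overlap length is assumed known; real FLASH infers it."""
    r2c = revcomp(r2)
    q2r = q2[::-1]  # qualities follow the reverse-complemented bases
    seq = list(r1[:len(r1) - overlap])
    qual = list(q1[:len(q1) - overlap])
    for i in range(overlap):
        b1, s1 = r1[len(r1) - overlap + i], q1[len(q1) - overlap + i]
        b2, s2 = r2c[i], q2r[i]
        # keep the higher-quality base inside the overlap
        if s1 >= s2:
            seq.append(b1); qual.append(s1)
        else:
            seq.append(b2); qual.append(s2)
    seq.extend(r2c[overlap:])
    qual.extend(q2r[overlap:])
    return "".join(seq), qual

# a low-quality error at the end of read 1 is corrected by read 2
merged, _ = merge_pair("ACGTACGA", [30]*7 + [2], "ACGTACGT", [30]*8, 4)
print(merged)  # ACGTACGTACGT
```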

    The main argument for non-overlapping reads seems to be that we can scaffold a bit, get a longer range contiguity than with overlapping reads and get a bit longer contigs too.

    I've been testing a bit on the parrot dataset (http://assemblathon.org/pages/download-data) since that is one of the few datasets I've found that has both kinds of reads. I've simulated pseudo 100 nt reads by removing the first 20 and the last 30 bases of the reads in the two datasets with 220 bp and 500 bp fragments, resulting in 180 bp fragments with 20 bp overlap and 460 bp fragments without overlap. I've also done some testing with the original data (i.e. the 150 nt reads).
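    For anyone wanting to repeat this, the trimming step itself is trivial; a sketch assuming plain sequence/quality strings:

```python
def trim_read(seq, qual, head=20, tail=30):
    """Simulate a 100 nt read from a 150 nt one by dropping the first
    `head` and last `tail` bases (and the matching quality values)."""
    return seq[head:len(seq) - tail], qual[head:len(qual) - tail]

seq, qual = trim_read("A" * 150, "I" * 150)
print(len(seq))  # 100
```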

    I've run CEGMA (http://nar.oxfordjournals.org/content/37/1/289.abstract, http://korflab.ucdavis.edu/Datasets/cegma/) to get a feel for how complete my assembly is, in addition to the regular N50. I've mostly run SOAPdenovo since it is fast enough to run several different assemblies in a short time, and I've run Celera on some selected assemblies.

    Some results:

    Code:
    Assembler   Fragment  Cov  k-mer  Reads           Contig N50  Scaffold N50  Complete CEGs  Partial CEGs
    SOAPdenovo  180 bp    10x  31     100 nt          1618 bp     2001 bp       38             142
    SOAPdenovo  180 bp    8x   31     100 nt          1618 bp     2001 bp       21             127
    Celera      180 bp    8x   -      100 nt          2161 bp     2218 bp       28             114
    SOAPdenovo  180 bp    10x  31     FLASHed 100 nt  902 bp      903 bp        7              80
    SOAPdenovo  180 bp    8x   31     FLASHed 100 nt  965 bp      966 bp        11             88
    Celera      220 bp    10x  -      FLASHed 150 nt  4085 bp     4100 bp       66             177
    SOAPdenovo  460 bp    10x  31     100 nt          1998 bp     12822 bp      92             189
    Celera      460 bp    10x  -      100 nt          3853 bp     15641 bp      87             195
    SOAPdenovo  460 bp    8x   31     100 nt          1548 bp     12832 bp      80             183
    (Complete CEGs are how many of the 248 genes CEGMA searches for that were found complete.)
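    For reference, N50 is the contig length at which contigs of that size or longer contain at least half the total assembly; a minimal implementation:

```python
def n50(lengths):
    """Walk down the contig lengths, largest first, until the running
    total reaches half the assembly size; return that contig's length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([80, 70, 50, 40, 30, 20]))  # 70 (80 + 70 = 150 >= 290 / 2)
```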

    Our main interest is the presence or absence of some specific genes. CEGMA might not give us that specific information, but I'm pretty sure that the more genes CEGMA finds, the better the assembly is and the easier it will be to find what we want. Just BLASTing the raw reads for some gene sequences would probably work at even lower coverage than 6x, but then we might not be able to reconstruct the whole gene structure (all exons and introns). If we will just BLAST, FLASHed reads might be better.

    We will get 150 nt reads if it's not too much more expensive than 100 nt reads. That will, in a way, give us 50% more coverage for "free".
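    That "free" coverage is just arithmetic: depth = number of reads x read length / genome size, so at a fixed read count, 150 nt reads give 1.5x the depth of 100 nt reads. With our ~1 Gb genome (the read count below is made up, for illustration only):

```python
def depth(n_reads, read_len, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size

genome = 1_000_000_000   # ~1 Gb genome
reads = 100_000_000      # hypothetical read count, for illustration only
print(depth(reads, 100, genome))  # 10.0
print(depth(reads, 150, genome))  # 15.0, i.e. 50% more for the same reads
```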

    So, what do you think? Is it better to go for overlapping reads or not? Do you have any thoughts on which assembler to use? (I've briefly looked at LOCAS, an assembler for low coverage sequencing, but it's not parallelized and might be a bit slow.)

    Sincerely,
    Ole

  • #2
    Is it better to go for overlapping reads or not?
    You have done your homework quite thoroughly. I think that you have the answer in your simulation: those 460 bp fragments (non-overlapping) are much better than the 180-220 bp fragments in the metrics that matter -- CEGs found and N50 scaffold. The issue of using FLASHed reads in blasts is, in my mind, a side issue.

    A question. If you are interested in genes then perhaps a transcriptome instead of a whole genome project would be better? At least it would give deeper coverage.

    As for the assembler, I prefer ABySS, but that is just because I am used to it. Your choices are good. A longer k-mer size might improve the overall assembly.

    Once again, thumbs up for doing your background work before doing the actual sequencing.

    Comment


    • #3
      You have done your homework quite thoroughly. I think that you have the answer in your simulation: those 460 bp fragments (non-overlapping) are much better than the 180-220 bp fragments in the metrics that matter -- CEGs found and N50 scaffold. The issue of using FLASHed reads in blasts is, in my mind, a side issue.
      I'm coming to the same conclusion, but I'm running some assemblies on a bit higher coverage (15x and 20x) to see what happens then, and changing the parameters a bit. At that coverage I can error correct with Quake for example.

      A question. If you are interested in genes then perhaps a transcriptome instead of a whole genome project would be better? At least it would give deeper coverage.
      That's a good idea, but you are not sure about the presence/absence of genes then, are you? You have to get the right tissue under the right conditions at the right time, and that can be hard.

      As for the assembler, I prefer ABySS, but that is just because I am used to it. Your choices are good. A longer k-mer size might improve the overall assembly.
      I'll look into that. For now I'm running SOAPdenovo on all the different setups, and will select some promising ones to run with Celera. On Twitter last night we had some discussion about this, and SGA came up as a promising assembler too. OLC assemblers might be the best fit for this case.

      Once again, thumbs up (in the positive sense) of doing your background work before doing the actual sequencing.
      Thank you; the better the groundwork, the easier the actual work is.

      I'll come back with some more thoughts and examples of setups and results when I have it.

      Comment


      • #4
        You may want to investigate PASHA. De novo assembly of short reads needs a ton of RAM.

        Comment
