  • Hybrid assembly using HiSeq and MiSeq data

    I'm trying to assemble a relatively large insect genome (~1.5 Gbp) and have data from two different sequencing platforms that I want to combine to get the best possible assembly.

    More specifically, I have Illumina HiSeq data (2 x 100 bp) with an insert size of 550 bp, which gives me around 40x coverage (from 4 libraries). Recently, I also sequenced one of these 550 bp libraries on the MiSeq platform (2 x 300 bp, overlapping reads). After merging the mates I get "long" reads (most of them are >400 bp), with an estimated coverage of about 3x.
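    For reference, the coverage figures above follow from coverage = (number of reads x read length) / genome size. The sketch below back-calculates roughly how many reads each data set implies; the 400 bp merged-read length is an assumed round number (the post only says most merged reads are >400 bp), and the function names are illustrative.

```python
# Back-of-the-envelope coverage arithmetic for the two data sets.
# ASSUMPTION: merged MiSeq reads are treated as exactly 400 bp for simplicity.

def coverage(n_units: float, bases_per_unit: float, genome_size_bp: float) -> float:
    """Expected depth: total sequenced bases divided by genome size."""
    return n_units * bases_per_unit / genome_size_bp


def units_needed(target_cov: float, bases_per_unit: float, genome_size_bp: float) -> float:
    """Number of reads (or read pairs) needed to reach a target depth."""
    return target_cov * genome_size_bp / bases_per_unit


GENOME_BP = 1.5e9  # ~1.5 Gbp insect genome, from the post

# HiSeq: each pair contributes 2 x 100 bp
hiseq_pairs = units_needed(40, 2 * 100, GENOME_BP)

# MiSeq: each merged read contributes ~400 bp (assumed)
miseq_merged = units_needed(3, 400, GENOME_BP)

print(f"HiSeq read pairs for 40x: {hiseq_pairs:,.0f}")
print(f"Merged MiSeq reads for 3x: {miseq_merged:,.0f}")
```

    Under these assumptions the HiSeq libraries hold more than an order of magnitude more reads than the merged MiSeq set, which is the imbalance the question is about.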

    So, what do you think is the best strategy for de novo assembly when you have sequencing data that differs this much in read length and sequencing coverage?

    I'm asking because I think that pooling all reads together and assembling them with a k-mer-based assembler will "confuse" the assembler because of the difference in sequencing coverage. Moreover, I suspect that I'm not really making the most of my longer MiSeq reads if I use a k-mer-based assembler.
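    A toy illustration of the second concern: a k-mer-based assembler breaks every read into overlapping k-mers, so the fact that a long read spans a short repeat is not directly visible in the k-mer set. This is a minimal sketch, not how any particular assembler is implemented; k=64 and the toy sequence are arbitrary assumptions.

```python
# Toy k-mer decomposition. ASSUMPTIONS: k values and the sequence are made up
# for illustration; real assemblers also handle reverse complements, errors, etc.

def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in read order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def n_kmers(read_len: int, k: int) -> int:
    """A read of length L yields L - k + 1 k-mers (0 if the read is shorter than k)."""
    return max(read_len - k + 1, 0)


# A 100 bp and a 400 bp read both end up as k-mers of the same size.
print(n_kmers(100, 64), n_kmers(400, 64))

# A read that spans a short repeat: "ACGT" occurs twice in it.
read = "ACGTACGTTT"
all_k = kmers(read, 4)
distinct_k = set(all_k)

# Both copies of the repeat collapse onto the same k-mer, so the graph built
# from these k-mers has a cycle even though the read itself resolves the repeat.
print(len(all_k), len(distinct_k))
```

    Some k-mer-based assemblers do thread reads back through the graph afterwards, so how much long-range information is lost in practice depends on the tool.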

    Do you think an alternative would be to assemble the HiSeq and MiSeq data separately and then combine them using an OLC (overlap-layout-consensus) assembler (instead of a k-mer-based one)? If so, is there such an assembler that is particularly good at this task?

    Thanks!

  • #2
    Combining the reads into one set of files would not be a good idea. However, assemblers such as ABySS will happily take two or more sets of files and treat each one as a separate entity.

    I am not saying that ABySS is the best assembler for your work (although it is my 'go-to' assembler for large projects), but I do suggest giving it a try. In your case I would tell ABySS that I had 4 different paired-end libraries (the HiSeq data) and a single-end library (the MiSeq merged reads).
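    A minimal sketch of what that could look like, using the `name`, `k`, `lib` and `se` parameters of `abyss-pe`. The library names, file names and k value are hypothetical placeholders, and other parameters may be needed depending on the ABySS version, so check the README of the installed release before running anything.

```python
# Build an abyss-pe command line for 4 paired-end libraries plus one set of
# single-end (merged) reads.
# ASSUMPTIONS: all file names, library names and k=64 are placeholders.
import shlex


def abyss_pe_args(name, k, paired_libs, single_end_files):
    """Return the argument list for abyss-pe.

    paired_libs maps a library name to its (read1, read2) files;
    single_end_files is a list of files with unpaired reads.
    """
    args = ["abyss-pe", f"name={name}", f"k={k}"]
    # lib= lists the paired-end library names ...
    args.append("lib=" + " ".join(paired_libs))
    # ... and each name is then assigned its pair of files.
    for lib_name, (read1, read2) in paired_libs.items():
        args.append(f"{lib_name}={read1} {read2}")
    # se= takes the single-end reads (here: the merged MiSeq reads).
    if single_end_files:
        args.append("se=" + " ".join(single_end_files))
    return args


hiseq = {f"pe{i}": (f"hiseq_lib{i}_1.fq.gz", f"hiseq_lib{i}_2.fq.gz") for i in range(1, 5)}
cmd = abyss_pe_args("insect", 64, hiseq, ["miseq_merged.fq.gz"])

# Print a shell-quoted command for inspection; nothing is executed here.
print(shlex.join(cmd))
```

    Keeping the four HiSeq libraries under separate names means each one is handled as its own paired-end library, while the merged MiSeq reads are supplied as unpaired reads.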

    In answer to your final paragraph ("Do you think an alternative would be to assemble the HiSeq and MiSeq data separately and then combine them using an OLC ..."): yes, that should work as well. Minimus/Bambus is what I would use, though I am not sure they are the 'best'.

    • #3
      I would check out MaSuRCA. As input, you would give it the raw reads, neither trimmed nor merged. Each read set would be its own library.
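      As a rough sketch of "each read set is its own library": MaSuRCA reads a configuration file whose DATA section has one `PE=` line per paired-end library, giving a two-letter prefix, the mean and standard deviation of the insert size, and the forward and reverse read files. The helper below only generates that section. The paths, prefixes and the standard deviation of 80 are hypothetical placeholders, and the layout is written from memory, so compare it with the sample configuration shipped with your MaSuRCA version.

```python
# Generate the DATA section of a MaSuRCA configuration file.
# ASSUMPTIONS: paths, prefixes and the insert-size standard deviation (80)
# are placeholders; the PARAMETERS section is deliberately left out.

def masurca_data_section(libraries):
    """libraries: iterable of (prefix, mean_insert, stdev_insert, fwd, rev)."""
    lines = ["DATA"]
    seen = set()
    for prefix, mean, stdev, fwd, rev in libraries:
        # Each library needs its own two-letter prefix.
        if len(prefix) != 2 or prefix in seen:
            raise ValueError(f"prefix must be two letters and unique: {prefix!r}")
        seen.add(prefix)
        lines.append(f"PE= {prefix} {mean} {stdev} {fwd} {rev}")
    lines.append("END")
    return "\n".join(lines)


libs = [
    ("ha", 550, 80, "/data/hiseq_lib1_1.fq", "/data/hiseq_lib1_2.fq"),
    ("hb", 550, 80, "/data/hiseq_lib2_1.fq", "/data/hiseq_lib2_2.fq"),
    # Raw, unmerged 2 x 300 bp MiSeq reads go in as one more paired-end library.
    ("ma", 550, 80, "/data/miseq_1.fq", "/data/miseq_2.fq"),
]
data_section = masurca_data_section(libs)
print(data_section)
```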

      I have done this with a few different genomes, with varying success. When I had better success, it was generally not from sequencing the same library again with longer reads (overlapped or not) but from sequencing a new library with longer reads. My guess is that you end up averaging out library prep biases when you have more libraries.

      In the end, you will probably find that you still need a completely different data type to get a decent assembly: mate-pair (MP) libraries, long (>1 kb) reads, targeted high-depth sequencing, etc.

      • #4
        Thank you both for the hints!

        westerman, I'll give ABySS a go and see what I get.

        bioBob, I tried MaSuRCA about a year ago but was really disappointed (very, very buggy!). I've heard, though, that there's a new version with lots of bug fixes. I think I'll give that a try as well.

        • #5
          Let us know about MaSuRCA. The one time I tried it I was disappointed as well.
