  • Initial Assembly Help - Short Reads Large Genome

    Hello everyone,

    I'm thinking about trying my hand at assembling some ~300 Mbp genomes using 100 bp paired-end Illumina reads (coverage is modest, ~8-12X). I'm not new to bioinformatics, but I am new to this area. Does anyone have preferences for which assemblers work best for this kind of data? I'm not sure whether I should use a reference genome as a guide, since the most closely related species is not that closely related.

    Any pointers much appreciated.

    Thanks,

    Michael

  • #2
    How repetitive is it? How close is the reference genome? Is it highly polymorphic? Diploid/tetraploid/hexaploid?


    • #3
      Hi Jimmy,

      Thanks for replying. I'm working with Lepidopteran genomes (butterflies and moths), which, judging by the current reference genomes, have fairly high repetitive content. The closest reference genomes are Manduca, Heliconius, Bombyx and Monarch. When I blastn conserved genes against these, the best similarity hits are ~90% and the worst are ~65%. The genomes are diploid.

      My current strategy is to first filter raw paired-end reads for low quality/adapters and then go straight into assembly with SOAPdenovo (47 kmer). I'm running this on a cluster, so will hopefully have some indication of its success in the next day or two. Any further comments welcomed.


      Best,

      Michael


      • #4
        Few tips:

        1) error correction prior to assembly
        I used Coral (5 iterations) but you may check Quake or Reptile

        2) depending on insert sizes of your libs, you may try to find overlaps between paired ends prior to assembly with FLASH


        3) try more assemblers (ABySS, SGA, etc.) and compare the results
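
As a rough illustration of tip 2: whether the two mates of a pair can overlap at all comes down to simple arithmetic on read length and fragment length. The sketch below assumes "fragment length" means the full sequenced fragment (both reads included), and `min_overlap` is a hypothetical threshold chosen for illustration, not a setting taken from any particular tool.

```python
def expected_overlap(read_len: int, fragment_len: int) -> int:
    """Number of bases covered by both mates of a pair.

    Assumes both reads have the same length and start at opposite
    ends of the fragment. Returns 0 when the mates do not reach
    each other.
    """
    # Two reads of length L on a fragment of length F share 2L - F bases,
    # but never more than the fragment itself (F shorter than one read).
    return max(0, min(fragment_len, 2 * read_len - fragment_len))


def worth_merging(read_len: int, fragment_len: int, min_overlap: int = 10) -> bool:
    """True if the expected overlap reaches a (hypothetical) minimum."""
    return expected_overlap(read_len, fragment_len) >= min_overlap


# 100 bp reads against a few illustrative fragment lengths
for frag in (150, 180, 250, 300):
    print(frag, expected_overlap(100, frag), worth_merging(100, frag))
```

With 100 bp reads, only libraries whose fragments are shorter than about 200 bp leave anything to merge, so this step mostly matters for short-insert libraries.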

        If you feel like experimenting:


        • #5
          Originally posted by mabentley86
          Hi Jimmy,

          Thanks for replying. I'm working with Lepidopteran genomes (butterflies and moths), which, judging by the current reference genomes, have fairly high repetitive content. The closest reference genomes are Manduca, Heliconius, Bombyx and Monarch. When I blastn conserved genes against these, the best similarity hits are ~90% and the worst are ~65%. The genomes are diploid.

          My current strategy is to first filter raw paired-end reads for low quality/adapters and then go straight into assembly with SOAPdenovo (47 kmer). I'm running this on a cluster, so will hopefully have some indication of its success in the next day or two. Any further comments welcomed.


          Best,

          Michael

          Looks OK. I second darked89's points about trying different assemblers. Set up some bash scripts to experiment with different parameters too. I can't comment much on error correction, but considering your low coverage it wouldn't be a bad option.

          Have a look at the Assemblathon papers to pick up some good assembly tips too.
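
A parameter sweep of the kind suggested above could be sketched as follows. This is only a sketch under assumptions: the binary name `SOAPdenovo-63mer` and the `all`, `-s`, `-K`, `-p`, `-o` options are as I recall them from the SOAPdenovo2 documentation, so check them against your installed version. The config file name `lib.config` and the `asm_k<k>` output prefixes are hypothetical. The script only builds and prints the command lines; it does not run anything.

```python
import shlex


def soapdenovo_commands(config_path, kmers, threads=8, binary="SOAPdenovo-63mer"):
    """Build one assembler command line per k-mer size.

    Flags are assumed from the SOAPdenovo2 documentation; verify
    them against the version installed on your cluster.
    """
    commands = []
    for k in kmers:
        if k % 2 == 0:
            # de Bruijn graph assemblers conventionally use odd k
            raise ValueError(f"k-mer size should be odd, got {k}")
        commands.append([
            binary, "all",
            "-s", config_path,    # library configuration file
            "-K", str(k),         # k-mer size
            "-p", str(threads),   # number of threads
            "-o", f"asm_k{k}",    # hypothetical output prefix
        ])
    return commands


# Sweep k = 31, 39, 47, 55, 63 and print the commands for inspection
for cmd in soapdenovo_commands("lib.config", range(31, 64, 8)):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to paste them into whatever job scheduler the cluster uses, one job per k.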


          • #6
            You would be better off with more data: not just more coverage, but also other library types, such as a 2000 bp insert size or larger. A reference may not be a good guide for your assembly if the species are not closely related; it can sometimes introduce many errors.


            • #7
              The tips given here are great, but with 8-12x coverage, you're not going to have a usable assembly. With sequencing errors, heterozygosity, and uneven coverage, you might only assemble 1/4 of the genome into contigs >200 bp. Maybe you'd be able to get away with it if the genome had few repeats, but it sounds like that's not the case.

              What you may find is that you can map single exons of genes to your contigs, but that will make orthology assignment difficult in many cases. So with that kind of coverage, I'd stick to aligning your reads to the most closely related species, and try to allow for some extra sequence divergence. In my experience, 30x coverage is about the minimum for de novo assembly from NGS data. You could probably get away with traditional Illumina libraries if you did 300, 600 and 1000 bp inserts, and avoid the mate pair libraries, at least initially. Though if you want as good an assembly as possible, you'll need some 2-10 kbp mate pair libraries as well.
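
To put numbers on the coverage figures in this thread, here is a small sketch using the standard relations: coverage C = N·L/G for N reads of length L on a genome of size G, and k-mer coverage C·(L − k + 1)/L, which is what a de Bruijn graph assembler effectively works with. The genome size, read length and k are the values mentioned earlier in the thread; the formulas assume uniform sampling and ignore errors and repeats, so treat the results as optimistic.

```python
import math

GENOME_SIZE = 300_000_000  # ~300 Mbp, from the original post
READ_LEN = 100             # 100 bp reads
K = 47                     # k-mer size mentioned for SOAPdenovo


def coverage(n_reads, read_len=READ_LEN, genome_size=GENOME_SIZE):
    """Average base coverage: C = N * L / G."""
    return n_reads * read_len / genome_size


def reads_needed(target_cov, read_len=READ_LEN, genome_size=GENOME_SIZE):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_cov * genome_size / read_len)


def kmer_coverage(base_cov, read_len=READ_LEN, k=K):
    """Effective k-mer coverage: C * (L - k + 1) / L."""
    return base_cov * (read_len - k + 1) / read_len


for cov in (8, 12, 30):
    print(
        f"{cov}x base coverage: {reads_needed(cov):,} reads, "
        f"k-mer coverage at k={K} is about {kmer_coverage(cov):.1f}x"
    )
```

At k=47 with 100 bp reads, roughly half of the base coverage is lost when counted in k-mers, which is one reason low-coverage data tends to favor smaller k values.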


              • #8
                Thanks for all the tips everyone, all very helpful.

                The aim of gathering this kind of data was to build low quality assemblies that could be queried for coding sequences, without worrying so much about synteny or other genome features. This is why different insert sizes weren't used. I guess the aim was to build the cheapest genome that could still prove to be useful.

                As Wallysb mentioned, the problem I'm finding is that it is difficult to assign orthology to hits. This is not such a problem for one of my genes of interest, but several of the others appear to have lineage-specific duplicate copies. Here the trouble is first identifying true exon hits, and then piecing them together correctly. The process is also very manual; writing scripts wouldn't really help for this part, as I need to eyeball everything and tweak parameters just to be sure I'm getting the right things out.
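
Even if the final call has to be made by eye, a first pass that groups hits per gene and contig can make the eyeballing easier. The sketch below assumes BLAST tabular output (`-outfmt 6`, default 12 columns) from coding-sequence queries searched against the assembled contigs. The identity cutoff `min_pident` is an arbitrary illustrative value, and the example rows are made up for demonstration, not real results.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def group_hits(handle, min_pident=65.0):
    """Group hits by (query gene, contig), sorted along the query.

    Returns {(qseqid, sseqid): [(qstart, qend, pident), ...]}.
    """
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["pident"]) < min_pident:
            continue  # drop weak hits (arbitrary cutoff)
        key = (hit["qseqid"], hit["sseqid"])
        groups[key].append(
            (int(hit["qstart"]), int(hit["qend"]), float(hit["pident"]))
        )
    return {key: sorted(hits) for key, hits in groups.items()}


# Made-up example rows, for illustration only
example = (
    "geneA\tcontig_12\t88.0\t150\t18\t0\t301\t450\t900\t1049\t1e-40\t180\n"
    "geneA\tcontig_12\t91.0\t120\t11\t0\t1\t120\t200\t319\t1e-35\t160\n"
    "geneA\tcontig_77\t72.0\t150\t42\t0\t301\t450\t50\t199\t1e-20\t110\n"
    "geneA\tcontig_90\t60.0\t100\t40\t0\t500\t599\t10\t109\t1e-05\t50\n"
)

for (gene, contig), hits in group_hits(io.StringIO(example)).items():
    print(gene, contig, hits)
```

Two contigs that cover the same query interval at different identities (as `contig_12` and `contig_77` do in the made-up rows) are the kind of pattern that may point to a duplicate copy and deserves a closer manual look.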

                I am going to try some of these suggestions and see if things improve. As ever, any further comments or suggestions much appreciated.

                Michael
