  • Genome Assembly

    Hello,
    I am trying to assemble a genome (estimated at ~28 Mb) and I have the following types of sequencing data:
    454 reads (~3 million reads)
    Illumina single-end 50 bp (~45 million reads)
    Illumina paired-end 50 bp, 2 kb insert (~40 million pairs)
    Illumina single-end RNA-seq (multiple conditions pooled) (~50 million reads)


    I am looking for assembly software that can take in multiple data types and create a single assembled genome. Previously this has been done by using a different assembler for each data type and then merging the assemblies. However, I imagine that a single assembler given all of the sequencing datasets at once would produce a better assembly than simply merging the assemblies together.

    If this doesn't exist, or is discussed in another thread, then please point me in the right direction.

    Cheers,
    Phil

  • #2
    Is that a typo, or is the genome really only about 28 Mb? A genome that small is considered pretty easy to assemble.
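    With that much data you should have deep coverage: by read count times read length, roughly 80x from the single-end Illumina reads and over 140x from the paired-end library, before even counting the 454 reads. I would leave the RNA-seq data out of the genome assembly.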
    Anyway, this thread should help you.
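
    As a rough check on the coverage figures, here is a minimal sketch of the usual depth estimate (reads x read length / genome size). The read counts and the ~28 Mb genome size come from the original post; the 454 read length is not given in the thread, so the 400 bp used below is purely an assumption for illustration.

```python
def fold_coverage(n_reads, read_len_bp, genome_size_bp):
    """Expected average depth: total sequenced bases divided by genome size."""
    return n_reads * read_len_bp / genome_size_bp


GENOME_SIZE = 28_000_000  # ~28 Mb estimate from the original post

# Read counts and lengths as stated in the original post.
illumina_se = fold_coverage(45_000_000, 50, GENOME_SIZE)
illumina_pe = fold_coverage(2 * 40_000_000, 50, GENOME_SIZE)  # two reads per pair

# ASSUMPTION: 400 bp average 454 read length (not stated in the thread).
ASSUMED_454_READ_LEN = 400
roche_454 = fold_coverage(3_000_000, ASSUMED_454_READ_LEN, GENOME_SIZE)

print(f"Illumina single-end: {illumina_se:.0f}x")  # 80x
print(f"Illumina paired-end: {illumina_pe:.0f}x")  # 143x
print(f"454 (assumed 400 bp reads): {roche_454:.0f}x")
```

    These are averages over the whole genome; real depth will vary with repeats and GC content, so treat the numbers as a ballpark only.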

    • #3
      First off, that thread Jeremy linked to is very good. I've found some guidance in that same place.

      Personally, my first strategy would probably be to leave the 454 alone for now, since you have plenty of Illumina coverage. So, first assemble the Illumina data with something like ABySS or SOAP, including scaffolding. Then, throw the 454 data in to fill gaps (BaseClear has a standalone tool that I believe takes 454).

      If that doesn't work out as you need, which I doubt, you could assemble only contigs from the Illumina and 454 data separately, then merge them with something like CAP3. Then scaffold and gap fill again using standalone programs.
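
      To judge how much gap filling either route still leaves to do, one option is to count the runs of N in the scaffolds. The sketch below is an illustration only: `gap_summary` is a hypothetical helper, not part of any of the tools named above, and it assumes gaps are written as runs of N (or n) in the scaffold sequence.

```python
import re


def gap_summary(scaffold_seq, min_gap=1):
    """Return (number of gaps, total gap bases) for runs of N/n in a scaffold.

    Hypothetical helper for illustration; runs shorter than `min_gap`
    are ignored.
    """
    gap_lengths = [m.end() - m.start() for m in re.finditer(r"[Nn]+", scaffold_seq)]
    gap_lengths = [g for g in gap_lengths if g >= min_gap]
    return len(gap_lengths), sum(gap_lengths)


# Toy scaffold: one 5 bp gap and one 2 bp gap.
example = "ACGTNNNNNACGTACGTNNACGT"
print(gap_summary(example))             # (2, 7)
print(gap_summary(example, min_gap=3))  # (1, 5)
```

      Running this before and after gap filling gives a quick sense of whether the 454 reads actually closed anything.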

      Alternatively, you could give all types of data to Ray and assemble them together. Ideally, you'd do all three methods and compare what you get. Don't just trust simple stats like N50 or NG50. I'd suggest aligning your genome assemblies to whatever is the most closely related species with a high-quality genome and visualizing the result somehow. BWA-SW could help you with this, as could something like lastz or MUMmer. With a genome that small, you should be able to get a decent sense of how the assembly is going by just scrolling along the alignments in IGV and checking for any sort of funny business (yes, that's the technical term).
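
      For reference, a minimal sketch of how N50 and NG50 are usually computed from a list of contig lengths. N50 uses the assembly's own total length, while NG50 uses an estimated genome size instead. The function name and the toy lengths are made up for illustration.

```python
def n50(lengths, total=None):
    """Length L such that contigs of length >= L cover at least half of `total`.

    With total=None, `total` is the assembly size, giving N50.
    Pass an estimated genome size as `total` to get NG50.
    Returns 0 if the contigs cover less than half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy contig lengths (hypothetical), total assembly size 300.
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))             # N50 = 80
print(n50(contigs, total=400))  # NG50 for an assumed genome size of 400 = 60
```

      Note that both numbers only describe contiguity. A misjoined assembly can score a higher N50 than a correct one, which is why the alignment checks matter.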

      Ignore the RNA-seq data until you have a genome that you like, then align the reads to that genome to aid in the annotation process. You could also de novo assemble the RNA into transcripts and align those to the genome, or do both. MAKER is a nice program for guiding you through annotating your genome. Incorporating RNA-seq into a genome assembly could prove useful one day, but it's pretty difficult to do now. That said, the RNA-seq alignments and/or de novo assembled transcript alignments will also help you determine the quality of your assemblies, i.e., gaps or misassemblies in the genome will interfere with transcript and raw read alignments, which you can also visualize in IGV. So you may want to carry all major versions of your assembly through this far to see which ones contain the most complete genes.

      Good luck!
      Last edited by Wallysb01; 08-06-2012, 11:43 PM.
