  • Help towards closing a genome?

    Hello All,

    I am a graduate student trying to learn NGS as I wrap up my PhD. That said, we have sequenced our pet bacterial genome (Illumina HiSeq 2500, paired-end 101 bp), and so far I have managed to produce what looks to me like a good assembly. Reads were cleaned up with Trimmomatic and assembled using Ray-2.3.1 with a default k-mer of 31. The output is as follows:

    Contigs >= 100 nt
    Number: 28
    Total length: 4963730
    Average: 177276
    N50: 246178
    Median: 162206
    Largest: 771798
    Contigs >= 500 nt
    Number: 28
    Total length: 4963730
    Average: 177276
    N50: 246178
    Median: 162206
    Largest: 771798
    Scaffolds >= 100 nt
    Number: 22
    Total length: 4965242
    Average: 225692
    N50: 338745
    Median: 115189
    Largest: 1908686
    Scaffolds >= 500 nt
    Number: 22
    Total length: 4965242
    Average: 225692
    N50: 338745
    Median: 115189
    Largest: 1908686

    The total length is in good agreement with other sequenced genomes of the same species (ranging from 4.8 to 5.0 Mb). But I am now beyond what anyone at my institute has experience with. I would like to go as far as possible towards closing the genome, but I am unsure how to proceed. Can anyone provide some input on the logical next steps? Thank you very much!
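
For anyone reading the summary above who wants to recompute N50 from their own contig lengths, here is a minimal Python sketch. It assumes the usual definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases); Ray may break ties differently, and the lengths below are toy values, not the contigs from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until they cover at
    least half of the total bases; the length reached at that point
    is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy example only; these are not the real contig lengths.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 70
```

A larger N50 for scaffolds than for contigs, as in the output above, just reflects contigs being linked into longer pieces.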

  • #2
    First, ask yourself why you want or need a finished genome, and how much time and money you can spend on getting one.

    If you only care about one or two regions of interest, it may be cost-effective to do it the old-fashioned way (PCR and "Sanger" capillary sequencing to close the gaps).


    • #3
      Thanks for the reply!

      I had assumed a closed, or mostly closed, genome would make downstream applications much easier. We plan to do ChIP-Seq and possibly RNA-Seq with this bacterium later on, and figured having a mostly closed genome would be best.

      That being said, if a closed genome is not required for these experiments we would still like to join as many contigs as possible to publish a decent draft genome. And that is where we need some expert advice.


      • #4
        A closed circle is of course nice, but if all you care about is gene content you may be fine as it is. Finishing it will cost time and money whichever route you take.


        • #5
          There are several approaches of varying complexity and cost:

          The easiest, in my recent experience, is to get PacBio sequencing done. With the Illumina reads mapped to a PacBio assembly, you can close and finish the genome in about two days of solid work. But (and there are at least two big buts) it will cost you about $1500 for the sequencing, and the PacBio assembly process is not that easy or automated, so you may have to outsource that too. But it works, and we have done it for about 30 reference genomes needed for diagnostic purposes.

          You can find a very closely related genome or two and use synteny to help you arrange your contigs (Mauve, MUMmer, or reference mapping would help here), and then you can PCR-close the smaller, PCRable gaps. The rRNA regions will be difficult, and you could either ignore them, because they are not really that important for many studies, or generate primer sets to stitch the rRNA reads together. I've done it, it's a pain, but that's what we did in the old days.
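
To make the reference-ordering idea a bit more concrete, here is a rough Python sketch. It assumes you have already aligned the contigs to a related genome and parsed the hits into simple records; the `Hit` layout, the function names, and the coordinates are made up for illustration and do not match the output format of any particular aligner. It keeps the longest hit per contig as its anchor, sorts the anchors along the reference, and estimates the distance between neighbours, which gives a first idea of which gaps might be small enough for PCR.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One contig-to-reference alignment (hypothetical layout)."""
    contig: str
    ref_start: int  # 1-based, inclusive position on the reference
    ref_end: int    # 1-based, inclusive position on the reference
    strand: str     # "+" or "-"


def order_contigs(hits):
    """Keep the longest hit per contig and sort those anchors by reference start."""
    best = {}
    for hit in hits:
        current = best.get(hit.contig)
        if current is None or (hit.ref_end - hit.ref_start) > (current.ref_end - current.ref_start):
            best[hit.contig] = hit
    return sorted(best.values(), key=lambda h: h.ref_start)


def neighbour_gaps(placed):
    """Estimate the reference distance between consecutive anchors.

    A negative value means the two anchors overlap on the reference.
    """
    return [
        (left.contig, right.contig, right.ref_start - left.ref_end - 1)
        for left, right in zip(placed, placed[1:])
    ]


# Invented example hits, not real data.
hits = [
    Hit("contig2", 5000, 9000, "-"),
    Hit("contig1", 100, 4000, "+"),
    Hit("contig2", 200, 300, "+"),  # short secondary hit, ignored
]
placed = order_contigs(hits)
print([(h.contig, h.strand) for h in placed])
print(neighbour_gaps(placed))
```

Real genomes have rearrangements relative to their relatives, so an ordering like this is only a hypothesis to test by PCR, not proof of adjacency.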

          Or, as mentioned above, you can simply use your contig set in your downstream experiments. A large proportion of the genes involved with virulence, etc., are there already. The assembler typically stops extending a contig when the reads are shorter than the repeated region it has run into. A quick way of assessing the quality of your assembly is to auto-annotate the genome with something like "prokka" and look at what you have. You could probably use gap5 to join a few contigs which have some overlap, and to fix the odd frameshift, but you likely have what you need to continue your studies.
          Last edited by JohnN; 09-19-2014, 07:02 AM. Reason: typos


          • #6
            You already have a very good assembly, and closing the 28 remaining gaps probably won't affect many downstream programs. You will almost certainly need more data for a significant improvement: either a long mate-pair library for better scaffolding, or PacBio for gap-filling. If you go PacBio, you may as well just run 2-3 SMRT cells and try for a complete single-contig PacBio-only assembly.


            • #7
              I'd first try to scaffold it against a reference, and use that to estimate how much could be missing and whether it is relevant.
              If, for example, 3/4 of the gaps probably consist of 23S rRNA or stretches of tRNA, then just ignore them.

              If the missing parts seem to be more relevant, then there are a few things to consider:
              - is repeat structure a problem (doesn't seem so)
              - how much is missing? If it's a larger amount, then you might need to consider a second run with decent coverage
              - is the raw material still there? Because I think (not a lab person) that a PE jumping library (4 - 8 kb should get over the rRNAs; as suggested above) can be made from the same input material, so that would save time.
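
As a starting point for the "how much is missing?" question, here is a small Python sketch that lists the runs of N in a scaffold sequence. It assumes the scaffolder pads gaps between contigs with N characters, which is an assumption to check against your own scaffold file, and the sequence below is a toy string. Note that padded gap sizes are only estimates from read pairing.

```python
import re


def gap_runs(sequence):
    """Return (0-based start, length) for every run of N/n in a sequence."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"[Nn]+", sequence)]


# Toy scaffold, not real data.
toy_scaffold = "ACGTNNNNACGTnnA"
runs = gap_runs(toy_scaffold)
print(runs)                               # [(4, 4), (12, 2)]
print(sum(length for _, length in runs))  # 6 padded bases in total
```

For the assembly in the first post, the scaffold total (4965242) is only 1512 nt larger than the contig total (4963730), which, if the difference is all gap padding, would suggest the estimated gaps are small.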


              You should also do some QC on your genome. It can happen (I have seen it with Ray, HGAP, and other assemblers as well) that parts of the assembly are duplicated, which might not be obvious at first. For example, during some other processing of one of our genomes it turned out that it had the right size (5 Mb) and the right number of proteins (5k), but not the right number of "unique" proteins (4k). Why? One of the scaffolds was simply duplicated in the output.
              Also check that there's no obvious contamination in the assembly. It doesn't help you if a good part of it is E. coli (or whatever).
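
A quick first check for the duplicated-scaffold problem could look like the Python sketch below. It is a minimal example, not a full QC pipeline: it only catches records that are exact copies of each other (in either orientation), so partial duplications or copies with a few differences would still need an all-vs-all alignment. The function names and the toy FASTA are made up for illustration.

```python
import hashlib
from collections import defaultdict

# Translation table for complementing DNA (input is upper-cased first).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def duplicated_records(records):
    """Group record names whose sequences are identical or reverse complements."""
    groups = defaultdict(list)
    for name, seq in records:
        seq = seq.upper()
        reverse_complement = seq.translate(COMPLEMENT)[::-1]
        # Use the lexicographically smaller strand as an orientation-free key.
        canonical = min(seq, reverse_complement)
        digest = hashlib.sha1(canonical.encode()).hexdigest()
        groups[digest].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy FASTA: s3 is the reverse complement of s1.
toy_fasta = [">s1", "ACGTTT", ">s2", "GGGCCC", ">s3", "AAACGT"]
print(duplicated_records(read_fasta(toy_fasta)))  # [['s1', 's3']]
```

On a real assembly you would pass an open file handle instead of the toy list, e.g. `duplicated_records(read_fasta(open("scaffolds.fasta")))`, where the file name is just a placeholder.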


              • #8
                Thanks for the input everyone!

                Unfortunately, additional large-scale sequencing is not in the budget for this project, so we will not be able to use mate-pair or PacBio reads to close the genome. The number of Illumi However, we now know to use PacBio for all future genome projects.

                Running the initial assembly through RAST indicates it is a fairly complete genome, with the correct number of proteins and a full complement of rRNAs and tRNAs. At the suggestion of those in this thread, we plan to go ahead with ChIP-Seq and RNA-Seq using the current assembly.
