  • Help towards closing a genome?

    Hello All,

    I am a graduate student trying to learn NGS as I wrap up my PhD. That said, we have sequenced our pet bacterial genome (Illumina HiSeq 2500, PE 101 bp) and I have so far managed to produce what looks to me like a good assembly. Reads were cleaned up with Trimmomatic and assembled using Ray-2.3.1 with a default k-mer of 31. The output is as follows:

    Contigs >= 100 nt
    Number: 28
    Total length: 4963730
    Average: 177276
    N50: 246178
    Median: 162206
    Largest: 771798
    Contigs >= 500 nt
    Number: 28
    Total length: 4963730
    Average: 177276
    N50: 246178
    Median: 162206
    Largest: 771798
    Scaffolds >= 100 nt
    Number: 22
    Total length: 4965242
    Average: 225692
    N50: 338745
    Median: 115189
    Largest: 1908686
    Scaffolds >= 500 nt
    Number: 22
    Total length: 4965242
    Average: 225692
    N50: 338745
    Median: 115189
    Largest: 1908686

    The total length is in good agreement with other sequenced genomes of the same species (ranging 4.8-5.0 Mb). But I am now beyond what anyone at my institute has experience with. I would like to go as far as possible towards closing the genome, but I am unsure what next steps to take. Can anyone provide some input as to which logical next steps I should take? Thank you very much!
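
For anyone who wants to recompute the summary numbers above from a list of contig lengths, here is a minimal Python sketch of the usual N50 definition. It is an illustration only: the function name `n50` and the example lengths are made up, and Ray's own implementation may break ties or round differently.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Made-up lengths for illustration (not the contigs from this assembly).
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))  # 400
```

With the example lengths the total is 1500, so the running sum first reaches half (750) at the 400 nt contig.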

  • #2
    First ask the question why do you want/need a finished genome? How much time and money can you spend on getting one?

    If you only care about one or two regions of interest, it may be cost-effective to do it the old-fashioned way (PCR and "Sanger" capillary sequencing to close gaps).



    • #3
      Thanks for the reply!

      I had assumed a closed, or mostly closed genome would make downstream applications much easier. We plan to do ChIP-Seq and possibly RNA-Seq with this bacterium later on, and figured having a mostly closed genome would be best.

      That being said, if a closed genome is not required for these experiments we would still like to join as many contigs as possible to publish a decent draft genome. And that is where we need some expert advice.



      • #4
        A closed circle is of course nice, but if all you care about is gene content you may be fine as it is. Finishing it will cost time and money whichever route you take.



        • #5
          There are several approaches of varying complexity and cost:

          The easiest, in my recent experience, is to get PacBio sequencing done. With the Illumina reads mapped to a PacBio assembly, you can close and finish the genome in about two days of solid work. But (and there are at least two big buts) it will cost you about $1500 for the sequencing, and the PacBio assembly process is not that easy or automated, so you may have to outsource that too. But it works, and we have done it for about 30 reference genomes needed for diagnostic purposes.

          You can find a very closely related genome or two and use synteny to help you arrange your contigs (Mauve, MUMmer, or reference mapping would help here), and then you can close the smaller, PCR-able gaps by PCR. The rRNA regions will be difficult; you could either ignore them, because they are not really that important for many studies, or generate primer sets to stitch the rRNA reads together. I've done it, it's a pain, but that's what we did in the old days.
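
To make the reference-ordering idea concrete, here is a rough Python sketch. It assumes you have already aligned the contigs to a related genome and parsed the results into simple records; the `Hit` tuple, the `order_contigs` function, and the coordinates are all hypothetical and do not correspond to the actual output format of Mauve or MUMmer. Each contig is placed by its longest hit, and the contigs are then sorted by reference start position.

```python
from collections import namedtuple

# Hypothetical, already-parsed alignment record: contig name, reference
# start/end coordinates, and the strand of the contig relative to the reference.
Hit = namedtuple("Hit", "contig ref_start ref_end strand")


def order_contigs(hits):
    """Keep the longest hit per contig, then sort contigs by reference start."""
    best = {}
    for hit in hits:
        span = hit.ref_end - hit.ref_start
        current = best.get(hit.contig)
        if current is None or span > current.ref_end - current.ref_start:
            best[hit.contig] = hit
    ordered = sorted(best.values(), key=lambda h: h.ref_start)
    return [(h.contig, h.strand) for h in ordered]


# Made-up hits for illustration only.
hits = [
    Hit("contig-3", 900_000, 1_150_000, "+"),
    Hit("contig-1", 10_000, 400_000, "-"),
    Hit("contig-3", 20_000, 25_000, "+"),  # short secondary hit, e.g. a repeat
    Hit("contig-2", 410_000, 880_000, "+"),
]
for name, strand in order_contigs(hits):
    print(name, strand)
```

Contigs reported on the "-" strand would need to be reverse-complemented before designing primers across the neighbouring gaps. Real rearrangements relative to the reference will of course break this simple ordering, so treat the result as a hypothesis to test by PCR.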

          Or, as mentioned above, you can simply use your contig set in your downstream experiments. A large proportion of the genes involved with virulence, etc., are there already. The assembler typically stops extending a contig when the reads are shorter than the repeated region they would have to span. A quick way of assessing the quality of your assembly is to auto-annotate the genome with something like Prokka and look at what you have. You could probably use Gap5 to join a few contigs which have some overlap, and to fix the odd frameshift, but you likely have what you need to continue your studies.
          Last edited by JohnN; 09-19-2014, 07:02 AM. Reason: typos



          • #6
            You already have a very good assembly, and closing the 28 remaining gaps probably won't affect many downstream programs. You will almost certainly need more data for a significant improvement: either a long-mate-pair library for better scaffolding, or PacBio for gap-filling. If you go PacBio, you may as well just run 2-3 SMRT cells and try for a complete single-contig PacBio-only assembly.



            • #7
              I'd first try to scaffold it against a reference, and try to determine from that how much could be missing and whether it is relevant.
              Because if, e.g., 3/4 of the gaps likely consist of 23S rRNA or stretches of tRNA genes, then just ignore them.

              If the missing parts seem to be more relevant, then there are a few things to consider:
              - is repeat structure a problem (doesn't seem so)
              - how much is missing? If it's a larger amount, then you might need to consider a second run with reasonably high coverage
              - is the raw material still there? Because I think (not a lab person) that a PE jumping library (4-8 kb should get over the rRNAs; as suggested above) can be made from the same input material, so that would save time.


              You should also do some QC on your genome. It can happen (I've had that with Ray, HGAP, and other assemblers as well) that parts are duplicated, which might not be obvious at first. For example, it turned out during some other processing of one of our genomes that it had the right size (5 Mb) and the right number of proteins (5k), but not the right number of "unique" proteins (4k). Why? One of the scaffolds was simply duplicated in the output.
              Check as well that there's no obvious contamination in the assembly. It doesn't help you if a good part is E. coli (or whatever).
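
A quick way to screen for the duplicated-scaffold problem is to hash every sequence in the assembly FASTA and look for collisions. The Python sketch below is a minimal version of that idea, under some loud assumptions: the function names are made up, it only catches scaffolds that are exact copies of each other (in either orientation), and partial or near-identical duplications would need a proper all-vs-all alignment instead.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain-text FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def canonical_key(seq):
    """Hash a sequence so that it and its reverse complement get the same key."""
    forward = seq.upper()
    reverse = forward.translate(_COMPLEMENT)[::-1]
    return hashlib.sha1(min(forward, reverse).encode()).hexdigest()


def find_duplicates(records):
    """Return (first_name, duplicate_name) pairs for exactly duplicated sequences."""
    seen = {}
    duplicates = []
    for name, seq in records:
        key = canonical_key(seq)
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates


# Tiny in-memory demo; "b" is the reverse complement of "a".
demo = [("a", "ACGTT"), ("b", "AACGT"), ("c", "GGGG")]
print(find_duplicates(demo))  # [('a', 'b')]

# On a real assembly (the file name here is a placeholder):
# print(find_duplicates(read_fasta("scaffolds.fasta")))
```

An empty list only tells you there are no exact whole-scaffold copies; it says nothing about contamination, which needs a separate check.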



              • #8
                Thanks for the input everyone!

                Unfortunately, additional large-scale sequencing is not in the budget for this project, so we will not be able to use mate-pair or PacBio reads to close the genome. The number of Illumi However, we now know to use PacBio for all future genome projects.

                Running the initial assembly through RAST indicates it is a fairly complete genome, with the correct number of proteins and a full complement of rRNAs and tRNAs. At the suggestion of those in this thread, we plan to go ahead with ChIP-Seq and RNA-Seq using the current assembly.
