SEQanswers

Old 09-18-2014, 08:18 AM   #1
Tom_C
Member
 
Location: New Brunswick

Join Date: Aug 2012
Posts: 16
Default Help towards closing a genome?

Hello All,

I am a graduate student trying to learn NGS as I wrap up my PhD. That said, we have sequenced our pet bacterial genome (Illumina HiSeq 2500, PE 101 bp) and I have so far managed to produce what looks to me like a good assembly. Reads were cleaned up with Trimmomatic and assembled using Ray-2.3.1 with a default kmer of 31. The output is as follows:

Contigs >= 100 nt
Number: 28
Total length: 4963730
Average: 177276
N50: 246178
Median: 162206
Largest: 771798
Contigs >= 500 nt
Number: 28
Total length: 4963730
Average: 177276
N50: 246178
Median: 162206
Largest: 771798
Scaffolds >= 100 nt
Number: 22
Total length: 4965242
Average: 225692
N50: 338745
Median: 115189
Largest: 1908686
Scaffolds >= 500 nt
Number: 22
Total length: 4965242
Average: 225692
N50: 338745
Median: 115189
Largest: 1908686

The total length is in good agreement with other sequenced genomes of the same species (ranging from 4.8 to 5.0 Mb). But I am now beyond what anyone at my institute has experience with. I would like to go as far as possible towards closing the genome, but I am unsure how to proceed. Can anyone provide some input on the logical next steps? Thank you very much!
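For readers unfamiliar with the N50 figure in the report above, here is a minimal sketch of how it is usually defined: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly. The function name and the toy lengths are illustrative assumptions; this is not Ray's own code and the numbers are not from this assembly.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L together
    cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical toy lengths, NOT the real contigs from this assembly.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # prints 70 (80 + 70 = 150 >= 290 / 2)
```

A high N50 relative to the genome size, as reported here, means most of the sequence sits in a few long contigs.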
Old 09-19-2014, 06:27 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

First ask the question: why do you want/need a finished genome? How much time and money can you spend on getting one?

If you only care about one or two regions of interest, it may be cost effective to do it the old fashioned way (PCR and "Sanger" capillary sequencing to close the gaps).
Old 09-19-2014, 06:44 AM   #3
Tom_C
Member
 
Location: New Brunswick

Join Date: Aug 2012
Posts: 16
Default

Thanks for the reply!

I had assumed a closed, or mostly closed, genome would make downstream applications much easier. We plan to do ChIP-Seq and possibly RNA-Seq with this bacterium later on, and figured having a mostly closed genome would be best.

That being said, if a closed genome is not required for these experiments we would still like to join as many contigs as possible to publish a decent draft genome. And that is where we need some expert advice.
Old 09-19-2014, 06:57 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

A closed circle is of course nice, but if all you care about is gene content you may be fine as it is. Finishing it will cost time and money whichever route you take.
Old 09-19-2014, 08:01 AM   #5
JohnN
Member
 
Location: Toronto

Join Date: Jan 2011
Posts: 30
Default

There are several approaches of varying complexity and cost:

The easiest, in my recent experience, is to get PacBio sequencing done. With the Illumina reads mapped to a PacBio assembly, you can close and finish the genome in about two days of solid work. But (and there are at least two big buts), it will cost you about $1500 for the sequencing, and the PacBio assembly process is not that easy or automated, so you may have to outsource that too. But it works, and we have done it for about 30 reference genomes needed for diagnostic purposes.

You can find a very closely related genome or two, and use synteny to help you arrange your contigs (Mauve, MUMmer, or reference mapping would help here), and then you can PCR-close the smaller, PCRable gaps. The rRNA regions will be difficult, and you could either ignore them, because they are not really that important for many studies, or generate primer sets to stitch the rRNA reads together. I've done it, it's a pain, but that's what we did in the old days.
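As a rough illustration of the synteny idea above, the sketch below orders contigs by where they align on a related reference. It assumes the alignments have already been produced by some aligner and parsed into simple records; the `Hit` fields and the `order_contigs` function are hypothetical names for this example and are not part of Mauve or MUMmer.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str       # contig name from the draft assembly
    ref_start: int    # start coordinate of the alignment on the reference
    aligned_len: int  # number of aligned bases
    strand: str       # "+" or "-" relative to the reference


def order_contigs(hits):
    """Keep the longest alignment per contig, then sort by reference position."""
    best = {}
    for hit in hits:
        current = best.get(hit.contig)
        if current is None or hit.aligned_len > current.aligned_len:
            best[hit.contig] = hit
    ordered = sorted(best.values(), key=lambda h: h.ref_start)
    return [(h.contig, h.strand) for h in ordered]


# Hypothetical alignment records, for illustration only.
toy_hits = [
    Hit("contig2", 5000, 900, "-"),
    Hit("contig1", 100, 1200, "+"),
    Hit("contig2", 200, 50, "+"),  # short secondary hit, ignored
]
print(order_contigs(toy_hits))
```

Neighbouring contigs in the resulting order are candidate pairs for designing gap-spanning PCR primers. Real genomes can carry rearrangements relative to the reference, so the order is a hypothesis to test, not a result.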

Or, as mentioned above, you can simply use your contig set in your downstream experiments. A large proportion of the genes involved with virulence, etc., are there already. The assembler typically stops extending a contig when the reads are shorter than the repeated region they would need to span. A quick way of assessing the quality of your assembly is to auto-annotate the genome with something like Prokka and look at what you have. You could probably use gap5 to join a few contigs which have some overlap, and to fix the odd frameshift, but you likely have what you need to continue your studies.

Last edited by JohnN; 09-19-2014 at 08:02 AM. Reason: typos
Old 09-19-2014, 12:00 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You already have a very good assembly, and closing the 28 remaining gaps probably won't affect many downstream programs. You will almost certainly need more data for a significant improvement: either a long-mate-pair library for better scaffolding, or PacBio for gap-filling. If you go PacBio, you may as well just run 2-3 SMRT cells and try for a complete single-contig PacBio-only assembly.
Old 09-22-2014, 03:01 AM   #7
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default

I'd first try to scaffold it against a reference, and use that to estimate how much could be missing and whether it is relevant.
If, for example, 3/4 of the gaps turn out to consist of 23S rRNA or stretches of tRNA, then just ignore them.

If the missing parts seem to be more relevant, then there are a few things to consider:
- Is repeat structure a problem? (It doesn't seem so.)
- How much is missing? If it's a larger amount, you might need to consider a second run with higher coverage.
- Is the raw material still there? I think (I'm not a lab person) that a PE jumping library (4-8 kb should get over the rRNAs, as suggested above) can be made from the same input material, which would save time.


You should also do some QC on your genome. It can happen (I've had it with Ray, HGAP and other assemblers as well) that parts are duplicated, which might not be obvious at first. For example, during some other processing of one of our genomes it turned out that it had the right size (5 Mb) and the right number of proteins (5k), but not the right number of "unique" proteins (4k). Why? One of the scaffolds was simply duplicated in the output.
Check as well that there's no obvious contamination in the assembly. It doesn't help you if a good part of it is E. coli (or whatever).
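The duplicated-scaffold check described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it flags scaffolds whose whole sequence is identical to another one, on either strand, and it will not catch partial or near-identical duplications, which would need an alignment-based comparison. The function names and the toy records are hypothetical.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def canonical(seq):
    """Return the smaller of a sequence and its reverse complement,
    so both strands of the same sequence map to one key."""
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, reverse_complement)


def find_duplicates(records):
    """Return (first_name, duplicate_name) pairs for identical sequences."""
    seen = {}
    duplicates = []
    for name, seq in records:
        key = hashlib.sha1(canonical(seq).encode()).hexdigest()
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates


# Toy records for illustration: s3 is the reverse complement of s1.
toy = [">s1", "ACGTTG", ">s2", "GGGCCC", ">s3", "CAACGT"]
print(find_duplicates(read_fasta(toy)))  # [('s1', 's3')]
```

For a real assembly, an open file handle can be passed to `read_fasta` in place of the toy list, since it only needs an iterable of lines.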
Old 09-23-2014, 09:34 AM   #8
Tom_C
Member
 
Location: New Brunswick

Join Date: Aug 2012
Posts: 16
Default

Thanks for the input everyone!

Unfortunately, additional large-scale sequencing is not in the budget for this project, so we will not be able to use mate-pair or PacBio reads to close the genome. The number of Illumi However, we now know to use PacBio for all future genome projects.

Running the initial assembly through RAST indicates it is a fairly complete genome, with the correct number of proteins and a full complement of rRNAs and tRNAs. At the suggestion of those in this thread, we plan to go ahead with ChIP-Seq and RNA-Seq using the current assembly.
All times are GMT -8.