SEQanswers

#1 · wrch (Junior Member, India) · 05-17-2014, 05:02 AM

Help needed for a de novo hybrid assembly strategy

I am working on a de novo sequencing project for a yeast with a genome of ~20 Mb. I have sequenced a 500 bp insert library as 2 x 150 bp paired-end reads on an Illumina MiSeq at ~50x coverage, but several portions of the genome are missing from the assembly. I am now planning two SMRT cells of PacBio long-read sequencing to recover the missing regions. My question: if I use PacBioToCA for assembly, will the long reads spanning the missing regions be filtered out because there are no short reads to correct them?
Can you suggest an alternative strategy? Assembly with PacBio reads alone is an option, but I understand it requires very high coverage (~100x), which is beyond my budget.
#2 · Brian Bushnell (Super Moderator, Walnut Creek, CA) · 05-17-2014, 08:53 AM

Assembly with only PacBio data currently requires around 100x coverage for a single-contig bacterial assembly, but it still works at lower coverage; you'll just get a more fragmented assembly. Chances are good that your genome is actually completely, or almost completely, covered by Illumina reads; some regions are simply at too low a depth, or too repetitive, to assemble. So you may still be able to correct almost all of the PacBio data (though you'll end up with much less data than you started with, because the correction process is inefficient), and thus get a fairly complete genome that way... though when we correct PacBio data with Illumina data, we almost always start with more than 50x Illumina. You might try estimating the genome size from the k-mer frequency distribution of your Illumina data, for example with BBNorm:
khist.sh in=reads.fq hist=histogram.txt k=31

This will give you a 31-mer depth distribution, from which you can manually estimate the genome size. There are also tools that automate this (AllPathsLG includes one, for example), though I don't know how well they work. If the genome size estimated from k-mer frequencies matches your expected genome size, the genome is probably almost completely covered by the Illumina data.
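As a concrete (and deliberately simplified) illustration of reading genome size off a k-mer histogram, here is a Python sketch. It assumes the histogram has a decreasing error tail at low depth followed by a single main coverage peak; this is not BBNorm's or AllPathsLG's actual algorithm, and `genome_size_from_hist` is a hypothetical helper name.

```python
def genome_size_from_hist(hist):
    """Estimate genome size from a k-mer depth histogram.

    hist: sorted list of (depth, distinct_kmer_count) pairs, as in the
    two-column output of khist.sh. Returns an estimated genome size in
    unique k-mers: total k-mer observations past the error valley,
    divided by the main coverage-peak depth.
    """
    counts = dict(hist)
    depths = sorted(counts)
    # Valley: first depth where counts stop decreasing (end of error tail).
    valley = depths[-1]
    for a, b in zip(depths, depths[1:]):
        if counts[b] > counts[a]:
            valley = b
            break
    # Main peak: the most common depth at or beyond the valley.
    peak = max((d for d in depths if d >= valley), key=lambda d: counts[d])
    total = sum(d * c for d, c in hist if d >= valley)
    return total / peak
```

On real data you would read the (depth, count) pairs from the histogram file; repetitive genomes and uneven coverage will make this simple peak/valley heuristic less reliable.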

When we do hybrid fungal assemblies, we often create contigs from an Illumina fragment library and then use PBJelly to fill captured gaps. This does require captured gaps, although I believe a recent version of PBJelly also works with uncaptured gaps.

So, try scaffolding with your existing 500 bp insert library and see how good the scaffolds are; if you end up with ~20 Mb of scaffolds, you should be able to use PBJelly with the PacBio data to fill them in. Otherwise, a long-insert mate-pair library is useful for scaffolding prior to gap filling with PBJelly, but that is expensive too.
Brian Bushnell is offline   Reply With Quote
Old 05-17-2014, 11:06 AM   #3
wrch
Junior Member
 
Location: india

Join Date: Jan 2014
Posts: 7
Default

Thanks for your reply. I have read about another strategy: assembling the Illumina reads and the (low-coverage) PacBio reads separately, gaps and all, and then merging the two assemblies with minimus2. Would that be a better approach?
#4 · rhall (Senior Member, San Francisco) · 05-18-2014, 05:01 PM

Using the latest version of PacBio's HGAP assembler, I often see single-contig bacterial assemblies at ~50x coverage given a sufficiently good long-insert library. The PacBio human assembly (haploid) at 54x has a contig N50 of 4.4 Mb and a maximum contig of 44 Mb.
Given a 20 Mb genome and ~350 Mb per cell, you should be able to reach this with 3-4 SMRT cells.
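The cell-count arithmetic can be sketched in a couple of lines of Python (assuming, as quoted above, ~350 Mb of yield per SMRT cell; actual yield varies by chemistry and library):

```python
import math

def cells_needed(genome_mb, target_cov, yield_per_cell_mb=350):
    """SMRT cells required to reach a target coverage of a genome."""
    return math.ceil(genome_mb * target_cov / yield_per_cell_mb)

# 20 Mb genome at ~50x needs 1000 Mb of data, i.e. 3 cells at 350 Mb/cell.
assert cells_needed(20, 50) == 3
```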
For error correction of PacBio reads with Illumina data I would recommend ECTools over PacBioToCA; it is much more computationally efficient.
Separate assembly followed by merging can work quite well. For the PacBio assembly, the latest version of HGAP (v3) allows self-correction at lower coverage, but be aware of the possibility of introducing misassemblies.
#5 · mbayer (Member, Dundee, Scotland) · 05-19-2014, 01:43 AM

Hi wrch,

you may want to consider generating PacBio CCS reads rather than CLR. CCS reads have a much lower error rate (usually somewhere between 1% and 3%). This comes at the expense of read length, but they generally don't need any error correction.
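As a toy illustration of why circular consensus reads are so much more accurate than single passes: if each pass misread a base independently at an assumed CLR-like ~15% rate, and the consensus were a simple majority vote (real CCS consensus calling is more sophisticated than this), the consensus error rate would fall quickly with the number of passes:

```python
from math import comb

def majority_error(eps, passes):
    """P(the majority of passes are wrong) for an odd number of passes,
    with independent per-pass error rate eps."""
    return sum(comb(passes, k) * eps**k * (1 - eps)**(passes - k)
               for k in range(passes // 2 + 1, passes + 1))

# With 5 passes at 15% per-pass error, the majority-vote consensus error
# is roughly 2.7% -- in the 1-3% range quoted above.
err5 = majority_error(0.15, 5)
```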

I am currently working on a MiSeq/PacBIO CCS dataset, and I have found out after a lot of experimentation that the best approach is this:

- run the MiSeq reads through FLASH, a 3' read overlapper (this may not apply to you if your MiSeq reads don't overlap)
- assemble the overlapped MiSeq reads with MSR-CA
- do a meta-assembly of the MiSeq MSR-CA contigs and singletons with the unassembled PacBio reads, using CAP3 (old-style Sanger OLC assembler, works really well for this)

The end result is that the PacBio reads complement the MiSeq MSR-CA contigs very nicely and connect these across gaps in many cases.

cheers

Micha
#6 · wrch · 05-19-2014, 10:11 AM

Thanks, Micha, for your suggestion. Can you tell me how much data is generated per SMRT cell? I have read that ~300 Mb of CLR data is generated per SMRT cell. Is that true for CCS reads as well?
#7 · rhall · 05-19-2014, 10:14 AM

Micha,
That's an interesting approach, but isn't the read length limited to such an extent that this approach would never complete even relatively simple bacterial assemblies?
http://genomebiology.com/2013/14/9/R101
http://genomebiology.com/content/sup...9-r101-s2.html (CCS read length can be approximated by the C1 distribution in this plot.)
The whole advantage of PacBio for assembly is long range information, which is lost when using CCS reads.
Richard.

#8 · Brian Bushnell · 05-19-2014, 10:22 AM

Quote: Originally Posted by rhall (post #7 above)
PacBio now generates "Reads of Insert", which are essentially the same as CCS reads under a different name. Anyway, we recently generated a batch of these for 16S sequencing, averaging ~1,500 bp and mostly with 95-99% accuracy. PacBio reads have been getting longer quite rapidly, so where that paper suggested shearing to 300-800 bp for CCS, I think it would now be better to target ~1,500-2,500 bp if you want fairly high-quality individual Reads of Insert. You could also, of course, target much longer inserts and simply accept that many reads will come out short.
#9 · rhall · 05-19-2014, 10:34 AM

Even with an optimal insert size, CCS (Reads of Insert) will not give you the long-range information needed for assembly. To maximize throughput and get the best CCS yield at a high number of passes, the resulting read-length distribution will be somewhere around the C1 distribution in that plot, at which very few bacterial assemblies can be completed. Completing even relatively simple bacterial assemblies requires long-range information on the order of ~5 kb, and >5 kb CCS reads are going to be rare.
I don't see a compelling use for CCS / Reads of Insert in assembly.
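The pass-count trade-off behind this argument can be sketched as follows; the ~10 kb polymerase read length is an assumed figure for illustration, not a spec:

```python
def ccs_passes(polymerase_read_bp, insert_bp):
    """Full passes the polymerase makes over a circularized insert.

    More passes mean a more accurate consensus, so for a fixed polymerase
    read length, longer inserts trade consensus accuracy for range.
    """
    return polymerase_read_bp // insert_bp

# With an (assumed) ~10 kb polymerase read:
assert ccs_passes(10000, 2000) == 5  # short insert: many passes, high accuracy
assert ccs_passes(10000, 5000) == 2  # 5 kb insert: only ~2 passes
```

This is why long CCS reads are rare: pushing the insert toward the ~5 kb needed for repeat resolution leaves too few passes for a high-quality consensus.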
#10 · mbayer · 05-26-2014, 01:29 AM

My experience of this has been that the CCS reads - even though they are short by comparison with CLR - complement the Illumina reads nicely and bridge a lot of the gaps between contigs that have low-complexity sequence at the ends (e.g. microsatellites, homopolymer runs), where the de novo assembly of the Illumina reads alone was insufficient. They may not be perfect for completing whole genomes, but they have certainly improved our assemblies substantially (we have been using this approach for assembly of R gene sequences from enrichment sequencing).
#11 · scalabrin (Member, Udine, Italy) · 06-26-2014, 05:42 AM (assembly merge)

Quote: Originally Posted by wrch (post #3 above)
If you still need to merge the two assemblies, you can use GAM-NGS:

http://www.biomedcentral.com/1471-2105/14/S7/S6

Best,
Simone
#12 · akorobeynikov (Member, Saint Petersburg, Russia) · 08-10-2014, 02:59 AM

Quote: Originally Posted by wrch (post #1 above)
You may try SPAdes for a hybrid Illumina + PacBio assembly. It will happily use your PacBio data both to fill in regions unrepresented in the Illumina data and to resolve repeats.
#13 · moistplus (Member, Germany) · 03-24-2016, 12:09 PM

Any advice for larger genomes (on the Gb scale)?
#14 · westerman (Rick Westerman, Purdue University, Indiana, USA) · 03-24-2016, 12:35 PM

Quote: Originally Posted by moistplus (post #13 above)
Look at PacBio's hybrid assembly page:

https://github.com/PacificBioscience...Bio-Long-Reads

For scaffolding pre-existing Illumina-based assemblies with PacBio reads, I haven't had much luck with PBJelly, but (and this is ongoing work) I have hopes for the AHA program.
#15 · moistplus · 03-24-2016, 01:29 PM

I am a bit confused about hybrid assembly. What is the commonly used strategy?

1) Assemble the Illumina data and scaffold with PacBio reads?
2) Assemble the PacBio reads and scaffold with Illumina reads?
3) Correct the PacBio reads with Illumina reads and assemble the result?

Or perhaps other strategies?
#16 · westerman · 03-24-2016, 01:33 PM

All of them.

Realistically, it is hard to tell which approach will work best, and to some extent it depends on how much time you have and the resources already on hand. In my case I will almost always have some sort of Illumina-based assembly first, because we are an Illumina shop; if I then get PacBio reads, layering them on top of the Illumina assembly makes sense. But other people may get PacBio reads first, find that they don't assemble completely, and then go out and get Illumina reads for scaffolding.
#17 · rhall · 03-24-2016, 01:43 PM

westerman: What is your issue with PBJelly? It is generally robust, given a good-quality draft genome as input for gap filling. AHA is very old and no longer available or supported.

moistplus:
Hybrid assembly isn't very common, and de novo scaffolding with PacBio reads is not generally recommended. Gap filling with PBJelly can work really well, but the input Illumina assembly must be high quality. Assembling Illumina and PacBio data together can be successful; this is generally what people mean by a hybrid assembly. Older methods (PacBioToCA, ECTools) corrected the PacBio reads with Illumina data and then assembled them using standard OLC methodology. More recent implementations (DBG2OLC, MaSuRCA) use a much more efficient approach, generally building the Illumina assembly graph first and then using the PacBio data to resolve repeats in the graph.
By far the best results come from PacBio-only de novo assembly.
#18 · westerman · 03-24-2016, 02:08 PM

PBJelly keeps crashing on me. Previous jobs (with potentially poor Illumina assemblies) produced poor results. That said, I have a current job that has been running for three days without crashing, so I am hoping for the best. That job is a small variant of jobs that have crashed on me, but maybe I got everything right this time. As is typical for bioinformatics programs, the troubleshooting messages are not very helpful.

I do realize that PBJelly is newer than AHA, but one has to work with what actually runs.

It looks like you just removed the AHA option from the web page I posted above -- your first edit of that page.

While I haven't used MaSuRCA in a while, it seems to me that it basically does an Illumina assembly and then layers things -- scaffolding, gap filling, whatever -- on top of that assembly. Same category.

I'd point out that while PacBio-only assembly is nice, it doesn't really work for a 1 Gb genome unless one has a lot of $$$.
#19 · rhall · 03-24-2016, 02:47 PM

I removed AHA because I cannot think of any situation in which I would recommend it. It doesn't work for large genomes (>200 Mb), and for smaller genomes, if you have enough data to use it, you have more than enough to try hybrid assembly, or even low-coverage de novo PacBio assembly (>25x).
I would consider scaffolding and resolving ambiguity during assembly to be two different approaches.
The cost for a 1 Gb genome is probably not as high as you think, and if you factor in analysis time it can be competitive. A recent example: an avian genome with a great library, 60 SMRT cells, from raw data to finished genome in ~2 days, with a >15 Mb contig N50. Obviously not everything works out quite so well, but 1 Gb genomes are becoming routine.
#20 · moistplus · 03-24-2016, 03:30 PM

Do you think PacBio data can handle a polyploid genome, i.e. assembling reads from a polyploid genome?

Do you know of any tools?