SEQanswers

Old 11-15-2016, 10:27 AM   #1
k-gun12
Member
 
Location: NJ

Join Date: Feb 2010
Posts: 51
Systemic problem with PacBio data and chimeric contigs

I've got ~30x coverage of a small (< 100 Mb) algal genome using PB RSII. I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I BLASTed the assembly against itself and found dozens and dozens of repeated DNA regions, up to and exceeding 50 kbp, that occur in multiple contigs, usually at the ends but not always. The Canu developers helped tweak my run a bit, but the problem persisted. Recently I used the same workflow with a different alga and saw the exact same problem, and I have since spoken to another lab (working on corals) with identical issues using SMRTmake (not sure if it was HGAP.3 or not). It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this? My runs were all done on different instruments with different extraction protocols. Is the RSII creating chimeric reads? Thanks in advance.
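
For anyone wanting to reproduce this kind of self-comparison check, here is a minimal Python sketch. It assumes the self-BLAST was written in the standard 12-column tabular format (`-outfmt 6`); the function names, the thresholds (10 kbp, 95% identity) and the 1 kbp "end" margin are illustrative assumptions, not values from this thread.

```python
from collections import namedtuple

# One long hit between two different contigs (query coordinates only).
Hit = namedtuple("Hit", "query subject identity length qstart qend")


def parse_self_hits(lines, min_len=10000, min_identity=95.0):
    """Yield long, high-identity hits between two *different* contigs.

    Expects BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        if query == subject:
            continue  # skip a contig matching itself
        if length >= min_len and identity >= min_identity:
            yield Hit(query, subject, identity, length, qstart, qend)


def near_contig_end(hit, contig_lengths, margin=1000):
    """True if the hit touches either end of the query contig."""
    qlen = contig_lengths[hit.query]
    return hit.qstart <= margin or hit.qend > qlen - margin


# Tiny in-memory example; real input would be the lines of the BLAST output.
rows = [
    "c1\tc1\t100.0\t60000\t0\t0\t1\t60000\t1\t60000\t0.0\t1000",
    "c1\tc2\t99.1\t20000\t10\t2\t40001\t60000\t1\t20000\t0.0\t900",
    "c1\tc3\t99.0\t500\t5\t0\t100\t599\t1\t500\t1e-50\t300",
]
hits = list(parse_self_hits(rows))
lengths = {"c1": 60000, "c2": 80000, "c3": 5000}
for h in hits:
    print(h.query, h.subject, h.length, near_contig_end(h, lengths))
```

Counting how many of the long hits sit at contig ends versus the interior gives a quick feel for whether the duplication looks like unresolved overlaps or haplotypes rather than repeats embedded inside contigs.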
Old 11-15-2016, 11:48 AM   #2
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 67

One way to circumvent this is with the overlap_filtering_setting in FALCON. It lets you filter out "chimeric contigs" because overlap coverage differs across such a contig: coverage in repetitive regions will be much higher than everywhere else.
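
The idea behind coverage-based overlap filtering can be sketched in a few lines of Python. This is a conceptual illustration only, not FALCON's implementation: the function name is hypothetical, and the thresholds (`min_cov`, `max_cov`, `max_diff`) and their default values are assumptions chosen for the example.

```python
def keep_read(n_5prime, n_3prime, min_cov=2, max_cov=100, max_diff=100):
    """Decide whether a read's overlaps look like unique sequence.

    n_5prime / n_3prime are the numbers of other reads overlapping the
    read's 5' and 3' ends. All thresholds are illustrative assumptions.
    """
    # A large imbalance between the two ends suggests the read spans a
    # boundary between unique sequence and a repeat (or is chimeric).
    if abs(n_5prime - n_3prime) > max_diff:
        return False
    for count in (n_5prime, n_3prime):
        # Too few overlaps: poorly supported end.
        # Too many overlaps: likely repetitive sequence.
        if count < min_cov or count > max_cov:
            return False
    return True


print(keep_read(30, 28))    # balanced, moderate coverage
print(keep_read(30, 400))   # one end sits in a high-copy repeat
print(keep_read(1, 30))     # one end has almost no support
```

With ~30x data, reasonable cutoffs would presumably sit well below the values commonly used for deeper datasets, so they need tuning per genome.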

I'm not aware of a similar setting in Canu.

Last edited by gconcepcion; 11-15-2016 at 11:58 AM. Reason: clarity
Old 11-16-2016, 11:31 AM   #3
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 314

There is always a non-zero chance of creating biological chimeras in sample prep. Adapters are blunt-end ligated to the sheared DNA, so it is always possible that fragments ligate to one another before adapters are attached. Obviously the adapter concentration is optimized to minimize this, and in general biological chimeras are extremely rare, but mistakes in sample prep can result in much higher numbers.
Even if biological chimeras do occur, they are random, so they should not have support from other reads, i.e. the first step of assembly corrects them. But in cases of bad sample prep it is possible that chimeras, due to their large number, pass correction and result in misassemblies. As pointed out in the above post, preassembly can be parameterized to better handle high levels of biological chimeras (a higher coverage requirement for correction, not using multiple subreads from the same molecule, i.e. not using -a in Falcon), but this will depend on the extent of the problem and the assembler being used.
Old 11-18-2016, 08:46 AM   #4
k-gun12
Member
 
Location: NJ

Join Date: Feb 2010
Posts: 51

Thanks. I have not yet tried Falcon. Maybe it's worth a shot. I think heterozygosity is a real problem for PacBio and I'm wondering if it is causing some of my issues. My samples are multi-isolate cultures and have not spent the years in culture that would breed out variation. I dug up this thread:

https://github.com/marbl/canu/issues/221

That seems to mirror my issues as well. When I noticed this problem, my first thought was "this can't apply only to me", since it was present in every assembly we've made using RSII data regardless of coverage, but perhaps most other folks are using clonal lines or inbred populations.
Old 11-18-2016, 09:04 AM   #5
gconcepcion
Member
 
Location: Menlo Park

Join Date: Dec 2010
Posts: 67

Quote:
Originally Posted by k-gun12 View Post
Thanks. I have not yet tried Falcon. Maybe it's worth a shot. I think heterozygosity is a real problem for PacBio and I'm wondering if it is causing some of my issues. My samples are multi-isolate cultures and have not spent the years in culture that would breed out variation. I dug up this thread:

https://github.com/marbl/canu/issues/221

That seems to mirror my issues as well. When I noticed this problem, my first thought was "this can't apply only to me", since it was present in every assembly we've made using RSII data regardless of coverage, but perhaps most other folks are using clonal lines or inbred populations.
Heterozygosity is a real issue for assembling data from any technology, not just PacBio. This is likely to be an issue with any multi-isolate algal culture. The best approach for algae is to do single-cell isolates and subsequently grow them into a clonal culture. I spent a lot of time as an undergrad and postdoc doing single-cell algal isolates. Not difficult, just tedious. Serial dilutions are key...
Old 11-18-2016, 09:30 AM   #6
k-gun12
Member
 
Location: NJ

Join Date: Feb 2010
Posts: 51

I agree, but Illumina sequencing of these same cultures would not exhibit this problem. Granted, the assembly was in thousands and thousands of contigs, but there was no redundancy and the gene predictions could be trusted. Right now, I'd rather have a fragmented assembly that accurately reflects copy number instead of what outwardly appears to be very large and duplicated gene families. I suppose it depends on where your priorities are.
Old 11-18-2016, 09:55 AM   #7
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 314

It's always going to be difficult to assemble something that is highly heterozygous. If you have Illumina data you may want to try http://www.genome.umd.edu/masurca.html; there is some evidence that this approach better maintains the separation of haplotypes before overlap assembly.
Old 11-18-2016, 10:02 AM   #8
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 314

I'm having a problem understanding why an Illumina assembly wouldn't show the same problem. Is the assumption that areas of high heterozygosity simply get broken in the De Bruijn graph? At some point, even with Illumina data, you will assemble out different haplotypes, particularly in highly heterozygous regions.
Why not just filter the PacBio contigs for consistent expected coverage of raw reads?
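
A minimal Python sketch of such a coverage filter follows. It assumes you already have a mean raw-read depth per contig (for example, averaged from per-base depth after mapping the raw reads back to the assembly); the function name, the use of the median as the "expected" depth, and the 0.5x/1.5x cutoffs are all assumptions for illustration.

```python
from statistics import median


def flag_by_coverage(mean_depth, low=0.5, high=1.5):
    """Flag contigs whose mean depth is far from the assembly-wide median.

    mean_depth maps contig name -> mean raw-read depth.
    Returns (expected_depth, {contig: depth / expected_depth}).
    """
    expected = median(mean_depth.values())
    if expected <= 0:
        raise ValueError("median depth must be positive")
    flagged = {}
    for contig, depth in mean_depth.items():
        ratio = depth / expected
        if ratio < low or ratio > high:
            flagged[contig] = round(ratio, 2)
    return expected, flagged


# Hypothetical depths for five contigs from a ~30x dataset.
depths = {"c1": 30, "c2": 31, "c3": 29, "c4": 14, "c5": 62}
expected, flagged = flag_by_coverage(depths)
print(expected, flagged)
```

Contigs near half the expected depth may be separately assembled haplotypes, while contigs near double or more may be collapsed repeats, so flagged contigs are probably better inspected than discarded outright.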
Old 05-03-2017, 09:18 AM   #9
cstack
Member
 
Location: Florida, US

Join Date: May 2017
Posts: 12

Quote:
Originally Posted by k-gun12 View Post
I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I BLASTed the assembly against itself and found dozens and dozens of repeated DNA regions, up to and exceeding 50 kbp, that occur in multiple contigs, usually at the ends but not always.
Could these be true repetitive sequence? They might occur at the ends of scaffolds because it is difficult to assemble long stretches of repeats.

Quote:
Originally Posted by k-gun12 View Post
It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this?
I had the same thing happen recently when I used PBJelly to fill the gaps of a plant genome assembly using ~20x PacBio coverage. A large (~40 kbp) fragment that seems to belong to the chloroplast was placed in the middle of a very large 10Gbp scaffold. The fragment was nested in a region with a lot of repetitive sequence, and it might have represented an LTR transposon, based on some quick scans with RepeatMasker.

I assumed that PBJelly was misplacing an LTR transposon or other repetitive sequence.

How did you work this out in the end?