**gconcepcion** · 11-15-2016, 11:48 AM

One way to circumvent this is with the overlap_filtering_setting option in FALCON. It lets you filter out "chimeric contigs" because overlap coverage differs across such a contig: coverage in repetitive regions will be much higher than everywhere else.

I'm not aware of a similar setting in Canu.
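
To make the idea concrete, here is a minimal Python sketch of coverage-based overlap filtering. It is not FALCON's code; the `Overlap` record, the `reads_to_drop` function and the default thresholds are assumptions for illustration, with parameter names chosen to echo the kind of min/max coverage and 5'/3' difference cutoffs that overlap filtering uses.

```python
from collections import Counter
from typing import NamedTuple, Sequence, Set


class Overlap(NamedTuple):
    """Hypothetical overlap record: one read overlapping one end of another."""
    query: str   # read whose ends we are counting overlaps for
    target: str  # read that overlaps it
    end: str     # "5" or "3": which end of the query is covered


def reads_to_drop(overlaps: Sequence[Overlap],
                  min_cov: int = 2,
                  max_cov: int = 100,
                  max_diff: int = 100) -> Set[str]:
    """Return reads whose overlap counts look suspicious.

    A read is dropped when either end has too few overlaps (little support),
    too many overlaps (likely repeat), or when the two ends differ a lot
    (coverage changes along the read, as expected at a repeat boundary).
    Thresholds are illustrative and should be tuned to the data set.
    """
    five = Counter(o.query for o in overlaps if o.end == "5")
    three = Counter(o.query for o in overlaps if o.end == "3")

    dropped = set()
    for read in set(five) | set(three):
        c5, c3 = five[read], three[read]
        if min(c5, c3) < min_cov:
            dropped.add(read)
        elif max(c5, c3) > max_cov:
            dropped.add(read)
        elif abs(c5 - c3) > max_diff:
            dropped.add(read)
    return dropped
```

Reads that survive this kind of filter are the only ones allowed to contribute edges to the assembly graph, which is why raising or lowering the cutoffs changes how aggressively repeat-induced joins are removed.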

**rhall** · 11-16-2016, 11:31 AM

There is always a non-zero chance of creating biological chimeras in sample prep. Adapters are blunt-end ligated to the sheared DNA, so it is always possible that fragments ligate to one another before having adapters attached. Obviously the adapter concentration is optimized to minimize this, and in general biological chimeras are extremely rare, but mistakes in sample prep can result in much higher numbers.
Even if biological chimeras do occur, they are random, so they should not have support from other reads, i.e. the first step of assembly corrects them. But in cases of bad sample prep it is possible that chimeras, due to their large number, pass correction and result in misassemblies. As pointed out in the above post, preassembly can be parameterized to better handle high levels of biological chimeras (a higher coverage requirement for correction, not using multiple subreads from the same molecule, i.e. not using -a in Falcon), but this will depend on the extent of the problem and the assembler being used.
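
The "no support from other reads" argument can be sketched in a few lines of Python. This is only an illustration of the principle, not what any assembler actually does; `unsupported_junctions`, `margin` and `step` are made-up names, and the alignments are assumed to be given as (start, end) intervals on the read being checked.

```python
from typing import List, Sequence, Tuple


def unsupported_junctions(read_length: int,
                          alignments: Sequence[Tuple[int, int]],
                          margin: int = 500,
                          step: int = 100) -> List[int]:
    """Return positions in a read that no other read spans.

    A position counts as supported if at least one alignment covers it with
    `margin` bases to spare on both sides. A random chimeric junction should
    have no such spanning alignment, because other reads match only one side.
    """
    weak = []
    for pos in range(margin, read_length - margin + 1, step):
        spanned = any(start <= pos - margin and end >= pos + margin
                      for start, end in alignments)
        if not spanned:
            weak.append(pos)
    return weak
```

With enough chimeras in the library, several reads can share the same false junction, so the junction does get spanned and a check like this passes it. That is the failure mode described above for bad sample prep.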

**k-gun12** · 11-18-2016, 08:46 AM

Thanks. I have not yet tried Falcon. Maybe it's worth a shot. I think heterozygosity is a real problem for PacBio and I'm wondering if it is causing some of my issues. My samples are multi-isolates and have not spent years in culture that would breed out variation. I dug up this thread:

Enhanced parameters for gene dense and heterozygous genomes · Issue #221 · marbl/canu

https://github.com/marbl/canu/issues/221

Hi everyone ! I'm trying to use Canu in order to assemble the D. suzukii genome. As flies genome are genes dense (genes are very close to each others), and as the D. suzukii species contains a lot ...

That seems to mirror my issues as well. When I noticed this problem, my first thought was "this can't apply only to me", since it was present in every assembly we've made using RSII data regardless of coverage, but perhaps most other folks are using clonal lines or inbred populations.

**gconcepcion** · 11-18-2016, 09:04 AM

Heterozygosity is a real issue for assembling data from any technology, not just PacBio. This is likely to be an issue with any multi-isolate algal culture. The best approach for algae is to do single-cell isolates and subsequently grow them into a clonal culture. I spent a lot of time as an undergrad and postdoc doing single-cell algal isolates. Not difficult, just tedious. Serial dilutions are key...

**k-gun12** · 11-18-2016, 09:30 AM

I agree, but Illumina sequencing of these same cultures would not exhibit this problem. Granted, the assembly was in thousands and thousands of contigs, but there was no redundancy and the gene predictions could be trusted. Right now, I'd rather have a fragmented assembly that accurately reflects copy number instead of what outwardly appears to be very large and duplicated gene families. I suppose it depends on where your priorities are.

**rhall** · 11-18-2016, 09:55 AM

It's always going to be difficult to assemble something that is highly heterozygous. If you have Illumina data you may want to try MaSuRCA (http://www.genome.umd.edu/masurca.html); there is some evidence that this approach better maintains the separation of haplotypes before overlap assembly.

**rhall** · 11-18-2016, 10:02 AM

I'm having a problem understanding why an Illumina assembly wouldn't show the same problem. Is the assumption that areas of high heterozygosity simply get broken in the De Bruijn graph? At some point, even with Illumina data, you will assemble out different haplotypes, particularly in highly heterozygous regions.
Why not just filter the PacBio contigs for consistent expected coverage of raw reads?
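
A rough Python sketch of that kind of coverage filter is below. It assumes you already have a mean raw-read depth per contig from mapping the reads back to the assembly; the function name `flag_contigs_by_coverage` and the 0.5x/1.5x cutoffs are arbitrary choices for illustration, not recommended values.

```python
from statistics import median
from typing import Dict


def flag_contigs_by_coverage(mean_cov: Dict[str, float],
                             low: float = 0.5,
                             high: float = 1.5) -> Dict[str, str]:
    """Flag contigs whose mean read depth is far from the assembly median.

    Contigs near half the expected depth may be separated haplotypes;
    contigs well above it may be collapsed repeats or organelle sequence.
    Both interpretations are assumptions to check, not conclusions.
    """
    expected = median(mean_cov.values())
    if expected <= 0:
        raise ValueError("median coverage must be positive")

    flagged = {}
    for name, cov in mean_cov.items():
        ratio = cov / expected
        if ratio < low:
            flagged[name] = "low"
        elif ratio > high:
            flagged[name] = "high"
    return flagged
```

Using the median rather than the mean as the expected depth keeps a few very deep contigs (chloroplast, for example) from shifting the baseline.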

**cstack** · 05-03-2017, 08:18 AM

Originally posted by k-gun12

I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I blasted the genome into itself and found dozens and dozens of repeated DNA regions up to and >50 kbp that occur in multiple contigs - usually at the ends but not always.

Could these be true repetitive sequences? They might occur at the ends of scaffolds because it is difficult to assemble long stretches of repeats.
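
For anyone wanting to repeat the self-BLAST check, here is a small Python sketch that pulls long between-contig hits out of BLAST tabular output. It assumes the default 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name and the length and identity cutoffs are placeholders to adjust.

```python
import csv
import io
from typing import Iterable, List, Tuple

Hit = Tuple[str, int, int, str, int, int, float]


def repeated_regions(blast_tsv: Iterable[str],
                     min_length: int = 10_000,
                     min_identity: float = 95.0) -> List[Hit]:
    """Collect long, high-identity hits between *different* contigs.

    Hits of a contig to itself are skipped, which removes the trivial
    full-length self match but also hides repeats within a single contig.
    """
    hits = []
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if qseqid == sseqid:
            continue
        if length >= min_length and pident >= min_identity:
            hits.append((qseqid, int(row[6]), int(row[7]),
                         sseqid, int(row[8]), int(row[9]), pident))
    return hits


# Made-up example line: the end of ctgA matching the start of ctgB.
example = io.StringIO(
    "ctgA\tctgB\t99.2\t52000\t400\t10\t448001\t500000\t1\t52000\t0.0\t90000\n"
)
print(repeated_regions(example))
```

Comparing the hit coordinates with contig lengths shows whether the shared regions sit at contig ends, and checking read depth over them helps separate true repeats from duplicated haplotypes.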

Originally posted by k-gun12

It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this?

I have had the same thing happen recently when I used PBJelly to fill in the gaps of a plant genome assembly using ~20x PacBio coverage. A large (~40 kbp) fragment that seems to belong to the chloroplast was placed in the middle of a very large 10Gbp scaffold. The fragment was nested in a region with a lot of repetitive sequence, and it might have represented an LTR transposon, based on some quick scans with RepeatMasker.

I assumed that PBJelly was misplacing an LTR transposon or other repetitive sequence.

How did you work this out in the end?

Systemic problem with PacBio data and chimeric contigs
