I've got ~30x coverage of a small (<100 Mb) algal genome from a PacBio RSII. I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I BLASTed the genome against itself and found dozens and dozens of repeated DNA regions, up to and >50 kbp, that occur in multiple contigs, usually at the ends but not always. The Canu developers helped tweak my run a bit, but the problem persisted. Recently I used the same workflow with a different alga and saw the exact same problem, and I have recently spoken to another lab (working on corals) with identical issues using SMRTmake (not sure if it was HGAP.3 or not). It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this? My runs were all done on different instruments with different extraction protocols. Is the RSII creating chimeric reads? Thanks in advance.
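For anyone wanting to repeat the self-comparison, here is a minimal sketch of how the hits could be summarised. It assumes the self-BLAST was written in the standard 12-column tabular format and that contig lengths are available separately; the function name, the thresholds and the `end_window` parameter are illustrative choices, not part of any tool.

```python
from typing import Dict, Iterable, List, NamedTuple


class SharedRegion(NamedTuple):
    query: str
    subject: str
    q_start: int
    q_end: int
    length: int
    identity: float
    at_query_end: bool


def shared_regions(
    blast_lines: Iterable[str],
    contig_lengths: Dict[str, int],
    min_len: int = 10_000,        # hypothetical cutoff for a "large" shared region
    min_identity: float = 95.0,   # hypothetical percent-identity cutoff
    end_window: int = 1_000,      # how close to a contig end counts as "at the end"
) -> List[SharedRegion]:
    """Report long, high-identity hits between two different contigs."""
    regions = []
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        query, subject = f[0], f[1]
        if query == subject:
            continue  # skip self-hits (within-contig repeats are ignored here)
        identity, aln_len = float(f[2]), int(f[3])
        if aln_len < min_len or identity < min_identity:
            continue
        q_start, q_end = sorted((int(f[6]), int(f[7])))
        at_end = q_start <= end_window or q_end > contig_lengths[query] - end_window
        regions.append(
            SharedRegion(query, subject, q_start, q_end, aln_len, identity, at_end)
        )
    return regions
```

Counting how many of the returned regions have `at_query_end` set gives a rough idea of whether the duplication is concentrated at contig ends, which is what I was seeing.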
-
One way to circumvent this is with the overlap_filtering_setting in FALCON. This allows you to filter out "chimeric contigs", because overlap coverage will differ across such a contig: the coverage in repetitive regions will be much higher than everywhere else.
I'm not aware of a similar setting in Canu.
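To make the idea concrete, here is a conceptual sketch of coverage-based overlap filtering. This is not FALCON's code; the parameter names `min_cov`, `max_cov` and `max_diff` mirror my understanding of the options that filter accepts, which is an assumption to check against the FALCON documentation. The input is simply a list of (read, end) pairs, one per overlap touching that end of the read.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def filter_reads(
    overlap_ends: Iterable[Tuple[str, str]],  # (read_id, "5" or "3") per overlap
    min_cov: int = 2,
    max_cov: int = 100,
    max_diff: int = 100,
) -> Set[str]:
    """Keep reads whose overlap counts at both ends look like unique sequence."""
    five, three = Counter(), Counter()
    for read_id, end in overlap_ends:
        if end == "5":
            five[read_id] += 1
        elif end == "3":
            three[read_id] += 1
        else:
            raise ValueError(f"unexpected read end: {end!r}")

    kept = set()
    for read_id in set(five) | set(three):
        c5, c3 = five[read_id], three[read_id]
        if min(c5, c3) < min_cov:
            continue  # too little support at one end, possibly a chimera or read end
        if max(c5, c3) > max_cov:
            continue  # far too many overlaps, probably repetitive sequence
        if abs(c5 - c3) > max_diff:
            continue  # lopsided coverage, e.g. a read running into a repeat
        kept.add(read_id)
    return kept
```

Sensible thresholds depend on the raw coverage, so with ~30x data the upper bound would presumably need to be much lower than the defaults shown here.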
-
There is always a nonzero chance of creating biological chimeras in sample prep: adapters are blunt-end ligated to the sheared DNA, so it is always possible that fragments ligate to one another before adapters are attached. Obviously the adapter concentration is optimized to minimize this, and in general biological chimeras are extremely rare, but mistakes in sample prep can result in much higher numbers.
Even if biological chimeras do occur, they are random, so they should not have support from other reads, i.e. the first step of assembly corrects them. But in cases of bad sample prep it is possible that chimeras, due to their large number, pass correction and result in misassemblies. As pointed out in the post above, preassembly can be parameterized to better handle high levels of biological chimeras (a higher coverage requirement for correction, not using multiple subreads from the same molecule, i.e. not using -a in Falcon), but this will depend on the extent of the problem and the assembler being used.
-
Thanks. I have not yet tried Falcon. Maybe it's worth a shot. I think heterozygosity is a real problem for PacBio and I'm wondering if it is causing some of my issues. My samples are multi-isolates and have not spent years in culture that would breed out variation. I dug up this thread:
Hi everyone ! I'm trying to use Canu in order to assemble the D. suzukii genome. As flies genome are genes dense (genes are very close to each others), and as the D. suzukii species contains a lot ...
That seems to mirror my issues as well. When I noticed this problem, my first thought was "this can't apply only to me", since it was present in every assembly we've made using RSII data regardless of coverage, but perhaps most other folks are using clonal lines or inbred populations.
-
Heterozygosity is a real issue for assembling data from any technology, not just PacBio. This is likely to be an issue with any multi-isolate algal culture. The best approach for algae is to make single-cell isolates and then grow them into a clonal culture. I spent a lot of time as an undergrad and postdoc doing single-cell algal isolations. Not difficult, just tedious. Serial dilutions are key...
-
I agree, but Illumina assemblies of these same cultures did not exhibit this problem. Granted, the assembly was in thousands and thousands of contigs, but there was no redundancy and the gene predictions could be trusted. Right now, I'd rather have a fragmented assembly that accurately reflects copy number than what outwardly appear to be very large, duplicated gene families. I suppose it depends on where your priorities are.
-
It's always going to be difficult to assemble something that is highly heterozygous. If you have Illumina data, you may want to try MaSuRCA (http://www.genome.umd.edu/masurca.html); there is some evidence that this approach better maintains the separation of haplotypes before overlap assembly.
-
I'm having a problem understanding why an Illumina assembly wouldn't show the same problem. Is the assumption that areas of high heterozygosity simply get broken in the de Bruijn graph? At some point, even with Illumina data, you will assemble out different haplotypes, particularly in highly heterozygous regions.
Why not just filter the PacBio contigs for consistent expected coverage of raw reads?
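Something along these lines is what I have in mind; treat it as a rough sketch. It assumes you already have a mean raw-read depth per contig (for example from mapping the raw reads back to the assembly), and the `low` and `high` ratio cutoffs are arbitrary illustrative values, not recommendations.

```python
from statistics import median
from typing import Dict


def flag_contigs_by_coverage(
    mean_depth: Dict[str, float],
    low: float = 0.6,   # hypothetical cutoff: below this fraction of the median
    high: float = 1.5,  # hypothetical cutoff: above this multiple of the median
) -> Dict[str, str]:
    """Label each contig by its depth relative to the median contig depth."""
    if not mean_depth:
        return {}
    typical = median(mean_depth.values())
    labels = {}
    for contig, depth in mean_depth.items():
        ratio = depth / typical if typical > 0 else 0.0
        if ratio < low:
            labels[contig] = "low"       # e.g. a separately assembled haplotype
        elif ratio > high:
            labels[contig] = "high"      # e.g. a collapsed repeat or organelle
        else:
            labels[contig] = "expected"
    return labels
```

Contigs labelled "low" at roughly half the typical depth would be candidates for alternate haplotypes, while very high depth would be consistent with chloroplast or collapsed repeat sequence.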
-
Originally posted by k-gun12: "I corrected, assembled and polished the genome with Canu, and was pretty pleased with the results until I blasted the genome into itself and found dozens and dozens of repeated DNA regions up to and >50 kbp that occur in multiple contigs - usually at the ends but not always."
Could these be true repetitive sequences? They might occur at the ends of scaffolds because it is difficult to assemble long stretches of repeats.
Originally posted by k-gun12: "It has gotten so bad that I've found chloroplast fragments assembled in with the genomic DNA contigs. Has anyone else encountered this?"
I have had the same thing happen recently when I used PBJelly to fill in the gaps of a plant genome assembly using ~20x PacBio coverage. A large (~40 kbp) fragment that seems to belong to the chloroplast was placed in the middle of a very large 10Gbp scaffold. The fragment was nested in a region with a lot of repetitive sequence, and it might have represented an LTR transposon, based on some quick scans with RepeatMasker.
I assumed that PBJelly was misplacing an LTR transposon or other repetitive sequence.
How did you work this out in the end?