
**rhall** · 07-25-2014, 08:55 AM

There could be a number of reasons why the yield is so low, but here is my best guess.
Because you have such high coverage, the seed length cutoff is very high and sits at the extreme of the subread length distribution. At that extreme, inserts above the seed length are most likely chimeras (blunt-end ligation between fragments during sample prep) or are missing adapters. A missing adapter can be seen in a dot plot of the longest reads, where it shows up as repeats in the forward and reverse strands. The library most likely does not have many real subreads >14kb, particularly given the difficulty of fragmenting such a small genome. During preassembly, a seed read that is chimeric or missing an adapter will be split due to lack of support, and only a portion of the read will be corrected. This can be seen in the 3,543bp Pre-Assembled read length.
The best way to test this hypothesis would be to sub-sample the data to a coverage of ~100x, but unfortunately there is no easy way to do this. It is possible to use the whitelist functionality to select 1/10 of the reads (https://github.com/PacificBioscience...sting-Tutorial), or simply to tighten the filtering parameters to limit the coverage. I would increase the minimum polymerase read length until only ~100x passes the filter. You could instead increase the subread length filter or the RQ, but these may introduce other issues.
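
As a rough illustration of "increase the polymerase read length until only ~100x passes filter", here is a minimal Python sketch. It assumes you have already extracted the polymerase read lengths into a plain list. The function name `min_length_cutoff` and the example numbers are hypothetical and are not part of SMRT Analysis.

```python
def min_length_cutoff(read_lengths, genome_size, target_coverage=100):
    """Return a minimum-length cutoff such that reads at or above it
    add up to at least target_coverage x genome_size bases.

    Returns 0 when the whole dataset is below the target, meaning
    no length filter is needed.
    """
    target_bases = genome_size * target_coverage
    total = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical toy data: 2 kb "genome", 100 reads.
    lengths = [1000] * 50 + [5000] * 30 + [10000] * 20
    print(min_length_cutoff(lengths, genome_size=2000))
```

Reads that tie with the cutoff length all pass, so the coverage actually retained can be somewhat above the target.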

**earleyej** · 07-25-2014, 11:31 AM

Appreciate the help. I hadn't thought of the chimeric reads idea. I'll give this a shot.

Are extremely long reads (say, >15kb) frequently chimeras? Any idea of the frequency?

**rhall** · 07-25-2014, 03:33 PM

Given a library with really long fragments, long reads are no more likely to be chimeric than shorter reads. The rate is ~1%, but this varies with library prep. In this case the hypothesis is that there are no extremely long fragments. However, because polymerase read lengths depend simply on movie time, we occasionally record really long anomalous reads, either because we don't detect the adapters in the read (the multiple passes are concatenated as forward and reverse repeats) or because the inserts are ligations of multiple fragments. If we don't have long fragments >14kb, it is only in these cases that we generate reads >14kb.
This is only a hypothesis, mind you.
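
The missing-adapter signature described above (forward and reverse passes concatenated into one read) can be illustrated with a toy check: such a read shares many k-mers with its own reverse complement. This is only a sketch. The function names and the choice of `k=13` are assumptions, and exact k-mer matching is far too strict for raw reads with a high error rate, where an alignment-based dot plot is the practical tool.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_revcomp_fraction(seq, k=13):
    """Fraction of distinct k-mers in seq whose reverse complement
    also occurs in seq. Values near 1 suggest the read folds back
    on itself, as expected when an adapter was not detected."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if not kmers:
        return 0.0
    rc_kmers = {reverse_complement(kmer) for kmer in kmers}
    return len(kmers & rc_kmers) / len(kmers)


if __name__ == "__main__":
    insert = "ACCACAAACCCAACACGGT"  # hypothetical error-free insert
    folded = insert + reverse_complement(insert)
    print(self_revcomp_fraction(folded))
```

An odd `k` is used so that no single k-mer can be its own reverse complement.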

**earleyej** · 07-29-2014, 07:58 AM

so to clarify:
- the concern here is chimeric fragments/reads >14kb, especially considering the difficulty in fragmenting such a small genome
- Reduce overall coverage: in the filtering step, bump up the minimum polymerase read length until only ~100x sequence remains

But why would I enrich my subread pool for longer reads if I'm concerned about longer reads causing the problem? It seems like smaller reads would be better in this case. So really I need a 'max' polymerase read length filter. Or am I misunderstanding?

**rhall** · 07-30-2014, 08:13 AM

Increasing the minimum polymerase read length will not, in general, select for longer subreads. You will still have short subreads from inserts that have multiple passes (a long polymerase read can be multiple passes of a short insert). In fact, due to a peculiarity of the polymerase action, shorter subreads are more likely to generate longer polymerase reads.
The ideal solution would be to set a max subread length (~7kb) and randomly select 100x of the resulting reads, but unfortunately the only way I can think of to do this is whitelisting (see link above).
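
A minimal sketch of that "ideal solution", assuming the subread IDs and lengths have already been pulled into a dictionary. The function name, the ID strings, and the seeding are hypothetical. How the chosen IDs are then written into a whitelist is covered by the tutorial linked above and is not shown here.

```python
import random


def sample_subreads(subread_lengths, genome_size,
                    max_length=7000, target_coverage=100, seed=0):
    """Drop subreads longer than max_length, then randomly pick
    reads until about target_coverage x genome_size bases are chosen.

    subread_lengths: dict mapping read ID -> length in bases.
    Returns the list of selected read IDs.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    eligible = [(rid, n) for rid, n in subread_lengths.items()
                if n <= max_length]
    rng.shuffle(eligible)

    target_bases = genome_size * target_coverage
    chosen, total = [], 0
    for rid, n in eligible:
        if total >= target_bases:
            break
        chosen.append(rid)
        total += n
    return chosen


if __name__ == "__main__":
    # Hypothetical toy data: 500 reads of 1 kb plus one 20 kb outlier.
    reads = {f"read{i}": 1000 for i in range(500)}
    reads["long_read"] = 20000
    picked = sample_subreads(reads, genome_size=1000)
    print(len(picked), "long_read" in picked)
```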

**jbadalam** · 08-04-2014, 10:40 AM

This post comes more out of curiosity rather than troubleshooting, but here goes...

I recently assembled two bacterial genomes (3.8 and 4.5 Mbp) using HGAP3 (default options) and I've been running into the same problem with much lower pre-assembled yield than I had been seeing previously. Both libraries had 20-kb size-selected inserts and were sequenced with P4-C2 chemistry and 180-minute movies. Each library was sequenced with 2 SMRTcells with a yield of ~1 Gbp per genome, presumably more than sufficient coverage for HGAP.

The 4.5-Mbp genome assembled into 2 contigs (with evidence for breaking at multiple ~14-17-kb repeats such that more data may in fact be needed), but the pre-assembled yield was low, around ~0.42. As a result, Celera ran with only ~55 Mbp (~12x coverage), even though the read length distribution showed 100x coverage in reads >10 kb, and 6x in reads >20kb--that is, I had more than enough data to assemble with 25-30x.

The same thing happened with the 3.8-Mbp genome, but the pre-assembled yield was even lower (0.38). Fortunately this genome isn't repetitive and it closed on the first HGAP attempt, but again the assembly was performed with only ~11x coverage. I also tried PBcR in the latest wgs release (8.1alpha) and got the same coverage going into the assembly.
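
For reference, the coverage arithmetic behind these numbers can be sketched as follows for the 4.5-Mbp genome. Two assumptions are made here: that the ~55 Mbp going into Celera is the pre-assembled output, and that pre-assembled yield is defined as pre-assembled bases divided by seed bases. Neither is stated in the thread.

```python
def coverage(bases, genome_size):
    """Fold coverage: total sequenced bases divided by genome length."""
    return bases / genome_size


GENOME_SIZE = 4.5e6         # 4.5-Mbp genome from the post
RAW_BASES = 1e9             # ~1 Gbp of raw yield per genome
ASSEMBLED_BASES = 55e6      # ~55 Mbp that went into Celera
PREASSEMBLED_YIELD = 0.42   # reported pre-assembled yield

raw_cov = coverage(RAW_BASES, GENOME_SIZE)
assembled_cov = coverage(ASSEMBLED_BASES, GENOME_SIZE)

# Assumption: yield = pre-assembled bases / seed bases, so invert it
# to estimate how many bases were selected as seeds.
seed_bases = ASSEMBLED_BASES / PREASSEMBLED_YIELD
seed_cov = coverage(seed_bases, GENOME_SIZE)

print(f"raw: {raw_cov:.0f}x, seeds: {seed_cov:.0f}x, "
      f"assembled: {assembled_cov:.0f}x")
```

Under those assumptions the seed set works out to roughly 29x, which the 0.42 yield then cuts to about 12x.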

I should mention that a few months back we sequenced three other bacterial strains with the same DNA extraction protocols, library prep, size selection, and chemistry, and the pre-assembled yields were significantly better (0.74 to 0.88).

So my questions are:
1) were chimeric reads an issue here as well? And if so, what changes (if any) can be made to sample/library prep to minimize them?
2) could I adjust the target coverage to, say, 50x, forcing the length cutoff to be lowered knowing that self-correction will have low yield, such that I end up assembling with 25x?
3) are there any other HGAP or PBcR parameters that could be tweaked if I know that raw read coverage is sufficient?

**rhall** · 08-04-2014, 12:13 PM

I don't think the reason for the low yield is the same as in the above case. From the results, it would appear that you have no problems generating long >20kb inserts.

1) were chimeric reads an issue here as well? And if so, what changes (if any) can be made to sample/library prep to minimize them?

As a check, you should look at where you are losing bases in the preassembly: are the preassembled reads getting shorter (Pre-Assembled Read Length <<< Seed Cutoff), or are seed reads not getting corrected at all (Number of Corrected Reads <<< Seed Reads)? If seed reads are getting heavily truncated, that would indicate that the longest reads don't have support over their length and are possibly chimeric. Given that you have a reasonable assembly (reference), you could check this by running Bridgemapper and comparing the number of bridged reads / bridge-distance with one of the high yield samples. If the number of corrected reads drops, that indicates a lower-coverage contaminant within the sample (lower coverage data will not correct). You can check this by looking at the number of unmapped reads in the resequencing / polishing step.
A high number of chimeras could be caused by a low adapter concentration in the adapter ligation step.
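
The two checks above can be written down as a small helper. This is a sketch only: the function name, the returned labels, and the `ratio_threshold` of 0.5 standing in for "<<<" are arbitrary assumptions, and the inputs are the four values named in the checks, typed in by hand.

```python
def diagnose_preassembly(seed_cutoff, preassembled_read_length,
                         seed_reads, corrected_reads,
                         ratio_threshold=0.5):
    """Flag the two failure modes described above.

    Returns a list that may contain:
      "truncated"   - corrected reads are much shorter than the seed
                      cutoff (seeds lack support over their length,
                      possibly chimeric)
      "uncorrected" - far fewer corrected reads than seed reads
                      (possible lower-coverage contaminant)
    """
    findings = []
    if preassembled_read_length < ratio_threshold * seed_cutoff:
        findings.append("truncated")
    if corrected_reads < ratio_threshold * seed_reads:
        findings.append("uncorrected")
    return findings


if __name__ == "__main__":
    # Hypothetical report values, not taken from either dataset.
    print(diagnose_preassembly(seed_cutoff=14000,
                               preassembled_read_length=3500,
                               seed_reads=10000,
                               corrected_reads=9500))
```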

2) could I adjust the target coverage to, say, 50x, forcing the length cutoff to be lowered knowing that self-correction will have low yield, such that I end up assembling with 25x?

Manually setting the seed length lower, or reducing the estimated genome size will have the effect of increasing the coverage going into preassembly. Even with low yield so long as >12x coverage goes into CA I generally see good results.

3) are there any other HGAP or PBcR parameters that could be tweaked if I know that raw read coverage is sufficient?

It is possible to increase the yield by reducing the coverage required for correction. This generally allows more anomalous data through preassembly and will result in misassemblies, but it can be a useful diagnostic tool.

**jbadalam** · 08-12-2014, 02:56 PM

There does appear to be some "contamination" in the unmapped subreads (15% for the low yield genome vs. 8% for the genome with high preassembled yield). Some of these reads look chimeric, but overall the fraction of chimeric reads is similar for both genomes (0.66% for low yield, 0.4% for high yield) - thanks to this script: http://www.cbcb.umd.edu/software/pbcr/

If the dataset has a normal number of chimeric reads, what else might lead to the longest reads not having support across their length?

With Bridgemapper output in SMRTview, what's the difference between primary, prolog, and epilog? And by default any bridged reads are not used for consensus calling with quiver because they ambiguously map, correct?

**rhall** · 08-12-2014, 03:39 PM

If the dataset has a normal number of chimeric reads, what else might lead to the longest reads not having support across their length?

Given a homogeneous sample: coverage variation, either stochastic or due to the mapping of a repetitive element.

With Bridgemapper output in SMRTview, what's the difference between primary, prolog, and epilog? And by default any bridged reads are not used for consensus calling with quiver because they ambiguously map, correct?

Blasr makes the best local alignment of the subread, which can leave a prolog and an epilog sequence that are not aligned. Bridgemapper takes these sequences and independently maps and aligns them. Incorrect: a bridge-mapped subread is not mapped ambiguously (it is a local alignment); it just has portions that map to different places. These reads are used in calling the quiver consensus.
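
To make the primary / prolog / epilog terminology concrete, here is a small sketch of how a subread divides around its best local alignment. It is not Bridgemapper's code. The `Segment` type, the function name, and the half-open coordinate convention are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    name: str
    start: int  # 0-based, inclusive, in subread coordinates
    end: int    # exclusive


def split_subread(read_length: int, aln_start: int,
                  aln_end: int) -> List[Segment]:
    """Split a subread into the primary local alignment plus any
    unaligned prolog (before it) and epilog (after it)."""
    if not 0 <= aln_start < aln_end <= read_length:
        raise ValueError("alignment must lie within the read")
    segments = []
    if aln_start > 0:
        segments.append(Segment("prolog", 0, aln_start))
    segments.append(Segment("primary", aln_start, aln_end))
    if aln_end < read_length:
        segments.append(Segment("epilog", aln_end, read_length))
    return segments


if __name__ == "__main__":
    # Hypothetical 10 kb subread whose best local alignment spans 2-7 kb.
    for seg in split_subread(10000, 2000, 7000):
        print(seg.name, seg.start, seg.end)
```

The prolog and epilog pieces are what would then be mapped independently.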


Low Pre-Assembly yield in HGAP2
