Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Low Pre-Assembly yield in HGAP2

    I'm noticing surprisingly low pre-assembly yields for an HGAP 2.2 run on a small bacteriophage genome (~150kb). Depending mostly on the min seed read length, I see yields between 0.081 to 0.19. I know "good" yields are at least >0.5. Any idea why I'm seeing such low yields?

    Sequencing:
    1 SMRT
    ~426Mb sequence

    HGAP details (best run):
    min seed read length = auto
    tgt cov = 15x
    genome size = 150000
    all else is default

    HGAP results:
    1 contig @ ~139kb
    Polymerase Read Bases 295,181,019
    Length Cutoff 14,925
    Seed Bases 4,512,649
    Pre-Assembled bases 899,955
    Pre-Assembled Yield .199
    Pre-Assembled Reads 254
    Pre-Assembled Reads Length 3,543
    Pre-Assembled N50 5,210

    Is this simply because the genome is tiny? Theoretical coverage is ~1,966x (from polymerase read bases estimate); so is HGAP limiting the number of subreads participating in the pre-assembly?

    Thanks!

  • #2
    There could be a number of reasons why the yield is so low, but here is my best guess.
    Because you have such high coverage the seed length is very high and at the extremes of the subread length distribution, to such an extent that actual inserts above this seed length are most likely chimeras (blunt end ligation between fragments in the sample prep), or have missing adapters (this can be seen by looking of a dot plot of the longest reads, a missing adapter will show up as repeats in the forward and reverse strand). The library most likely does not have many real subreads >14kb particularly given the difficulty in fragmenting such a small genome. During preassembly a seed read that is chimeric or missing an adapter will be split due to the lack of support and only a portion of the read will be corrected. This can be seen in the 3,543bp Pre-Assembled readlength.
    To test this hypothesis the best way would be to sub-sample the data to a coverage of ~100x, but unfortunately there is no easy way to do this. It is possible to use the whitelist functionality to select for 1/10 of the reads https://github.com/PacificBioscience...sting-Tutorial or simply increase the filtering parameters to limit the coverage, I would increase the ploymerase read length until only ~100x passes filter, you could increase the subread length filter, or the RQ, but these may introduce other issues.

    Comment


    • #3
      Appreciate the help. I hadn't thought of the chimeric reads idea. I'll give this a shot.

      Are extremely long reads (say, >15kb) frequently chimeras? Any idea of the frequency?

      Comment


      • #4
        Given a library with really long fragments long reads are no more likely to be chimeric than shorter reads, the rate is ~1%, but this is variable dependent on library prep. In this case the hypothesis is that there are no extremely long fragments, but given that the polymerase read lengths are simply dependent on movie time, there are some occasions where we record really long anomalous reads either because we don't detect the adapters in the read (the multiple passes are catted together in forward and backwards repeats) or we have inserts that are ligations of multiple fragments. If we don't have long fragments >14kb it is only in these cases that we generate reads 14kb.
        This is only a hypothesis mind you

        Comment


        • #5
          so to clarify:
          - the concern here is chimeric fragments/reads >14kb, especially considering the difficulty in fragmenting such a small genome
          - Reduce overall coverage: in the filtering step, bump up the minimum polymerase read length until only ~100x sequence remains

          But why would I enrich my subread pool for longer reads if I'm concerned about longer reads causing the problem? It seems like smaller reads would be better in this case. So really I need a 'max' polymerase read length filter. Or am I misunderstanding?

          Comment


          • #6
            Increasing the minimum polymerase read length will not in general select for longer subreads, you will still have short subreads from inserts that have multiple passes (a long polymerase read can be multiple passes of a short insert). Actually due to a peculiarity of the polymerase action, shorter subreads are more likely to generate longer polymerase reads.
            The ideal solution would be to set a max subread length (~7kb) and randomly select 100x of the resulting reads, but unfortunately the only way I can think of doing this is using whitelisting (see link above)

            Comment


            • #7
              This post comes more out of curiosity rather than troubleshooting, but here goes...

              I recently assembled two bacterial genomes (3.8 and 4.5 Mbp) using HGAP3 (default options) and I've been running into the same problem with much lower pre-assembled yield than I had been seeing previously. Both libraries had 20-kb size-selected inserts and were sequenced with P4-C2 chemistry and 180-minute movies. Each library was sequenced with 2 SMRTcells with a yield of ~1 Gbp per genome, presumably more than sufficient coverage for HGAP.

              The 4.5-Mbp genome assembled into 2 contigs (with evidence for breaking at multiple ~14-17-kb repeats such that more data may in fact be needed), but the pre-assembled yield was low, around ~0.42. As a result, Celera ran with only ~55 Mbp (~12x coverage), even though the read length distribution showed 100x coverage in reads >10 kb, and 6x in reads >20kb--that is, I had more than enough data to assemble with 25-30x.

              The same thing happened with the 3.8-Mbp genome, but the pre-assembled yield was even lower (0.38). Fortunately this genome isn't repetitive and it closed on the first HGAP attempt, but again the assembly was performed with only ~11x coverage. I also tried PBcR in the latest wgs release (8.1alpha) and got the same coverage going into the assembly.

              I should mention that a few months back we sequenced three other bacterial strains with the same DNA extraction protocols, library prep, size selection, and chemistry, and the pre-assembled yields were significantly better (0.74 to 0.88).

              So my questions are:
              1) were chimeric reads an issue here as well? And if so, what changes (if any) can be made to sample/library prep to minimize them?
              2) could I adjust the target coverage to, say, 50x, forcing the length cutoff to be lowered knowing that self-correction will have low yield, such that I end up assembling with 25x?
              3) are there any other HGAP or PBcR parameters that could be tweaked if I know that raw read coverage is sufficient?

              Comment


              • #8
                I don't think the reason for the low yield is the same as in the above case, from the results it would appear that you have no problems generating long >20kb inserts.
                1) were chimeric reads an issue here as well? And if so, what changes (if any) can be made to sample/library prep to minimize them?
                As a check you should look at where you are loosing bases in the preassembly, are the preassembled reads getting shorter (Pre-Assembled Read Length <<< Seed Cutoff), or are seed reads not getting corrected at all (Number of Corrected Reads <<< Seed Reads). If seed reads are getting heavily truncated that would indicate that the longest reads don't have support over their length and are possibly chimeric, given that you have a reasonable assembly (reference) you could check this by running Bridgemapper (compare the number of bridged reads / bridge-distance with one of the high yield samples). If the number of corrected reads drops then that is an indication of a lower coverage contamination within the sample (lower coverage data will not correct), you can check this by looking at the number of unmapped reads in the resequencing / polishing step.
                A high number of chimers could be caused by a low adapter concentration in the adapter ligation step.
                2) could I adjust the target coverage to, say, 50x, forcing the length cutoff to be lowered knowing that self-correction will have low yield, such that I end up assembling with 25x?
                Manually setting the seed length lower, or reducing the estimated genome size will have the effect of increasing the coverage going into preassembly. Even with low yield so long as >12x coverage goes into CA I generally see good results.
                3) are there any other HGAP or PBcR parameters that could be tweaked if I know that raw read coverage is sufficient?
                It is possible to increase the yield by reducing the coverage required for correction, this generally allows more anomalous data through preassembly and will result in miss-asemblies, but it can be a useful diagnostic tool.

                Comment


                • #9
                  There does appear to be some "contamination" in the unmapped subreads (15% for the low yield genome vs. 8% for the genome with high preassembled yield). Some of these reads look chimeric, but overall the fraction of chimeric reads is similar for both genomes (0.66% for low yield, 0.4% for high yield) - thanks to this script: http://www.cbcb.umd.edu/software/pbcr/

                  If the dataset has a normal number of chimeric reads, what else might lead to the longest reads not having support across their length?

                  With Bridgemapper output in SMRTview, what's the difference between primary, prolog, and epilog? And by default any bridged reads are not used for consensus calling with quiver due because they ambiguously map, correct?

                  Comment


                  • #10
                    If the dataset has a normal number of chimeric reads, what else might lead to the longest reads not having support across their length?
                    Given a homogeneous sample, coverage variation either stochastic, or due to the mapping of a repetitive element.

                    With Bridgemapper output in SMRTview, what's the difference between primary, prolog, and epilog? And by default any bridged reads are not used for consensus calling with quiver due because they ambiguously map, correct?
                    Blasr makes the best local alignment of the subread, this can leave a prolog and epilog sequence that is not aligned. Bridgemapper takes these sequences and independently maps and aligns them. Incorrect a bridgemapped subread is not mapped ambiguously (local alignment) it just has portions that map to different places, they are used in calling quiver consensus.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 08:47 AM
                    0 responses
                    14 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    54 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X