
  • Whole assembly vs Golden path length in Ensembl?

    This has been puzzling me for a while and I haven't been able to find a good answer for this. On the Ensembl website (ensembl.org/) you can find reference genomes for various species. What confuses me is they have a "base pairs" assembly statistic and a "golden path length" statistic.

    After reading the FAQ/glossary, I get that the "base pairs" statistic is the whole assembly including redundant regions and haplotypes, and "golden path length" is the reference without these regions. But what I've found is that the "base pairs" statistic is often much smaller than the "golden path length". For example, the cod assembly's "base pairs" stat is 608 Mb, and the "golden path length" is 832 Mb. How is that possible? Is anyone familiar with this database? Thanks much.

  • #2
    I presume you are talking about this FAQ/Help:

    (Species Home Page) Base Pairs (whole assembly)

    The total number of base pairs for the entire assembly is the sum of all sequences in the dna table of the core database. It is available from the species-specific home page. This includes redundant regions such as haplotypic sequences and the pseudo-autosomal region (PAR) of the Y chromosome in human, and gaps in Drosophila melanogaster. See the assembly details of each species for more information.

    (Species Home Page) Golden Path

    The "golden path" is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs.
    Note that the information comes from two different places: the 'dna' table for base pairs and the 'seq_region' table for the golden path.

    Golden paths are usually built by layering sequence information onto a physical map. They can also be created by joining scaffolds into a "best guess". In either case they approximate what the real genome looks like, which is also what the 'base pairs' figure tries to reflect. Even the 'base pairs' figure can be wildly wrong if not enough of the genome has been sequenced. For a poorly sequenced genome I would not be surprised to find 'base pairs' lower than 'golden path': the golden path includes many gaps between the known sequences, while 'base pairs' is a simple sum of the sequenced bases.
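
    To make the distinction concrete, here is a minimal sketch. It assumes gaps inside scaffolds are written as runs of N, and the toy scaffolds and the function name are made up for illustration. It is not how Ensembl actually computes these statistics (those come from the database tables mentioned above); it only shows why a gap-inclusive length can exceed a plain count of known bases.

```python
def assembly_lengths(scaffolds):
    """Return (gap_inclusive_length, known_bases) for a dict of name -> sequence.

    gap_inclusive_length plays the role of the "golden path" here;
    known_bases plays the role of "base pairs". Both are illustrative stand-ins.
    """
    total = sum(len(seq) for seq in scaffolds.values())
    # Assumption: every N is gap padding between contigs, not a real ambiguous base.
    gaps = sum(seq.upper().count("N") for seq in scaffolds.values())
    return total, total - gaps


# Hypothetical toy scaffolds: contigs joined by N-filled gaps.
toy_scaffolds = {
    "scaffold_1": "ACGT" + "N" * 6 + "GGCA",
    "scaffold_2": "TTGACA" + "N" * 4,
}

golden_like, bases_like = assembly_lengths(toy_scaffolds)
print(f"gap-inclusive length: {golden_like}, known bases: {bases_like}")
```

    The more fragmented the assembly, the more of each scaffold is gap padding, and the wider the two numbers drift apart.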

    In the cod case Ensembl says that the genome is 0.9 Gb, while the golden path is 0.83 Gb and the base pairs are 0.61 Gb. So one or all of those numbers are incorrect. Given that cod has only been sequenced to a depth of 25x, I suspect that there is a lot of sequencing yet to be done and that the 'base pairs' number will eventually go up. Ensembl goes on to say "... Owing to the fragmentary nature of the Atlantic cod assembly ..." which just reinforces the fact that not all of the base pairs are known.
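
    As a rough back-of-the-envelope check on the cod figures quoted in this thread, the sketch below treats the whole difference between golden path and base pairs as gap sequence. That is an assumption following the explanation above, not something Ensembl states, and the variable names are mine.

```python
# Figures quoted in this thread, in megabases (Mb).
genome_estimate_mb = 900.0   # Ensembl's stated genome size, 0.9 Gb
golden_path_mb = 832.0       # "golden path length"
base_pairs_mb = 608.0        # "base pairs"

# Assumption: the entire difference is gap padding inside scaffolds.
gap_mb = golden_path_mb - base_pairs_mb
gap_fraction = gap_mb / golden_path_mb

# Share of the estimated genome covered by known bases.
sequenced_fraction = base_pairs_mb / genome_estimate_mb

print(f"implied gaps: {gap_mb:.0f} Mb ({gap_fraction:.1%} of the golden path)")
print(f"known bases cover {sequenced_fraction:.1%} of the estimated genome")
```

    Under that assumption, roughly a quarter of the golden path would be gaps, which fits a fragmentary assembly.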

    Hope this helps.

    • #3
      Thanks, that really helps. I've recently de novo assembled a fish genome, so I'm trying to compare how close others have come to assembling a complete genome (with respect to the theoretical genome size based on C value). I didn't realize that the "base pair" statistic doesn't include gaps from scaffolds, so it makes sense that (especially in a highly repetitive genome such as in fish) it would be much lower than the "golden path".

      Thanks for your help!
