  • Whole assembly vs Golden path length in Ensembl?

    This has been puzzling me for a while and I haven't been able to find a good answer for this. On the Ensembl website (ensembl.org/) you can find reference genomes for various species. What confuses me is they have a "base pairs" assembly statistic and a "golden path length" statistic.

    After reading the FAQ/glossary, I get that the "base pairs" statistic is the whole assembly, including redundant regions and haplotypes, and that "golden path length" is the reference without these regions. But I've found that the "base pairs" statistic is often much shorter than the "golden path length". For example, the cod assembly's "base pairs" stat is 608 Mb, while the "golden path length" is 832 Mb. How is that possible? Is anyone familiar with this database? Thanks much.

  • #2
    I presume you are talking about this FAQ/Help:

    (Species Home Page) Base Pairs (whole assembly)

    The total number of base pairs for the entire assembly is the sum of all sequences in the dna table of the core database. It is available from the species-specific home page. This includes redundant regions such as haplotypic sequences and the pseudo-autosomal region (PAR) of the Y chromosome in human, and gaps in Drosophila melanogaster. See the assembly details of each species for more information.

    (Species Home Page) Golden Path

    The "golden path" is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs.
    Note that the information is coming from two different places -- the 'dna table' for base pairs, the 'seq_region table' for the golden path.

    Golden paths are usually built up by layering sequence information onto a physical map. They can also be created by combining scaffolds into a "best guess". In either case they are an approximation of what the real genome looks like -- which is what the 'base pairs' assembly tries to reflect. Even the 'base pairs' assembly can be wildly wrong if not enough of the genome has been sequenced. For a poorly sequenced genome I would not be surprised to find 'base pairs' lower than 'golden path', since the golden path will have a lot of gaps between the known sequences while 'base pairs' is a simple sum of base pair counts.
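
To make the gap effect concrete, here is a minimal Python sketch. It is not how Ensembl computes its statistics; it only assumes that scaffold gaps are written as runs of 'N' and uses made-up toy sequences to show why a gapped total can exceed a plain count of sequenced bases.

```python
# Toy scaffolds (made up for illustration): 'N' marks gap positions
# of unknown sequence between the contigs that were actually sequenced.
scaffolds = {
    "scaffold_1": "ACGTACGT" + "N" * 10 + "GGCCTTAA",
    "scaffold_2": "TTGACA" + "N" * 4 + "CCGT",
}


def golden_path_length(seqs):
    """Total scaffold length with gaps included (golden-path-like figure)."""
    return sum(len(s) for s in seqs.values())


def sequenced_base_pairs(seqs):
    """Count only called bases, skipping gap characters (base-pairs-like figure)."""
    return sum(len(s) - s.upper().count("N") for s in seqs.values())


golden = golden_path_length(scaffolds)
bases = sequenced_base_pairs(scaffolds)
print(f"with gaps: {golden} bp, sequenced only: {bases} bp, gaps: {golden - bases} bp")
```

The more fragmented the assembly, the larger the share of 'N' positions, and the further the two figures drift apart.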

    In the cod case Ensembl says that the genome is 0.9 Gb while the golden path is 0.83 Gb and the base pairs are 0.61 Gb. So one or all of those numbers are incorrect. Given that cod has only been sequenced to a depth of 25x, I suspect that there is a lot of sequencing yet to be done and that the 'base pairs' number will eventually be raised. Ensembl goes on to say "... Owing to the fragmentary nature of the Atlantic cod assembly ..." which just reinforces the fact that not all of the base pairs are known.
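
As a rough check on the cod figures quoted in this thread, the arithmetic can be sketched in Python. The variable names are my own, and treating the whole difference between the two statistics as scaffold gaps is an assumption that follows from the explanation above, not something Ensembl states.

```python
# Figures quoted in this thread, in megabases (Mb).
golden_path_mb = 832        # "golden path length"
base_pairs_mb = 608         # "base pairs" statistic
estimated_genome_mb = 900   # the 0.9 Gb genome size quoted from Ensembl

# Assumption: the whole difference is gap sequence between contigs.
gap_mb = golden_path_mb - base_pairs_mb
gap_fraction = gap_mb / golden_path_mb

# Share of the estimated genome covered by actual sequenced bases.
sequenced_fraction = base_pairs_mb / estimated_genome_mb

print(f"implied gaps: {gap_mb} Mb ({gap_fraction:.1%} of the golden path)")
print(f"sequenced bases cover {sequenced_fraction:.1%} of the estimated genome")
```

Under that assumption, roughly a quarter of the golden path would be gap, which fits a fragmentary assembly.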

    Hope this helps.



    • #3
      Thanks, that really helps. I've recently de novo assembled a fish genome, so I'm trying to compare how close others have come to assembling a complete genome (with respect to the theoretical genome size based on C value). I didn't realize that the "base pair" statistic doesn't include gaps from scaffolds, so it makes sense that (especially in a highly repetitive genome such as in fish) it would be much lower than the "golden path".
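
For the comparison against C value mentioned above, a small Python sketch follows. It assumes the commonly used conversion of about 978 Mb per picogram of DNA; the function names and the example inputs are hypothetical, not measurements from any real species.

```python
# Commonly used conversion: 1 pg of DNA is about 978 Mb.
MB_PER_PG = 978.0


def c_value_to_mb(c_value_pg):
    """Convert a haploid C value in picograms to an expected genome size in Mb."""
    return c_value_pg * MB_PER_PG


def completeness(assembled_mb, c_value_pg):
    """Fraction of the C-value-based genome size covered by the assembly."""
    return assembled_mb / c_value_to_mb(c_value_pg)


# Hypothetical inputs: a 1.0 pg C value and 700 Mb of ungapped assembled sequence.
expected_mb = c_value_to_mb(1.0)
fraction = completeness(700.0, 1.0)
print(f"expected genome: {expected_mb:.0f} Mb, assembled fraction: {fraction:.1%}")
```

Passing the ungapped "base pairs" figure rather than the gapped length keeps runs of N from inflating the completeness estimate.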

      Thanks for your help!
