SEQanswers

Old 07-31-2014, 09:30 AM   #1
jwag
Member
 
Location: USA

Join Date: Apr 2013
Posts: 42
Whole assembly vs Golden path length in Ensembl?

This has been puzzling me for a while and I haven't been able to find a good answer for this. On the Ensembl website (ensembl.org/) you can find reference genomes for various species. What confuses me is they have a "base pairs" assembly statistic and a "golden path length" statistic.

After reading the FAQ/glossary, I understand that the "base pairs" statistic covers the whole assembly, including redundant regions and haplotypes, while the "golden path length" is the reference without these regions. Yet I've found that the "base pairs" statistic is often much smaller than the "golden path length". For example, the cod assembly's "base pairs" statistic is 608 Mb and its "golden path length" is 832 Mb. How is that possible? Is anyone familiar with this database? Thanks much.
Old 08-01-2014, 08:53 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

I presume you are talking about this FAQ/Help:

Quote:
(Species Home Page) Base Pairs (whole assembly)

The total number of base pairs for the entire assembly is the sum of all sequences in the dna table of the core database. It is available from the species-specific home page. This includes redundant regions such as haplotypic sequences and the pseudo-autosomal region (PAR) of the Y chromosome in human, and gaps in Drosophila melanogaster. See the assembly details of each species for more information.

(Species Home Page) Golden Path

The "golden path" is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs.

Note that the information is coming from two different places -- the 'dna table' for base pairs, the 'seq_region table' for the golden path.

Golden paths are usually built up by layering sequence information onto a physical map. They can also be created by combining scaffolds into a "best guess". In either case they are an approximation of what the real genome looks like, which is also what the 'base pairs' assembly tries to reflect. Even the 'base pairs' figure can be wildly wrong if not enough of the genome has been sequenced. For a poorly sequenced genome I would not be surprised to find 'base pairs' smaller than 'golden path', since the golden path will contain a lot of gaps between the known sequences while 'base pairs' is a simple sum of base-pair counts.
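
To make the difference concrete, here is a minimal Python sketch. It assumes scaffold gaps are stored as runs of N, which is the usual convention in scaffold FASTA files; it illustrates the idea and is not how Ensembl computes its statistics. The function name and the toy scaffolds are made up for illustration.

```python
def assembly_lengths(scaffolds):
    """Return (gapped_length, ungapped_bases) for a dict of name -> sequence.

    gapped_length counts every position, including N placeholders for gaps
    (analogous to a golden path length); ungapped_bases counts only called
    bases (analogous to a simple sum of sequenced base pairs).
    """
    gapped = 0
    ungapped = 0
    for seq in scaffolds.values():
        gapped += len(seq)
        ungapped += len(seq) - seq.upper().count("N")
    return gapped, ungapped


# Toy scaffolds (hypothetical): contigs joined by gaps of estimated size.
toy = {
    "scaffold_1": "ACGT" + "N" * 10 + "GGCC",
    "scaffold_2": "TTAA" + "N" * 5 + "ACGTAC",
}

gapped, ungapped = assembly_lengths(toy)
print(gapped, ungapped)  # 33 18
```

The more fragmented the assembly, the more of each scaffold is gap, and the further the two numbers drift apart.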

In the cod case Ensembl says that the genome is 0.9 Gb while the golden path is 0.83 Gb and the base pairs are 0.61 Gb. So one or all of those numbers are incorrect. Given that cod has only been sequenced to a depth of 25x, I suspect that there is a lot of sequencing yet to be done and that the 'base pairs' number will eventually rise. Ensembl goes on to say "... Owing to the fragmentary nature of the Atlantic cod assembly ..." which reinforces the point that not all of the base pairs are known.
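
As a rough back-of-the-envelope check on the cod figures quoted in this thread (608 Mb base pairs, 832 Mb golden path, roughly 900 Mb estimated genome), the sketch below works out the proportions. It assumes the whole difference between golden path and base pairs is gap, which is a simplification, and the function name is just for illustration.

```python
def gap_summary(base_pairs_mb, golden_path_mb, estimated_genome_mb):
    """Summarise how much of a gapped assembly is actual sequence."""
    gap_mb = golden_path_mb - base_pairs_mb
    return {
        "gap_mb": gap_mb,
        # Share of the golden path that is gap rather than called bases.
        "gap_fraction": gap_mb / golden_path_mb,
        # Share of the estimated genome covered by called bases.
        "sequenced_fraction": base_pairs_mb / estimated_genome_mb,
    }


cod = gap_summary(base_pairs_mb=608, golden_path_mb=832, estimated_genome_mb=900)
for key, value in cod.items():
    print(f"{key}: {value:.3f}")
```

Under that assumption about 224 Mb, a bit over a quarter of the golden path, would be gap, and the called bases would cover roughly two thirds of the estimated genome.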

Hope this helps.
Old 08-01-2014, 10:01 AM   #3
jwag
Member
 
Location: USA

Join Date: Apr 2013
Posts: 42

Thanks, that really helps. I've recently de novo assembled a fish genome, so I'm trying to compare how close others have come to assembling a complete genome (with respect to the theoretical genome size based on C value). I didn't realize that the "base pairs" statistic doesn't include the gaps in scaffolds, so it makes sense that (especially in a highly repetitive genome such as a fish genome) it would be much lower than the "golden path".

Thanks for your help!
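
For the C-value comparison, a minimal sketch follows. It uses the commonly cited conversion of 1 pg of DNA to about 978 Mb; the C-value and assembly size in the example are hypothetical placeholders, not measurements for any real species.

```python
BP_PER_PG = 0.978e9  # commonly used conversion: 1 pg of DNA is about 978 Mb


def genome_size_bp(c_value_pg):
    """Estimate haploid genome size in base pairs from a C-value in picograms."""
    return c_value_pg * BP_PER_PG


def completeness(ungapped_bp, c_value_pg):
    """Fraction of the estimated genome covered by called (non-gap) bases."""
    return ungapped_bp / genome_size_bp(c_value_pg)


# Hypothetical example: a 1.0 pg C-value and 700 Mb of ungapped sequence.
example_c_value_pg = 1.0
example_ungapped_bp = 700e6
print(f"{completeness(example_ungapped_bp, example_c_value_pg):.2f}")
```

Using the ungapped base count here, rather than the gapped scaffold length, avoids counting placeholder gaps as assembled sequence.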