**afitz** · 11-14-2013, 12:21 PM

"So you want a program that will parse a database of curated reference genome sequences based on user input, then extract a subset of those genomes from a subset of those reference genomes?"

I don't necessarily need to extract a subset of the genomes - I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database. Thanks for your question!

**gringer** · 11-14-2013, 08:36 PM

"I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database."

This sounds like too specific a task for pre-existing code, but that doesn't mean someone else hasn't thought along the same lines in the past and written their own solution. Traversing the NCBI taxonomy is somewhat difficult, but doable. You'd probably be working off the taxonomy data from here:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

In particular, nodes.dmp and names.dmp inside taxdump to get/parse the tree, and gi_taxid_nucl.dmp to get GenBank GI number to taxID mappings. I guess the trick will be to filter those IDs so that you only keep full chromosomal (or contig) sequences, rather than subsets of sequence.
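As a rough illustration of the tree traversal, here is a minimal Python sketch that builds a parent-to-children map from nodes.dmp-style lines and collects every taxID below a chosen node. It assumes the usual taxdump layout (fields separated by tab, pipe, tab, with the taxID and parent taxID in the first two columns). The function names and the toy IDs are made up for the example and are not real NCBI taxIDs.

```python
from collections import defaultdict

# Toy stand-in for nodes.dmp (hypothetical IDs, only three columns shown).
TOY_NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tsuperkingdom\t|\n",
    "11\t|\t10\t|\tphylum\t|\n",
    "12\t|\t10\t|\tphylum\t|\n",
    "13\t|\t11\t|\tspecies\t|\n",
    "20\t|\t1\t|\tsuperkingdom\t|\n",
]


def build_children(node_lines):
    """Map each parent taxID to the list of its direct child taxIDs."""
    children = defaultdict(list)
    for line in node_lines:
        fields = line.rstrip("\n").split("\t|\t")
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as parent
            children[parent_id].append(tax_id)
    return children


def descendants(children, root):
    """Return the set of all taxIDs below root (root itself excluded)."""
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


if __name__ == "__main__":
    tree = build_children(TOY_NODES)
    print(sorted(descendants(tree, 10)))
```

With the real file you would pass an open file handle for nodes.dmp instead of TOY_NODES, and look up the starting taxID for a name such as Bacteria in names.dmp.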

Once you have NCBI accession numbers, you can retrieve the IDs and sequences using eSearch and eFetch:

Sample Applications of the E-utilities - Entrez Programming Utilities Help - NCBI Bookshelf

http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.ESearch__ESummaryEFetch

This chapter presents several examples of how the E-utilities can be used to build useful applications. These examples use Perl to create the E-utility pipelines, and assume that the LWP::Simple module is installed. This module includes the get function that supports HTTP GET requests. One example (Application 4) uses an HTTP POST request, and requires the LWP::UserAgent module. In Perl, scalar variable names are preceded by a "$" symbol, and array names are preceded by a "@". In several instances, results will be stored in such variables for use in subsequent E-utility calls. The code examples here are working programs that can be copied to a text editor and executed directly. Equivalent HTTP requests can be constructed in many modern programming languages; all that is required is the ability to create and post an HTTP request.

You do need to be a bit careful when extracting tons of sequence with eFetch, because it has a maximum limit on the sequences that it will return in one request (something like 10,000).
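One way to stay under a per-request limit is to split the ID list into batches and make one eFetch call per batch. The sketch below only builds the request URLs and does not contact NCBI. The batch size of 200 is an assumption picked for illustration, and the IDs are placeholders; db, id, rettype and retmode are the usual eFetch parameters.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_urls(ids, batch_size=200, db="nuccore"):
    """Build one FASTA eFetch URL per batch of IDs (batch_size is an assumption)."""
    urls = []
    for chunk in batches(list(ids), batch_size):
        query = urlencode({
            "db": db,
            "id": ",".join(chunk),
            "rettype": "fasta",
            "retmode": "text",
        })
        urls.append(f"{EFETCH_URL}?{query}")
    return urls


if __name__ == "__main__":
    placeholder_ids = [f"ID{i}" for i in range(450)]  # not real identifiers
    for url in efetch_urls(placeholder_ids):
        print(len(url), url[:80])
```

Each URL can then be fetched in turn, with a pause between requests so you don't hammer the server.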

Another problem for you will be what you mean by "random". The NCBI taxa aren't very well structured, so you will be getting quite a biased sample (i.e. weighted heavily on the more researched organisms) by picking sequences using a uniform distribution.
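One possible way to reduce that bias, sketched here under my own assumptions, is to sample in two stages: first pick taxa uniformly, then pick one sequence within each chosen taxon. A heavily sequenced organism then counts once instead of thousands of times. The function name, the taxon labels and the sequence IDs are all hypothetical, and this is only one of several reasonable meanings of "random".

```python
import random


def sample_by_taxon(ids_by_taxon, n, seed=None):
    """Pick n distinct taxa uniformly, then one sequence ID from each."""
    rng = random.Random(seed)
    taxa = sorted(ids_by_taxon)  # sorted so a fixed seed is reproducible
    if n > len(taxa):
        raise ValueError("asked for more taxa than are available")
    chosen = rng.sample(taxa, n)
    return {taxon: rng.choice(ids_by_taxon[taxon]) for taxon in chosen}


if __name__ == "__main__":
    # Toy data: taxon_A is heavily over-represented (made-up IDs).
    toy = {
        "taxon_A": [f"A{i}" for i in range(1000)],
        "taxon_B": ["B0", "B1"],
        "taxon_C": ["C0"],
    }
    print(sample_by_taxon(toy, 2, seed=1))
```

Picking uniformly from the pooled list of 1003 IDs would return taxon_A nearly every time, whereas here each taxon has the same chance of being chosen.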

Obtaining Random Sequences from Given Taxonomic Grouping
