  • Obtaining Random Sequences from Given Taxonomic Grouping

    Hello,

    I apologize if this question is too simple; I am new to bioinformatics and am trying to complete my first independent project. I am trying to retrieve DNA sequences from a set of random organisms within a given taxonomic group. For example, I want to be able to input "Mammalia" and retrieve a subset of, say, 5 mammalian genomes. I have been looking into the NCBI resources, including the taxdump files, the Taxonomy database, and RefSeq, but am struggling to put these resources together in order to traverse a taxonomy and retrieve random sequences from different taxonomic levels.

    Any hints on how/where to begin would be appreciated so much! Thank you!!

  • #2
    "So you want a program that will parse a database of curated reference genome sequences based on user input, then extract a subset of those genomes from a subset of those reference genomes?"

    I don't necessarily need to extract a subset of the genomes - I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database. Thanks for your question!

    • #3
      "I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database."

      This sounds like too specific a task for pre-existing code, but that doesn't mean someone else hasn't thought along similar lines in the past and made their own solution. Traversing the NCBI taxonomy is somewhat difficult, but doable. You'd probably be working off the taxonomy data from here:

      ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

      In particular, use nodes.dmp and names.dmp inside taxdump to get/parse the tree, and gi_taxid_nucl.dmp to map GenBank sequence IDs (GI numbers) to taxIDs. I guess the trick will be to filter those sequence IDs so that you keep only full chromosomal (or contig) sequences, rather than partial sequences.
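As a rough illustration of the tree-parsing step, here is a minimal Python sketch. It assumes the usual taxdump layout, where fields are separated by a tab, a pipe, and a tab, nodes.dmp starts with the taxID and its parent taxID, and names.dmp holds the taxID, the name, a unique name, and a name class. The function names (`split_dmp`, `load_children`, `find_taxid`, `descendants`) are made up for this example and are not part of any NCBI tool.

```python
from collections import defaultdict


def split_dmp(line):
    """Split one taxdump line into its fields (assumed "\t|\t" separated, "\t|" terminated)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_children(nodes_lines):
    """Build a parent taxID -> list of child taxIDs map from nodes.dmp lines."""
    children = defaultdict(list)
    for line in nodes_lines:
        tax_id, parent_id = split_dmp(line)[:2]
        if tax_id != parent_id:  # the root node lists itself as its own parent
            children[parent_id].append(tax_id)
    return children


def find_taxid(names_lines, name):
    """Return the taxID whose scientific name equals `name`, or None if absent."""
    for line in names_lines:
        fields = split_dmp(line)
        if fields[1] == name and fields[3] == "scientific name":
            return fields[0]
    return None


def descendants(children, root):
    """Collect every taxID below `root` with an iterative depth-first walk."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            found.append(child)
            stack.append(child)
    return found


# Hypothetical usage, assuming the taxdump archive was unpacked in the working directory:
#   with open("nodes.dmp") as nodes, open("names.dmp") as names:
#       children = load_children(nodes)
#       mammal_id = find_taxid(names, "Mammalia")
#   mammal_taxids = descendants(children, mammal_id)
```

The resulting list of taxIDs is what you would then match against the sequence-to-taxID mapping file to find candidate sequences.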

      Once you have NCBI accession numbers, you can retrieve the IDs and sequences using eSearch and eFetch:

      This chapter presents several examples of how the E-utilities can be used to build useful applications. These examples use Perl to create the E-utility pipelines, and assume that the LWP::Simple module is installed. This module includes the get function that supports HTTP GET requests. One example (Application 4) uses an HTTP POST request, and requires the LWP::UserAgent module. In Perl, scalar variable names are preceded by a "$" symbol, and array names are preceded by a "@". In several instances, results will be stored in such variables for use in subsequent E-utility calls. The code examples here are working programs that can be copied to a text editor and executed directly. Equivalent HTTP requests can be constructed in many modern programming languages; all that is required is the ability to create and post an HTTP request.


      You do need to be a bit careful when extracting tons of sequence with eFetch, because it has a limit on the number of sequences it will return in one request (something like 10,000).
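One way to stay under the per-request limit is to split the ID list into batches and build one eFetch URL per batch. The sketch below only constructs the URLs and makes no network calls. The batch size of 200 is an arbitrary assumption, not an NCBI recommendation, and `batched` and `efetch_urls` are hypothetical helper names.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batched(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_urls(ids, batch_size=200, db="nuccore"):
    """Build one FASTA eFetch URL per batch of sequence IDs."""
    urls = []
    for batch in batched(ids, batch_size):
        query = urlencode({
            "db": db,
            "id": ",".join(batch),  # comma-separated ID list; urlencode escapes the commas
            "rettype": "fasta",
            "retmode": "text",
        })
        urls.append(f"{EFETCH_URL}?{query}")
    return urls
```

Each URL can then be fetched with something like `urllib.request.urlopen`, with a short pause between requests so you stay within NCBI's usage limits.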

      Another problem for you will be what you mean by "random". The NCBI taxa aren't very well structured, so if you pick sequences using a uniform distribution you will get quite a biased sample (i.e. weighted heavily toward the more researched organisms).
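One possible way to reduce that bias, offered only as a sketch and not as the definitive answer to what "random" should mean here, is to sample in two stages: pick taxa uniformly first, then pick one sequence within each chosen taxon. The input is assumed to be a dictionary mapping each taxID to the sequence IDs assigned to it, and `sample_by_taxon` is a hypothetical name.

```python
import random


def sample_by_taxon(seqs_by_taxon, k, seed=None):
    """Pick up to k taxa uniformly at random, then one sequence ID from each.

    Heavily sequenced organisms get the same chance as any other taxon,
    instead of dominating a draw made directly over all sequences.
    """
    rng = random.Random(seed)
    # Ignore taxa with no sequences; sort so that a fixed seed gives a repeatable draw.
    taxa = sorted(taxon for taxon, seqs in seqs_by_taxon.items() if seqs)
    chosen = rng.sample(taxa, min(k, len(taxa)))
    return {taxon: rng.choice(seqs_by_taxon[taxon]) for taxon in chosen}
```

This still only covers taxa that have at least one sequence in the database, so the sample remains limited to what has been sequenced at all.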
