  • Obtaining Random Sequences from Given Taxonomic Grouping

    Hello,

    I apologize if this question is too simple. I am new to bioinformatics and am trying to complete my first independent project. I am trying to retrieve DNA sequences from a set of random organisms within a given taxonomic group. For example, I want to be able to input "Mammalia" and retrieve a subset of, say, 5 mammalian genomes. I have been looking into the NCBI resources, including the taxdump files, the Taxonomy database, and RefSeq, but am struggling to put these resources together in order to traverse a taxonomy and retrieve random sequences from different taxonomic levels.

    Any hints on how/where to begin would be appreciated so much! Thank you!!

  • #2
    "So you want a program that will parse a database of curated reference genome sequences based on user input, then extract a subset of those genomes from a subset of those reference genomes?"

    I don't necessarily need to extract a subset of the genomes - I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database. Thanks for your question!


    • #3
      I just want a way to obtain a random subset of the genomes contained in a given taxonomic category. For example, I would want to be able to ask for Bacteria and receive several bacterial genomes from a sequence database.
      This sounds like too specific a task for pre-existing code, but that doesn't mean someone else hasn't thought along similar lines in the past and made their own solution. Traversing the NCBI taxonomy is somewhat difficult, but doable. You'd probably be working off the taxonomy data from here:

      ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

      In particular, nodes.dmp and names.dmp inside taxdump to get/parse the tree, and gi_taxid_nucl.dmp to get GenBank sequence ID (GI number) to taxID mappings. I guess the trick will be to filter those sequence IDs to only have full chromosomal (or contig) sequences, rather than subsets of sequence.
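As a rough illustration of the tree-parsing step, here is a minimal Python sketch. It assumes the usual taxdump layout, where fields are separated by a tab, a pipe, and a tab, and each row ends with a tab and a pipe. The function names (`parse_dmp_line`, `load_children`, `load_scientific_names`, `descendants`) are made up for this example and are not part of any NCBI tool.

```python
from collections import defaultdict


def parse_dmp_line(line):
    # Taxdump rows look like: field1 <tab>|<tab> field2 <tab>|<tab> ... <tab>|
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_children(nodes_lines):
    # nodes.dmp: column 1 is the taxID, column 2 is the parent taxID.
    children = defaultdict(list)
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def load_scientific_names(names_lines):
    # names.dmp: taxID, name, unique name, name class.
    name_to_id = {}
    for line in names_lines:
        tax_id, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            name_to_id[name] = int(tax_id)
    return name_to_id


def descendants(children, root_id):
    # Iterative depth-first walk collecting every taxID below root_id.
    found, stack = [], [root_id]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            found.append(child)
            stack.append(child)
    return found
```

With the real files, something like `load_children(open("nodes.dmp"))` and `load_scientific_names(open("names.dmp"))` should work, after which `descendants(children, name_to_id["Mammalia"])` would list every taxID under that group.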

      Once you have NCBI accession numbers, you can retrieve the IDs and sequences using eSearch and eFetch:

      This chapter presents several examples of how the E-utilities can be used to build useful applications. These examples use Perl to create the E-utility pipelines, and assume that the LWP::Simple module is installed. This module includes the get function that supports HTTP GET requests. One example (Application 4) uses an HTTP POST request, and requires the LWP::UserAgent module. In Perl, scalar variable names are preceded by a "$" symbol, and array names are preceded by a "@". In several instances, results will be stored in such variables for use in subsequent E-utility calls. The code examples here are working programs that can be copied to a text editor and executed directly. Equivalent HTTP requests can be constructed in many modern programming languages; all that is required is the ability to create and post an HTTP request.
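The linked examples are in Perl, but the same requests can be put together in Python. The sketch below only builds the eSearch and eFetch URLs and does not send them. The helper names and the default `db="nuccore"` are my own choices for illustration; a `nuccore` search by taxon will return many records that are not whole genomes, so extra filtering would still be needed.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(tax_id, db="nuccore", retmax=20, retstart=0):
    # "txid<N>[Organism:exp]" asks Entrez for the taxon and everything below it.
    term = f"txid{tax_id}[Organism:exp]"
    query = urlencode(
        {"db": db, "term": term, "retmax": retmax, "retstart": retstart}
    )
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(ids, db="nuccore"):
    # Request FASTA text for a comma-separated list of record IDs.
    query = urlencode(
        {
            "db": db,
            "id": ",".join(str(i) for i in ids),
            "rettype": "fasta",
            "retmode": "text",
        }
    )
    return f"{EUTILS}/efetch.fcgi?{query}"
```

For example, `esearch_url(40674)` builds a search for Mammalia (taxID 40674), and the IDs that come back can be passed to `efetch_url`.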


      You do need to be a bit careful when extracting tons of sequence with eFetch, because it has a maximum limit on the sequences that it will return in one request (something like 10,000).
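One way to stay under whatever the per-request limit is would be to split the ID list into batches and pause between requests. This is only a sketch: the batch size of 200 and the 0.4 second pause are conservative guesses on my part, and `fetch` is a placeholder for whatever function actually performs the download.

```python
import time


def batches(ids, size=200):
    # Yield consecutive slices of at most `size` IDs.
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(ids, fetch, size=200, pause=0.4):
    # `fetch` is a caller-supplied function that takes a list of IDs
    # and returns the downloaded data for that batch.
    results = []
    for batch in batches(ids, size):
        results.append(fetch(batch))
        time.sleep(pause)  # be polite to the server between requests
    return results
```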

      Another problem for you will be what you mean by "random". The NCBI taxa aren't very well structured, so you will be getting quite a biased sample (i.e. weighted heavily on the more researched organisms) by picking sequences using a uniform distribution.
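To make the bias concrete, here is a small sketch comparing two meanings of "random". It assumes you already have a mapping from clade to genome IDs (the clade names and IDs would be whatever you extracted; nothing here is real data). `sample_uniform` draws from the pooled list, so large clades dominate; `sample_stratified` first picks a clade at random and then a genome within it, which is one possible way to spread the sample out, not the only one.

```python
import random


def sample_uniform(genomes_by_clade, k, rng):
    # Every genome is equally likely, so heavily sequenced clades dominate.
    pool = [g for genomes in genomes_by_clade.values() for g in genomes]
    return rng.sample(pool, min(k, len(pool)))


def sample_stratified(genomes_by_clade, k, rng):
    # Pick a clade uniformly, then a genome within it, without replacement.
    remaining = {c: list(g) for c, g in genomes_by_clade.items() if g}
    picked = []
    while remaining and len(picked) < k:
        clade = rng.choice(sorted(remaining))
        index = rng.randrange(len(remaining[clade]))
        picked.append(remaining[clade].pop(index))
        if not remaining[clade]:
            del remaining[clade]
    return picked
```

Passing in a seeded `random.Random` instance makes a given sample reproducible.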
