  • Extract multiple sequences on ncbi by gene symbol

    Hi all,

    I have a lot of gene symbols belonging to the same organism, and I need to extract the corresponding sequences from NCBI.

    The gene symbols all look like: NMB0001, NMB0010, NMB0015...

    Is there an automated way (scripts, tools, ...) to obtain the FASTA sequences?

    Thanks in advance,
    Giorgio

  • #2
    Download gene_info and gene2accession from NCBI.

    Run these:
    wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
    wget -nc ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
    Then run gunzip on both files so they are uncompressed and readable.

    Convert your gene name(s) to Entrez Gene IDs (second column of gene_info), making sure the taxonomy ID (first column) is 122586 in your case.
    Then convert the Gene IDs to GIs (GenInfo identifiers) or accessions using gene2accession.

    Like this ...

    bash-3.00# grep NMB0001 gene_info
    122586 902103 NMB0001 NMB0001 - - - - acetyltransferase protein-coding - - - - 20120121

    bash-3.00# grep "122586.902103" gene2accession
    122586 902103 - - - AAF40480.1 7225226 AE002098.2 66731897 - - ? -
    122586 902103 PROVISIONAL - - NP_273067.1 15675949 NC_003112.2 77358697 6 497 - -

    Cut out the accession(s) you want, either RNA or protein, and put them into a flat text file as a list.
    You'll want to script this with bash, Perl, C, or command-line utilities, whatever you like, to automate the two basic commands above. Be careful.
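
As a rough sketch of that automation, the two grep steps can be folded into one awk pass over both files. The function name `symbols_to_accessions` and the file `symbols.txt` are made up for illustration. The column positions (tax ID in column 1, Gene ID in column 2, symbol and locus tag in columns 3 and 4 of gene_info, protein accession in column 6 of gene2accession) are read off the example output above and assume the real files are tab-separated, so check them against your own copies. Unlike a plain grep, this matches whole fields, so NMB0001 will not also pick up something like NMB00010.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map gene symbols to accessions using local copies of
# gene_info and gene2accession (both gunzipped, tab-separated).
# Usage: symbols_to_accessions SYMBOLS_FILE GENE_INFO GENE2ACCESSION [TAX_ID] [ACC_COLUMN]
symbols_to_accessions() {
    local symbols=$1 gene_info=$2 gene2acc=$3
    local taxid=${4:-122586}   # taxonomy ID used in this thread
    local acc_col=${5:-6}      # assumed: column 6 = protein accession.version
    awk -F'\t' -v taxid="$taxid" -v col="$acc_col" '
        # File 1: one gene symbol per line
        FILENAME == ARGV[1] { wanted[$1] = 1; next }
        # File 2: gene_info (col 1 tax ID, col 2 Gene ID, col 3 symbol, col 4 locus tag)
        FILENAME == ARGV[2] {
            if ($1 == taxid) {
                if ($3 in wanted)      gene[$2] = $3
                else if ($4 in wanted) gene[$2] = $4
            }
            next
        }
        # File 3: gene2accession; print symbol and accession, skipping empty ("-") entries
        $1 == taxid && ($2 in gene) && $col != "-" { print gene[$2] "\t" $col }
    ' "$symbols" "$gene_info" "$gene2acc"
}
```

Something like `symbols_to_accessions symbols.txt gene_info gene2accession | cut -f2 | sort -u > accessions.txt` would then give the flat accession list. Pass 8 as the fifth argument if you want the genomic nucleotide accession column instead.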

    There's also "Batch Entrez": http://www.ncbi.nlm.nih.gov/sites/batchentrez

    But I've found it to be unreliable: it drops some accessions. Sad but true. You're welcome to try it.

    The absolute best utility for batch downloading is "idfetch": http://man.cx/idfetch%281%29. It's a command-line tool that takes parameters: you give it a file of your accessions or GIs and crank it up. ... downloading ... downloading ... downloading ... voilà ... your file appears, just like you asked for. I keep an old copy from an old NCBI C toolkit around, and apparently it's still available in the ncbi-tools-bin package.

    I ran this on the Linux command line (Debian/Ubuntu):
    sudo apt-get install ncbi-tools-bin
    and idfetch is there! So you can install the ncbi-tools-bin package to get it.

    $ man idfetch | head
    IDFETCH(1) NCBI Tools User's Manual IDFETCH(1)



    NAME
    idfetch - retrieve biological data from the NCBI ID1 server

    SYNOPSIS
    idfetch [-] [-F str] [-G filename] [-Q filename] [-c N] [-d str] [-e N]
    [-f str] [-g N] [-i N] [-l filename] [-n] [-o filename] [-q str] [-s str]
    ....
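
A minimal sketch of the batch download itself, wrapped in a hypothetical function `fetch_fasta`. The flags are as I recall them from the idfetch man page (`-G` for a file of GIs or accessions, `-t 5` for FASTA output, `-o` for the output file). The `-t` option is not visible in the truncated synopsis above, so treat it as an assumption and confirm with `man idfetch` before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around idfetch (from the ncbi-tools-bin package).
# Usage: fetch_fasta ACCESSION_LIST OUTPUT_FASTA
fetch_fasta() {
    local list=$1 out=$2
    # Refuse to run on a missing or empty accession list
    if [ ! -s "$list" ]; then
        echo "fetch_fasta: accession list '$list' is missing or empty" >&2
        return 1
    fi
    # Assumed flags: -G = file of GIs/accessions, -t 5 = FASTA, -o = output file
    idfetch -G "$list" -t 5 -o "$out"
}
```

With the accession list from the earlier step, the call would be something like `fetch_fasta accessions.txt sequences.fasta`.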
    Last edited by Richard Finney; 01-24-2013, 08:56 AM.

    • #3
      Thank you so much for your time and suggestions. Actually I tried 'batch entrez' and as you reported, it's not so good; so you're right about that.

      I really like your last option and I'm gonna try that.

      Thanks again.

      Cheers,
      Giorgio

      • #4
        Hi,
        you might also want to look into the biomaRt package for R.
