  • surprisingly hard: Going from Genbank Accession number to Genome Name

    Hi everybody,

    I wrote an automated FASTQ-based 16S rRNA searcher: you give it your FASTQ and it tells you which 16S sequence matches it best. Although most people should know which genome they sequenced, I enjoy my computer telling me what I did ;-) It may also help you spot contaminants. The code is at https://github.com/beaumontlab/antonie

    However, the Greengenes database (at http://greengenes.secondgenome.com/downloads ) gives me a GenBank accession number, like this:
    Best current guess: Genbank GU198115.1

    But I'd like to show my user "Pseudomonas fluorescens strain LMG 7207 16S ribosomal RNA gene, partial sequence."

    I have found several e-utils queries that work, like http://eutils.ncbi.nlm.nih.gov/entre...ta&retmode=xml

    But these often return the entire genome sequence, which I really don't need! Is there a way to send a limited query that returns only TSeq_defline or TSeq_orgname?

    Or alternatively, is there a database of accession numbers/names that I can download somewhere?

    Thanks!

  • #2
    gi2taxid
    Code:
    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=taxonomy&id=???
    taxid2data
    Code:
    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=???
    You'll probably first need to convert those accessions to GIs.
    savetherhino.org
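
The two lookups in #2 can be chained from a script. Below is a minimal sketch that only builds the two URLs given above, with the `???` placeholder filled in; the function names and the GI value `12345` are hypothetical, and the XML responses are not parsed here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base path taken from the URLs quoted in #2.
EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gi_to_taxid_url(gi):
    """Build the elink URL from #2 that maps a nucleotide GI to a taxonomy ID."""
    query = urlencode({"dbfrom": "nucleotide", "db": "taxonomy", "id": gi})
    return f"{EUTILS}/elink.fcgi?{query}"


def taxid_to_data_url(taxid):
    """Build the efetch URL from #2 that returns the taxonomy record."""
    query = urlencode({"db": "taxonomy", "id": taxid})
    return f"{EUTILS}/efetch.fcgi?{query}"


def fetch(url):
    """Download a URL and return the response body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # 12345 is a hypothetical placeholder, not a real GI from this thread.
    print(gi_to_taxid_url("12345"))
    print(taxid_to_data_url("12345"))
```

Building the query string with `urlencode` keeps the parameters in the same order as the URLs in #2 and escapes any unusual characters in the ID.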


    • #3
      Hi,

      Thanks, this works (once I've figured out the GI!). This URL appears to return the GI for a given accession number: http://eutils.ncbi.nlm.nih.gov/entre...somedomain.com

      And it also delivers a human friendly name!

      EDIT: oops, for some accession numbers (like AM181176.4) it still delivers far too much data ;-(

      Thanks!
      Last edited by berthubert; 01-25-2014, 10:50 AM. Reason: oops, some urls still too much data


      • #4
        Output now looks like this, but for *some* accession numbers, the URL still returns a huge amount of data ;-(

        $ ./16ssearcher gg_13_5.fasta P1-1-35_S5_L001_R*fastq
        -> Best current guess: 1111855 (2024) -> Genbank HQ911364.1
        ORGANISM Pseudomonas fluorescens
        2 potentials out of 240 candidates -> Best current guess: 1111132 (3456) -> Genbank GU437272.1
        ORGANISM uncultured bacterium
        10 potentials out of 2390 candidates -> Best current guess: 1105115 (3457) -> Genbank JF262574.1
        ORGANISM Pseudomonas sp. UYSO19
        158 potentials out of 186282 candidates -> Best current guess: 790134 (3652) -> Genbank HM190225.1
        ORGANISM Pseudomonas marginalis pv. marginalis
        363 potentials out of 264000 candidates -> Best current guess: 589242 (3946) -> Genbank GU198113.1
        ORGANISM Pseudomonas fluorescens
        364 potentials out of 264069 candidates -> Best current guess: 588382 (4362) -> Genbank GU198112.1
        ORGANISM Pseudomonas fluorescens
        368 potentials out of 267000 candidates -> Best current guess: 585665 (4362) -> Genbank GU198115.1
        ORGANISM Pseudomonas fluorescens
        2353 potentials out of 683524 candidates -> Best current guess: 16810 (4644) -> Genbank AF336349.1
        ORGANISM Pseudomonas fluorescens
        3049 potentials out of 999000 candidates -> Best current guess: 3860764 (4980) -> Genbank NC_012660.1
        ORGANISM Pseudomonas fluorescens SBW25
        3431 potentials out of 1166308 candidates -> Best current guess: 4408488 (4980) -> Genbank AM181176.4
        ORGANISM Pseudomonas fluorescens SBW25


        Forging on...


        • #5
          And we have a winner, thanks to John Eargle on #bioinformatics:


          Where AE000520 is the Accession Number. Thanks!
          Last edited by berthubert; 01-25-2014, 12:52 PM. Reason: typo
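
For a compact record that does not include the sequence itself, one candidate is the E-utilities `esummary` endpoint. The sketch below is not necessarily the query referred to in #5. It assumes that `esummary` accepts an accession for `db=nuccore` and returns DocSum XML containing an `Item` element named `Title`. The function names are hypothetical, and `SAMPLE` is a hand-written stand-in for a response (using the description quoted in the first post), not real server output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def summary_url(accession):
    """Build an esummary URL for one nucleotide accession (assumed to be accepted as id)."""
    return ESUMMARY + "?" + urlencode({"db": "nuccore", "id": accession})


def title_from_docsum(xml_text):
    """Return the text of the first <Item Name="Title"> element, or None if absent."""
    root = ET.fromstring(xml_text)
    for item in root.iter("Item"):
        if item.get("Name") == "Title":
            return item.text
    return None


# Hand-written stand-in for a DocSum response; the real layout may differ.
SAMPLE = """<eSummaryResult>
  <DocSum>
    <Item Name="Caption" Type="String">GU198115</Item>
    <Item Name="Title" Type="String">Pseudomonas fluorescens strain LMG 7207 16S ribosomal RNA gene, partial sequence.</Item>
  </DocSum>
</eSummaryResult>"""

if __name__ == "__main__":
    print(summary_url("AE000520"))
    print(title_from_docsum(SAMPLE))
```

Parsing is kept separate from fetching so the title extraction can be checked offline against a saved response before any request is sent.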
