
  • How to find all gi's for an organism in the nucleotide database?

    I am trying to run a command that will give me a list of all gi's for a specific taxid. The idea is to eventually use the output.txt as an exclusion list for a blastn procedure.

    I am basing this off of the documentation cookbook found here.

    I reproduce the interesting cookbook command here below:
    Code:
    blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \
    tee exclude_me
    Suppose I want to find all the gi's associated with a given organism in the nt database. I have something like the following:

    Code:
    blastdbcmd -db nr -entry all -outfmt "%g %T" | awk '{ if ($2 == 7227) {print $1} }'
    I began the job on my local cluster, but I cannot help thinking that I might be making a mistake. My hope is that the above command will give me all the gi's associated with Drosophila (taxid = 7227). Does this look right to you all?
    Last edited by hlyates; 04-13-2015, 08:52 AM.
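
For reference, the filtering step in the command above can be wrapped in a small shell function so it can be checked on a few hand-written lines before it is run against a whole database. This is only a sketch: the name gi_for_taxid is made up for illustration, and it assumes blastdbcmd prints one "gi taxid" pair per line for the '%g %T' format.

```shell
# Print the GI (column 1) of every "gi taxid" line whose taxid matches $1.
# Reads from stdin so it can sit at the end of any pipeline.
gi_for_taxid() {
    awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Intended use (not run here; needs BLAST+ and a local database):
# blastdbcmd -db nr -entry all -outfmt '%g %T' | gi_for_taxid 7227 > exclude_me
```

Passing the taxid in with -v keeps it out of the awk program text, so the same function works for 7227, 6942, or any other id.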

  • #2
    Did the above work? That is probably the way to exclude GI's that are in the nr db.

    One other way to get the list of gi's would be to do a taxonomy browser based search (e.g. http://www.ncbi.nlm.nih.gov/nuccore/?term=txid7227[Organism:noexp], replace the taxid with one for organism of interest, example is for fly). Click on display settings and then choose "gi list" in format column. Send the result to a file.

    • #3
      Assuming the BLAST nr database contains all the GI entries, your approach looks viable.

      You could also try using the NCBI Entrez interface, although that has complications too, e.g. http://blastedbio.blogspot.co.uk/201...-chimeras.html

      • #4
        Originally posted by GenoMax View Post
        Did the above work? That is probably the way to exclude GI's that are in the nr db.

        One other way to get the list of gi's would be to do a taxonomy browser based search (e.g. http://www.ncbi.nlm.nih.gov/nuccore/?term=txid7227[Organism:noexp], replace the taxid with one for organism of interest, example is for fly). Click on display settings and then choose "gi list" in format column. Send the result to a file.
        I need to do this on organism 6942, but I am getting no results on http://www.ncbi.nlm.nih.gov/nuccore/?term=txid6942. Okay, this is where it gets weird, I can exclude it from a blastn search online. See attachment.

        So why does 6942 not show up in my browser search when it does show up in the blastn exclusion? I really need that list of gi's for 6942, and I am very confused about why I'm getting this behavior.

        Can anyone help me figure out how to find the gi's for 6942? I know the script command I wrote works, but I am completely blindsided by the above behavior.

        • #5
          You need to search in the organism field, not the default search. Try:

          Code:
          http://www.ncbi.nlm.nih.gov/nuccore/?term=txid6942[orgn]
          (the square bracketed orgn, short for organism, should be part of the URL or search text)
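
The same organism-field query can probably be scripted through NCBI's E-utilities instead of the browser. A minimal sketch, assuming the esearch endpoint and the nuccore database name; esearch_url is a hypothetical helper, and retmax=500 is an arbitrary page size picked for illustration:

```shell
# Build an E-utilities esearch URL restricted to the organism field.
# $1 = taxid. The square brackets around "orgn" are percent-encoded
# as %5B and %5D so the URL survives tools that are strict about them.
esearch_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=txid%s%%5Borgn%%5D&retmax=500\n' "$1"
}

# Intended use (not run here; needs network access):
# curl -s "$(esearch_url 6942)" | grep -o '<Id>[0-9]*</Id>' | tr -d '<Id>/'
```

Whether the returned ids are the gi numbers you need for a BLAST exclusion list is worth checking against a record you already know.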

          • #6
            Originally posted by maubp View Post
            You need to search in the organism field, not the default search. Try:

            Code:
            http://www.ncbi.nlm.nih.gov/nuccore/?term=txid6942[orgn]
            (the square bracketed orgn, short for organism, should be part of the URL or search text)
            Are you aware of an organism option for the blastdbcmd so that I will not be returning blank answers for my script command as well sir? I thought -entry all would take care of this problem? But I am getting back blank results with my commandline based approach. Why am I getting different results on the browser versus the commandline approach? I know my syntax is correct, but something is still missing for finding taxid=6942?
            Last edited by hlyates; 04-13-2015, 08:52 AM.

            • #7
              I know you are doing things this way because you have been specifically asked to do them this way. You could save a whole lot of time/effort by using BBSplit and tick sequences in a file that you want to exclude. Effort you could put towards something more useful. Could this be used to argue a case?

              • #8
                This should get you all entries that have "Amblyomma" in name from nr. You should be able to get the gi's you need from the headers.

                Code:
                $ blastdbcmd -db nr -entry all | grep "Amblyomma" > filename
                Last edited by GenoMax; 04-13-2015, 10:01 AM.

                • #9
                  Originally posted by GenoMax View Post
                  This should get you all entries that have "Amblyomma" in name from nr. You should be able to get the gi's you need from the headers.

                  Code:
                  $ blastdbcmd -db nr -entry all | grep "Amblyomma" > filename
                  The docs say this should be possible

                  I just don't understand why this is having so much trouble grabbing the gi and taxonomy info only and then letting me choose the taxid like they state in the docs I just provided. According to it, "%g %t" should get me going. I wonder, are these options case sensitive? I originally had it as "%g %T" which is what was on another ncbi official doc. This could be going down a rabbit hole.

                  Let's come full circle. It is a mystery why I have to specify organism and taxid on the browser search. I was wondering if there was a similar technique for the commandline. I guess not? I'll go with the grep approach if I have to, but that means I will have to write a python script to throw away everything except the gi.
                  Last edited by hlyates; 04-13-2015, 01:24 PM.
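
Throwing away everything except the gi may not need a Python script. A rough sketch with grep and cut, assuming the legacy header style ">gi|12345|ref|XP_...| description [Organism]"; the function name gi_from_headers is made up for illustration:

```shell
# Extract every GI number found in FASTA header lines on stdin.
# Prints one GI per line, numerically sorted, duplicates removed.
gi_from_headers() {
    grep -o 'gi|[0-9][0-9]*' | cut -d'|' -f2 | sort -n -u
}

# Intended use (not run here; needs BLAST+ and a local database):
# blastdbcmd -db nr -entry all | grep "Amblyomma" | gi_from_headers > gi_list.txt
```

One caveat: if a header line describes a merged record with several organisms, this picks up every gi on that line, including ones that belong to other species.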

                  • #10
                    I am not sure if the nr database is built to include txid information. So even though the command you have is right (may need single quotes '%g %T') it is not producing any output.

                    To get an authoritative answer email blast tech support @NCBI with this question: [email protected]

                    • #11
                      There is a file, gi_taxid_nucl.dmp.gz, which lists all gi numbers and related taxids, in 2-column format (column 1 is gi number, column 2 is taxid). It's quite useful, but very big.

                      You can get it here:
                      ftp://ftp.ncbi.nih.gov/pub/taxonomy/

                      Although, maybe what you're doing with blast is already equivalent; I'm not really sure.
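
If the dump file route is taken, pulling one taxid out of it could look roughly like this. It is a sketch that relies only on the 2-column layout described above (gi, then taxid); gi_from_dump is a hypothetical name and the output file name is arbitrary:

```shell
# List the GIs for one taxid from a gzipped two-column dump (gi, taxid).
# $1 = path to the .dmp.gz file, $2 = taxid of interest.
# The file is streamed, so it never has to be unpacked on disk.
gi_from_dump() {
    gzip -dc < "$1" | awk -v taxid="$2" '$2 == taxid { print $1 }'
}

# Intended use (not run here; needs the downloaded file):
# gi_from_dump gi_taxid_nucl.dmp.gz 6942 > gi_6942.txt
```

Because the file is very big, a single streaming pass like this is likely cheaper than loading it into a script.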

                      • #12
                        Originally posted by GenoMax View Post
                        I am not sure if the nr database is built to include txid information. So even though the command you have is right (may need single quotes '%g %T') it is not producing any output.

                        To get an authoritative answer email blast tech support @NCBI with this question: [email protected]
                        Thanks. I am going to email them. I'll share what I learn. I know scripts can be picky and so will run it with the single quotes. Thank you kind sir for your assistance.

                        • #13
                          Originally posted by hlyates View Post
                          Are you aware of an organism option for the blastdbcmd so that I will not be returning blank answers for my script command as well sir? I thought -entry all would take care of this problem? But I am getting back blank results with my commandline based approach. Why am I getting different results on the browser versus the commandline approach? I know my syntax is correct, but something is still missing for finding taxid=6942?
                          The NR database contains merged records where the same protein sequence was found in multiple organisms - therefore it will have a primary identifier and secondary identifiers.

                          For these cases you should double check what happens with the taxonomy information - since you'd want to check all the taxonomy ids of the merged record. It might be that blastdbcmd -outfmt %T only gives the first taxonomy id, and so fails to find all your matches?
