  • Extract only sequence ids from fasta file with makeblastdb

    Hi all,
    I'm new to BLAST and am exploring its functions from the command line.
    I already know that to run blastx I first have to index my FASTA database with makeblastdb.
    I have already used BLAST to learn how it works, and I would like the output to contain only the sequence code, not all the information about the sequence (code, description, etc.).
    How can I do that? I read somewhere that I have to pass some parameter to the makeblastdb command... does anyone here know which one?

    Thanks to all.

  • #2
    When you do a BLAST search (e.g. blastp or blastn), there are several different output formats. The plain text and XML formats include the original FASTA record descriptions; however, these are not (currently) available in the tabular output.
    This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremo...


    Is that what you meant?


    • #3
      Yes, maybe that is useful. I think I could perhaps also do it with makeblastdb, because my problem is that I would like BLAST not to use the complete file with all the information for each sequence, but only the sequence ID.
      So, for example, the command could be this:

      makeblastdb -in db.fasta -title db -parse_seqids -gi_mask

      What do you think?

      And maybe later I could run blastx with -outfmt "6 qgi sgi"
      to get just a table of results showing only the GI for the query and the subject sequence.

      I'm trying them out now, since I don't know whether there is a way to see how makeblastdb built the database.
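
One way to inspect what makeblastdb stored is the blastdbcmd tool that ships with BLAST+. The following is only a sketch under assumptions: the database name db.fasta is a guess based on the command above (without -out, makeblastdb names the database after the input file), and the commands are skipped when BLAST+ or the database is not available.

```shell
# Assumed database name: makeblastdb uses the input file name when -out is not given.
DB=db.fasta

# Print a summary (title, number of sequences, total length); continue only if that works.
if command -v blastdbcmd >/dev/null 2>&1 && blastdbcmd -db "$DB" -info; then
    # List the accession (%a) and title (%t) stored for the first few entries.
    blastdbcmd -db "$DB" -entry all -outfmt "%a %t" | head -n 5
fi
```

If the titles printed here still contain the full description, the descriptions are part of the database, whatever the search output later shows.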


      • #4
        I only use -parse_seqids if my FASTA files are labeled using the NCBI style with pipe characters (the vertical bars, |, are called pipes). Otherwise I find it doesn't work very well.


        • #5
          The format of my FASTA file is from NCBI and it looks like this

          tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params

          I want BLAST to use only the first sequence code: H3ISY8

          and to show me only that in the results...

          The command I've written gives me a file of "0 0 0"... I don't know why.

          If I remove the -outfmt "6 qgi sgi" and give it only -outfmt "6", it returns a correct table.
          I'm continuing to try different input parameters.


          • #6
            So finally, I've looked at a lot of parameters and cannot do it. Can we conclude that it is not possible to create the binary database that BLAST uses with only the sequence ID?

            And is there also no way to have blastx show only this code in the results, instead of the three parts separated by pipes (|)?


            • #7
              Originally posted by angeloulivieri:
              The format of my FASTA file is from NCBI and it looks like this

              tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params

              I want BLAST to use only the first sequence code: H3ISY8

              The simplest way to do that is to make a new FASTA file using that as the ID, and make a BLAST database from that.
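
As a rough sketch of that first option: the function below rewrites pipe-delimited headers so that only the accession (the second field) remains, and leaves everything else untouched. The function name and the file names in the usage comment are made up for illustration, and the sketch assumes accessions are unique across the file.

```shell
# Hypothetical helper: ">tr|H3ISY8|H3ISY8_STRPU description" becomes ">H3ISY8".
# Headers without at least three pipe-delimited fields, and all sequence lines, pass through unchanged.
shorten_headers() {
    awk -F'[|]' '/^>/ && NF >= 3 { print ">" $2; next } { print }' "$@"
}

# Usage (placeholder file names):
#   shorten_headers uniprot.fasta > uniprot_acc_only.fasta
#   makeblastdb -in uniprot_acc_only.fasta -dbtype prot -title uniprot_acc_only
```

Because the description is dropped before makeblastdb ever sees it, the resulting database carries only the short IDs.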

              Personally I'd use the database as is and process the BLAST output in a script instead.
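
A minimal sketch of that second option, assuming the search was run with the default tabular output (-outfmt 6), where the subject ID is the second tab-separated column. The function name and file names are hypothetical.

```shell
# Hypothetical helper: replace a subject ID such as "tr|H3ISY8|H3ISY8_STRPU"
# in column 2 with just its accession ("H3ISY8"); other columns are kept as-is.
extract_accessions() {
    awk 'BEGIN { FS = OFS = "\t" }
         { n = split($2, parts, "[|]"); if (n >= 3) $2 = parts[2]; print }' "$@"
}

# Usage (placeholder file names):
#   extract_accessions hits.tsv > hits_acc_only.tsv
```

Rows whose subject ID has no pipe-delimited structure are printed unchanged, so the filter is safe to run on mixed output.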


              • #8
                OK, thanks... someone told me that there is a parameter to pass to makeblastdb... but maybe he's wrong...


                • #9
                  Originally posted by angeloulivieri:
                  OK, thanks... someone told me that there is a parameter to pass to makeblastdb... but maybe he's wrong...

                  As mentioned earlier, you might be able to do it via the makeblastdb -parse_seqids option, but that requires that your sequence identifiers follow the NCBI naming conventions with the pipe ("|") symbol.

                  If your FASTA file identifiers are not already in the expected format, you'd have to modify the FASTA file - and in my view in that case you might as well avoid using this option, and simply format the identifiers exactly as you want them.


                  • #10
                    Originally posted by maubp:
                    As mentioned earlier, you might be able to do it via the makeblastdb -parse_seqids option, but that requires that your sequence identifiers follow the NCBI naming conventions with the pipe ("|") symbol.

                    If your FASTA file identifiers are not already in the expected format, you'd have to modify the FASTA file - and in my view in that case you might as well avoid using this option, and simply format the identifiers exactly as you want them.

                    My FASTA file has this kind of header for each sequence:


                    tr|I1GCL2|I1GCL2_AMPQE Uncharacterized protein OS=Amphimedon queenslandica GN=LOC100637533 PE=4 SV=1


                    I would like makeblastdb to use only the ID I1GCL2 as the identifier. This matters to me because I want the database to be as light as possible to manage. I already have the other information collected in a database.

                    I used this command:
                    makeblastdb -in uniprot_kb_2012_06.fasta -title uniprot_kb_2012_06 -parse_seqids

                    but it doesn't work as I expected... it collects all the information in the header :-(
                    Last edited by angeloulivieri; 07-26-2012, 02:53 AM.


                    • #11
                      Does no one know how to do it?


                      • #12
                        You haven't said which output format you are using. The specially formatted identifiers (with the pipe characters) are how BLAST identifies an accession number - which you can ask for explicitly when using the tabular output.
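
For illustration, a sketch of asking for the accession explicitly in tabular output. It assumes the database was built with -parse_seqids, in which case the sacc column should hold the bare accession (e.g. H3ISY8) rather than the full pipe-delimited ID. The names query.fasta, db and hits.tsv are placeholders. A likely reason the earlier "6 qgi sgi" attempt printed only zeros is that UniProt-style records carry no GI numbers.

```shell
# Tabular columns: query ID, subject accession, E-value, bit score.
FIELDS="qseqid sacc evalue bitscore"

# Placeholder file and database names; skipped when BLAST+ or the query file is absent.
if command -v blastx >/dev/null 2>&1 && [ -f query.fasta ]; then
    blastx -query query.fasta -db db -outfmt "6 $FIELDS" -out hits.tsv
fi
```

This only changes what is printed in the results; the descriptions remain stored in the database itself.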
                        Last edited by maubp; 07-30-2012, 02:39 AM. Reason: corrected typo


                        • #13
                          I know that when I run blastx I can obtain tabular output with only the accession numbers, but that is a different problem. I would like makeblastdb to keep only the accession when it creates its binary database. The reason is that I already have accessions->descriptions in a database, and this would reduce the amount of information to manage when I later run blastx. I hope that is clear...

                          (Maybe something could be done with the formatdb command, but I see that it's an old command.)


                          • #14
                            Originally posted by angeloulivieri:
                            I know that when I run blastx I can obtain tabular output with only the accession numbers, but that is a different problem. I would like makeblastdb to keep only the accession when it creates its binary database. The reason is that I already have accessions->descriptions in a database, and this would reduce the amount of information to manage when I later run blastx. I hope that is clear...

                            (Maybe something could be done with the formatdb command, but I see that it's an old command.)

                            The old 'legacy' BLAST suite had the commands 'formatdb' and 'blastall', but those are replaced in the new BLAST+ suite: 'makeblastdb' builds the database, and for running BLAST you now get separate tools 'blastp', 'blastn', etc.

                            Anything you could do with 'formatdb' would (I hope) be supported in 'makeblastdb'.
