
  • Biopython - want to get a batch of amino acid fastas from list of entrez gene_ids

    I have a list of Entrez Gene IDs (~100) and I would like to obtain the amino acid fastas of each and create a multi-fasta file.

    I'm trying to do this using the Entrez.efetch function in biopython but I'm not sure how to retrieve the amino acid sequence from the gene file.

    Any ideas?

  • #2
    An easy way to get the sequence is to ask Entrez.efetch() to return a FASTA-formatted sequence, as described in the Biopython tutorial at http://biopython.org/DIST/docs/tutor...al.html#htoc55 - note the rettype="fasta" argument. You can then treat the returned handle like any other FASTA stream (i.e. as if it were a file).



    • #3
      That should work perfectly.

      Can biopython convert ids? For example from entrez GeneIDs to protein accession numbers?



      • #4
        The NCBI can convert the gene IDs to protein IDs, try Entrez link (elink). See also:

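A minimal sketch of the elink lookup suggested above, in Python 3. It asks NCBI for the protein records linked to each gene ID and collects them per gene. The function names and the email argument are made up for illustration. The code also assumes the parsed result is a list of link sets, each with an "IdList" and a "LinkSetDb" list whose "Link" entries carry an "Id", which is how Entrez.read usually presents elink output; inspect the result for your own IDs before relying on it.

```python
def protein_ids_from_linksets(linksets):
    """Collect linked protein IDs per gene ID from a parsed elink result."""
    mapping = {}
    for linkset in linksets:
        proteins = []
        # A gene can have several LinkSetDb entries (one per link name),
        # so merge them and drop duplicates while keeping the order.
        for linksetdb in linkset.get("LinkSetDb", []):
            for link in linksetdb["Link"]:
                if link["Id"] not in proteins:
                    proteins.append(link["Id"])
        for gene_id in linkset["IdList"]:
            mapping.setdefault(gene_id, []).extend(proteins)
    return mapping


def gene_to_protein_ids(gene_ids, email):
    """Query NCBI elink (needs network access) and return {gene_id: [protein_ids]}."""
    # Imported here so the parsing helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.elink(dbfrom="gene", db="protein", id=list(gene_ids))
    linksets = Entrez.read(handle)
    handle.close()
    return protein_ids_from_linksets(linksets)
```

The resulting protein IDs can then be passed to Entrez.efetch with db="protein" and rettype="fasta". Each LinkSetDb entry also has a "LinkName", which you could use to keep only one kind of link (for example RefSeq proteins) if the merged list is too broad.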


        • #5
          Ok so using the tutorial, I developed the following code (using trial and error):

          from Bio import Entrez
          from Bio import SeqIO
          Entrez.email = "my_name@my_website.com"
          # One protein ID per line; strip newlines and skip blank lines
          with open('pids_test.csv') as id_file:
              id_list = set(line.strip() for line in id_file if line.strip())
          handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                                 id=",".join(sorted(id_list)))
          for seq_record in SeqIO.parse(handle, "fasta"):
              print(">" + seq_record.id, seq_record.description)
              print(seq_record.seq)
          handle.close()

          This prints exactly what I want. I have two questions:

          1) How can I get the results into a text file, rather than printing them in my output?

          2) How can I let the user specify the input file (command line is fine)?

          K



          • #6
            To save the NCBI FASTA formatted data to a file, try something like this:

            Code:
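            from Bio import Entrez
            from Bio import SeqIO
            Entrez.email = "my_name@my_website.com"
            # One protein ID per line; strip newlines and skip blank lines
            with open('pids_test.csv') as id_file:
                id_list = set(line.strip() for line in id_file if line.strip())
            handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                                   id=",".join(sorted(id_list)))
            # Copy the FASTA text from NCBI straight into the output file
            with open("saved.fasta", "w") as out_handle:
                for line in handle:
                    out_handle.write(line)
            handle.close()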
            P.S. There is a very similar example in the Biopython Tutorial, in the section "EFetch: Downloading full records from Entrez".


            To take the filename from the command line, read up on sys.argv; to prompt the user instead, try the input function or similar. Any good introduction to Python should cover both.
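As a rough sketch of both options in Python 3 (in Python 2 the prompt function would be raw_input): take the file names from sys.argv when they are given, and prompt for whichever is missing. The function name resolve_paths and the prompt wording are made up for illustration.

```python
import sys


def resolve_paths(argv, ask=input):
    """Return (input_path, output_path), prompting for any not given in argv."""
    args = argv[1:]  # argv[0] is the script name
    in_path = args[0] if len(args) > 0 else ask("File of protein IDs: ")
    out_path = args[1] if len(args) > 1 else ask("Output FASTA file: ")
    return in_path, out_path


# Typical use at the top of a script:
#     in_path, out_path = resolve_paths(sys.argv)
```

Passing the prompt function in as an argument is only there so the helper can be tried out without typing at a prompt; calling input directly works just as well in a small script.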
            Last edited by maubp; 01-08-2013, 09:34 AM. Reason: Added link



            • #7
              Worked great, added sys.argv to allow user to specify file input and output:


              import sys
              from Bio import Entrez
              from Bio import SeqIO
              Entrez.email = "xxxxxXXXXXxxxxx"
              # sys.argv[1] is the file of protein IDs, sys.argv[2] the output FASTA file
              with open(sys.argv[1]) as id_file:
                  id_list = set(line.strip() for line in id_file if line.strip())
              handle = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                                     id=",".join(sorted(id_list)))
              with open(sys.argv[2], 'w') as out_handle:
                  for line in handle:
                      out_handle.write(line)
              handle.close()
