  • Slicing genbank file using biopython [problem]

    Hello, I'd like to slice multiple GenBank files with Biopython, each at a different location: slice genome 1 at location 1, and so on. This is what I have coded so far:
    >>> from Bio import Entrez, SeqIO
    >>> Entrez.email = "[email protected]"
    >>> ident = 'AE009948','AE009947'
    >>> coor = '1256617:1311411','1973169:2005648'
    >>> for i in ident:
    ...     handle = Entrez.efetch(db="nucleotide", id=i, rettype="gb")
    ...     record = SeqIO.read(handle, "gb")
    >>> for j in coor:
    ...     sub_record = record[j]
    The problem is that I get this error: ValueError: Invalid index
    or, if I try sub_record = record(j) instead: TypeError: 'SeqRecord' object is not callable
    Can someone help me?
    Thanks in advance

  • #2
    You have a couple of problems. Firstly, the values inside "coor" are strings, not ranges, so trying to use them directly as slice indices won't work. You could try:

    Code:
    coors = [[1256617, 1311411], [1973169, 2005648]]
    for bounds in coors :
        sub_record = record[bounds[0]:bounds[1]]
    and that would likely work. Of course, then you run into the problem that the coordinates you gave are beyond the end of the sequence you retrieved. Also, you ask for two records and then overwrite the first with the second. I presume you want the "for j in coor:" loop inside the "for i in ident:" loop.
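
If the coordinates have to stay as "start:end" strings, one option is to convert them to integers before slicing. This is only a minimal sketch: parse_bounds is a hypothetical helper name, and a plain Python string stands in for the fetched record (a Biopython SeqRecord accepts the same record[start:end] slice syntax).

```python
# Hypothetical helper: turn a "start:end" string into two integers.
def parse_bounds(text):
    start, end = text.split(":")
    return int(start), int(end)

# A plain string stands in for the fetched record here (assumption for
# illustration); slicing a SeqRecord uses the same [start:end] syntax.
sequence = "ACGTACGTACGTACGT"
coor = ["2:6", "8:12"]

pieces = []
for text in coor:
    start, end = parse_bounds(text)
    pieces.append(sequence[start:end])

print(pieces)
```

Passing the raw string, as in record["2:6"], is what triggers the indexing error, because a slice needs integers.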



    • #3
      Yeah, you are right, I want the "for bounds in coors" loop inside the "for i in ident" loop. I'll try your suggestion and tell you if it works. Anyway, thank you for your time!

      Edit: I want to slice the first genome with the first location, then slice the second genome with the second location (later it will be 100 genomes and 100 locations).

      Anyway, I'll try your suggestion and come back later!
      Last edited by CrLs; 01-20-2014, 04:58 AM.



      • #4
        Ah, then just get rid of the "for j in coor" loop: once the slicing is nested inside the "for i in ident" loop, you can index coor by the same position as ident.
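
One way to pair each genome with its own coordinates, sketched here with Python's built-in zip instead of a manual index. The jobs list is a hypothetical name used only to collect the pairs for illustration.

```python
ident = ["AE009948", "AE009947"]
coors = [(1256617, 1311411), (1973169, 2005648)]

# zip pairs the first accession with the first range, the second
# accession with the second range, and so on.
jobs = []
for accession, (start, end) in zip(ident, coors):
    jobs.append((accession, start, end))
    print("%s: %i-%i" % (accession, start, end))
```

With 100 genomes and 100 locations the loop stays the same; only the two lists grow.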



        • #5
          Well, at the moment it gives me this error:
          TypeError: slice indices must be integers or None or have an __index__ method
          Should I change my coor to something else?
          And about removing the "for j in coor" loop, how can I nest that within the "for i in ident" loop? Something like "for i in ident and for bounds in coors:"?

          Again, thanks a lot for your answer!

          Edit: I changed my coor; I had forgotten to put the '[ ]', my bad!



          • #6
            Watch out for different counting conventions when you do the slicing...

            Also, you could ask the NCBI to pre-slice the records when you call Entrez.efetch by including the optional seq_start and seq_stop arguments, see: http://www.ncbi.nlm.nih.gov/books/NB...hapter4.EFetch
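
To illustrate the counting issue: GenBank coordinates are 1-based and include both ends, while Python slices are 0-based and exclude the end. A minimal sketch of the conversion follows; to_python_slice is a hypothetical helper name, and a short plain string stands in for a real sequence.

```python
def to_python_slice(start, end):
    """Convert 1-based inclusive coordinates to a 0-based, end-exclusive slice."""
    return slice(start - 1, end)

sequence = "ACGTTGCA"
# Bases 2 through 4 in 1-based counting are C, G, T.
region = sequence[to_python_slice(2, 4)]
print(region)
```

Only the start shifts by one; the end stays the same because the inclusive end and the exclusive end cancel out.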



            • #7
              Hello,
              Yep, thank you, I'll check it!
              Hmm, to use the optional arguments, should I write 3 loops? One over the genomes, one over the starts and one over the stops, right? (I want to slice the first genome with the first location, the second genome with the second location, etc.)



              • #8
                I would use ONE loop, something like this:

                Code:
                from Bio import Entrez, SeqIO
                Entrez.email = "[email protected]"
                for i, start, end in [('AE009948', 1256617, 1311411),
                                      ('AE009947', 1973169, 2005648)]:
                    print("Fetching %s:%i-%i now..." % (i, start, end))
                    #code here using Entrez.efetch(...)
                Last edited by maubp; 01-20-2014, 09:00 AM. Reason: typo



                • #9
                  Well, thank you for your answer. I'll try your way and the old way, and I'll keep the faster one! (I don't know if one takes more memory than the other.)
                  Anyway, thanks a lot! I'll come back with working code when I'm done with it.



                  • #10
                    OK Peter and Ryan, thank you for your help!
                    This is the working code; it gets you all the products (or anything else you need) between the locations you want.
                    Code:
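                    from Bio import Entrez, SeqIO

                    Entrez.email = "[email protected]"
                    regions = [('AE009948', 1256617, 1311411),
                               ('AE009948', 1973169, 2005648)]
                    with open('resultsRegion_note.csv', 'a') as results2:
                        for i, start, end in regions:
                            # seq_start/seq_stop ask NCBI to return only this region
                            handle = Entrez.efetch(db="nucleotide", id=i, seq_start=start,
                                                   seq_stop=end, rettype="gb", retmode="text")
                            for seq_record in SeqIO.parse(handle, "gb"):
                                results2.write('\n')
                                for feature in seq_record.features:
                                    if feature.type == "CDS":
                                        # 'product' is a list of strings; CDS features without one are skipped
                                        for product in feature.qualifiers.get('product', []):
                                            results2.write(product + '\n')
                            handle.close()
                    Feel free to use it (even though I think a lot of people could do the same).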
                    Last edited by CrLs; 01-20-2014, 09:14 AM.
