Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge gbk files

    Hi all, this might sound stupid but I would like to know how to combine multiple genbank files into one continuous file. I tried using <cat> to combine 4 gbk files from 4 chromosomes but when I try to view it on Artemis, only the first can be displayed. I found a thread where people suggested UGENE but I don't know why the software only put the annotation from 4 files side by side, but did not shift the nucleotide and amino sequences accordingly and it all appeared to be all NNNNN or XXXXX. I would like to view all the 4 chromosomes as one full genome in Artemis, so....if anyone can help me with this. Thanks and have a nice day!!!

  • #2
    Concatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely

    From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you.

    Comment


    • #3
      union from the EMBOSS suite will work, eg.

      $ union -sequence FileA.gbk -sformat genbank -outseq FileB.gbk -osformat genbank -feature Y -auto

      Comment


      • #4
        Originally posted by maubp View Post
        I did that with Biopython if that is an interesting solution for you.
        If BioPerl is more your thing then look at Bio::SeqUtils->cat

        Comment


        • #5
          How to use biopython to concatenate genbank files

          Originally posted by maubp View Post
          Concatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely

          From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you.
          Would you be able to give some details how to use biopython to merge multiple genbank files into a single sequence with Ns and keep all the annotations with adjusted coordinates?

          Thanks,

          Justin

          Comment


          • #6
            Originally posted by Roy View Post
            If BioPerl is more your thing then look at Bio::SeqUtils->cat
            Could number of Ns be added between two sequences at merging with Bioperl?

            Thanks,

            Justin

            Comment


            • #7
              Something like this, where you should replace "N"*50 with your desired spacer (this would be 50 N characters), and generalise for as many records as you have:

              Code:
              from Bio import SeqIO
              recordA = SeqIO.read("fileA.gbk", "genbank")
              recordB = SeqIO.read("fileB.gbk", "genbank")
              combined_record = recordA + ("N" * 50) + recordB
              SeqIO.write(combined_record, "combined.gbk", "genbank")

              Comment


              • #8
                Hi,
                the python script doesn't work on my genbank files.

                "ValueError: Need a Nucleotide or Protein alphabet"

                Anyway.. If I have several genbank files can I use the function:
                Code:
                in_genbank = open(sys.argv[2], 'r')
                out_genbank = open(sys.argv[3], 'w')
                instead of:
                Code:
                recordA = SeqIO.read("fileA.gbk", "genbank")
                recordB = SeqIO.read("fileB.gbk", "genbank")
                SeqIO.write(combined_record, "combined.gbk", "genbank")
                where sys.arg[2] is: *.gbk


                Code:
                $python gbk_merge.py *.gbk out.gbk

                Comment


                • #9
                  Normally the first line (the LOCUS line) of a GenBank file says if the record is DNA or protein - and Biopython uses that information. If this failed I would expect this error on trying to save as GenBank format:

                  "ValueError: Need a Nucleotide or Protein alphabet"

                  Can you show us the start of your GenBank records? e.g. at the Linux terminal prompt try:

                  Code:
                  $ head fileA.gbk
                  Once you have this working to merge two GenBank records with a spacer we can generalise this to merging many.
                  Last edited by maubp; 08-29-2016, 01:53 AM. Reason: Markdown doesn't work here, need to use forum's markup

                  Comment


                  • #10
                    This is the head:

                    $ head fileA.gbk
                    LOCUS DG_Contig_04
                    DEFINITION seqnum=1;seqlen=498669;seqhdr="DG_final_Contig_04";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.95;transl_table=11;uses_sd=1
                    FEATURES Location/Qualifiers
                    gene complement(3..413)
                    /ID="LPDG_0036"
                    /locus_tag="LPDG_0036"
                    /note="ID=1_1;partial=10;start_type=ATG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;score=21.71;cscore=5.19;sscore=16.52;rscore=10.95;uscore=2.29;tscore=3.94"
                    CDS complement(3..413)
                    /ID="LPDG_0036"
                    /locus_tag="LPDG_0036"

                    $head fileB.gbk
                    LOCUS DG_Contig_05
                    DEFINITION seqnum=1;seqlen=45471;seqhdr="DG_final_Contig_05";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.03;transl_table=11;uses_sd=1
                    FEATURES Location/Qualifiers
                    gene 3..176
                    /ID="LPDG_0520"
                    /locus_tag="LPDG_0520"
                    /note="ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=1.94;cscore=-0.73;sscore=2.68;rscore=0.00;uscore=3.18;tscore=0.00"
                    CDS 3..176
                    /ID="LPDG_0520"
                    /locus_tag="LPDG_0520"

                    Comment


                    • #11
                      You need to use the forums's [ code ] and [ /code ] tags (without spaces) in order to preserve the white space - but this confirms you don't have proper GenBank files.

                      Biopython should be giving you a warning, but the workaround is to tell the parser these are DNA sequences. Does this work for you with two files?

                      Code:
                      from Bio.Alphabet import generic_dna
                      from Bio import SeqIO
                      recordA = SeqIO.read("fileA.gbk", "genbank", generic_dna)
                      recordB = SeqIO.read("fileB.gbk", "genbank", generic_dna)
                      combined_record = recordA + ("N" * 50) + recordB
                      SeqIO.write(combined_record, "combined.gbk", "genbank")

                      Comment


                      • #12
                        now I receive this error:

                        Code:
                        ValueError: Specified alphabet DNAAlphabet() clashes with that determined from the file, Alphabet()
                        I think I have not Bio.Alphabet...

                        Comment


                        • #13
                          Can you email me the two input files so I can test this locally (on GoogleMail - I'll message you via the forum)? Also please confirm which version of Biopython you have.

                          Comment


                          • #14
                            Thanks for emailing the files. Hmm. Where did this "GenBank" files come from? That would be useful for us to know (e.g. for other people using the same tool), but right now my suggestion for setting the alphabet does not work.

                            Workaround (tested):

                            Code:
                            from Bio.Alphabet import generic_dna
                            from Bio import SeqIO
                            recordA = SeqIO.read("fileA.gbk", "genbank")
                            recordA.seq.alphabet = generic_dna
                            recordB = SeqIO.read("fileB.gbk", "genbank")
                            recordB.seq.alphabet = generic_dna
                            combined_record = recordA + ("N" * 50) + recordB
                            SeqIO.write(combined_record, "combined.gbk", "genbank")

                            Comment


                            • #15
                              Version which can be used on multiple files. In order to keep the interface simple this just takes input filenames and writes the record to stdout, e.g.

                              Code:
                              $ python merge_gbk.py fileA.gbk fileB.gbk > combined.gbk
                              Loading fileA.gbk
                              Loading fileB.gbk
                              Merging into one record
                              Or, you could use it this, but be careful using *.gbk as the input if your output file will also be in the same directory

                              Code:
                              $ python merge_gbk.py input_folder/*.gbk > output_folder/combined.gbk
                              The script also works on Python 3,

                              Code:
                              #!/usr/bin/env python
                              # Copyright 2016 Peter Cock, James Hutton Institute
                              # This example Biopython script is released for free re-use under
                              # the MIT license https://opensource.org/licenses/MIT
                              # For backgound see http://seqanswers.com/forums/showthread.php?t=28924
                              import sys
                              from Bio.Alphabet import generic_dna
                              from Bio import SeqIO
                              
                              if len(sys.argv) < 2:
                                  print("Usage: Takes as arguments filenames for input DNA GenBank files,")
                                  print("outputs merged record to stdout using 50bp runs of N as linker.")
                                  print("")
                                  print("python merge_gbk.py input*.gbk > combined.gbk")
                              
                              # argv[0] = python script
                              # argv[1] = first record
                              # argv[2] onwards = subsequence records
                              filename = sys.argv[1]
                              sys.stderr.write("Loading %s\n" % filename)
                              combined_record = SeqIO.read(filename, "genbank")
                              # Forcing DNA alphabet in case of improper input files
                              combined_record.seq.alphabet = generic_dna
                              for filename in sys.argv[2:]:
                                  sys.stderr.write("Loading %s\n" % filename)
                                  record = SeqIO.read(filename, "genbank")
                                  record.seq.alphabet = generic_dna
                                  combined_record = combined_record + ("N" * 50) + record
                              sys.stderr.write("Merging into one record\n")
                              count = SeqIO.write(combined_record, sys.stdout, "genbank")
                              assert count == 1, count
                              Last edited by maubp; 08-29-2016, 02:56 AM. Reason: Copyright statement for script

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X