Hi all, this might sound stupid but I would like to know how to combine multiple genbank files into one continuous file. I tried using <cat> to combine 4 gbk files from 4 chromosomes but when I try to view it on Artemis, only the first can be displayed. I found a thread where people suggested UGENE but I don't know why the software only put the annotation from 4 files side by side, but did not shift the nucleotide and amino sequences accordingly and it all appeared to be all NNNNN or XXXXX. I would like to view all the 4 chromosomes as one full genome in Artemis, so....if anyone can help me with this. Thanks and have a nice day!!!
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
-
Concatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely
From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you.
-
How to use biopython to concatenate genbank files
Originally posted by maubp View PostConcatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely
From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you.
Thanks,
Justin
Comment
-
Something like this, where you should replace "N"*50 with your desired spacer (this would be 50 N characters), and generalise for as many records as you have:
Code:from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank") recordB = SeqIO.read("fileB.gbk", "genbank") combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank")
Comment
-
Hi,
the python script doesn't work on my genbank files.
"ValueError: Need a Nucleotide or Protein alphabet"
Anyway.. If I have several genbank files can I use the function:
Code:in_genbank = open(sys.argv[2], 'r') out_genbank = open(sys.argv[3], 'w')
Code:recordA = SeqIO.read("fileA.gbk", "genbank") recordB = SeqIO.read("fileB.gbk", "genbank") SeqIO.write(combined_record, "combined.gbk", "genbank")
Code:$python gbk_merge.py *.gbk out.gbk
Comment
-
Normally the first line (the LOCUS line) of a GenBank file says if the record is DNA or protein - and Biopython uses that information. If this failed I would expect this error on trying to save as GenBank format:
"ValueError: Need a Nucleotide or Protein alphabet"
Can you show us the start of your GenBank records? e.g. at the Linux terminal prompt try:
Code:$ head fileA.gbk
Last edited by maubp; 08-29-2016, 01:53 AM. Reason: Markdown doesn't work here, need to use forum's markup
Comment
-
This is the head:
$ head fileA.gbk
LOCUS DG_Contig_04
DEFINITION seqnum=1;seqlen=498669;seqhdr="DG_final_Contig_04";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.95;transl_table=11;uses_sd=1
FEATURES Location/Qualifiers
gene complement(3..413)
/ID="LPDG_0036"
/locus_tag="LPDG_0036"
/note="ID=1_1;partial=10;start_type=ATG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;score=21.71;cscore=5.19;sscore=16.52;rscore=10.95;uscore=2.29;tscore=3.94"
CDS complement(3..413)
/ID="LPDG_0036"
/locus_tag="LPDG_0036"
$head fileB.gbk
LOCUS DG_Contig_05
DEFINITION seqnum=1;seqlen=45471;seqhdr="DG_final_Contig_05";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.03;transl_table=11;uses_sd=1
FEATURES Location/Qualifiers
gene 3..176
/ID="LPDG_0520"
/locus_tag="LPDG_0520"
/note="ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=1.94;cscore=-0.73;sscore=2.68;rscore=0.00;uscore=3.18;tscore=0.00"
CDS 3..176
/ID="LPDG_0520"
/locus_tag="LPDG_0520"
Comment
-
You need to use the forums's [ code ] and [ /code ] tags (without spaces) in order to preserve the white space - but this confirms you don't have proper GenBank files.
Biopython should be giving you a warning, but the workaround is to tell the parser these are DNA sequences. Does this work for you with two files?
Code:from Bio.Alphabet import generic_dna from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank", generic_dna) recordB = SeqIO.read("fileB.gbk", "genbank", generic_dna) combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank")
Comment
-
Thanks for emailing the files. Hmm. Where did this "GenBank" files come from? That would be useful for us to know (e.g. for other people using the same tool), but right now my suggestion for setting the alphabet does not work.
Workaround (tested):
Code:from Bio.Alphabet import generic_dna from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank") recordA.seq.alphabet = generic_dna recordB = SeqIO.read("fileB.gbk", "genbank") recordB.seq.alphabet = generic_dna combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank")
Comment
-
Version which can be used on multiple files. In order to keep the interface simple this just takes input filenames and writes the record to stdout, e.g.
Code:$ python merge_gbk.py fileA.gbk fileB.gbk > combined.gbk Loading fileA.gbk Loading fileB.gbk Merging into one record
Code:$ python merge_gbk.py input_folder/*.gbk > output_folder/combined.gbk
Code:#!/usr/bin/env python # Copyright 2016 Peter Cock, James Hutton Institute # This example Biopython script is released for free re-use under # the MIT license https://opensource.org/licenses/MIT # For backgound see http://seqanswers.com/forums/showthread.php?t=28924 import sys from Bio.Alphabet import generic_dna from Bio import SeqIO if len(sys.argv) < 2: print("Usage: Takes as arguments filenames for input DNA GenBank files,") print("outputs merged record to stdout using 50bp runs of N as linker.") print("") print("python merge_gbk.py input*.gbk > combined.gbk") # argv[0] = python script # argv[1] = first record # argv[2] onwards = subsequence records filename = sys.argv[1] sys.stderr.write("Loading %s\n" % filename) combined_record = SeqIO.read(filename, "genbank") # Forcing DNA alphabet in case of improper input files combined_record.seq.alphabet = generic_dna for filename in sys.argv[2:]: sys.stderr.write("Loading %s\n" % filename) record = SeqIO.read(filename, "genbank") record.seq.alphabet = generic_dna combined_record = combined_record + ("N" * 50) + record sys.stderr.write("Merging into one record\n") count = SeqIO.write(combined_record, sys.stdout, "genbank") assert count == 1, count
Comment
Latest Articles
Collapse
-
by seqadmin
Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...-
Channel: Articles
03-22-2024, 06:39 AM -
-
by seqadmin
The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.
Avian Conservation
Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...-
Channel: Articles
03-08-2024, 10:41 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 03-27-2024, 06:37 PM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
03-27-2024, 06:37 PM
|
||
Started by seqadmin, 03-27-2024, 06:07 PM
|
0 responses
11 views
0 likes
|
Last Post
by seqadmin
03-27-2024, 06:07 PM
|
||
Started by seqadmin, 03-22-2024, 10:03 AM
|
0 responses
53 views
0 likes
|
Last Post
by seqadmin
03-22-2024, 10:03 AM
|
||
Started by seqadmin, 03-21-2024, 07:32 AM
|
0 responses
69 views
0 likes
|
Last Post
by seqadmin
03-21-2024, 07:32 AM
|
Comment