Hi all, this might sound stupid but I would like to know how to combine multiple genbank files into one continuous file. I tried using <cat> to combine 4 gbk files from 4 chromosomes but when I try to view it on Artemis, only the first can be displayed. I found a thread where people suggested UGENE but I don't know why the software only put the annotation from 4 files side by side, but did not shift the nucleotide and amino sequences accordingly and it all appeared to be all NNNNN or XXXXX. I would like to view all the 4 chromosomes as one full genome in Artemis, so....if anyone can help me with this. Thanks and have a nice day!!!
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
-
Concatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely
From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you.
-
How to use biopython to concatenate genbank files
Originally posted by maubp View PostConcatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely
From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you.
Thanks,
Justin
Comment
-
Something like this, where you should replace "N"*50 with your desired spacer (this would be 50 N characters), and generalise for as many records as you have:
Code:from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank") recordB = SeqIO.read("fileB.gbk", "genbank") combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank")
Comment
-
Hi,
the python script doesn't work on my genbank files.
"ValueError: Need a Nucleotide or Protein alphabet"
Anyway.. If I have several genbank files can I use the function:
Code:in_genbank = open(sys.argv[2], 'r') out_genbank = open(sys.argv[3], 'w')
Code:recordA = SeqIO.read("fileA.gbk", "genbank") recordB = SeqIO.read("fileB.gbk", "genbank") SeqIO.write(combined_record, "combined.gbk", "genbank")
Code:$python gbk_merge.py *.gbk out.gbk
Comment
-
Normally the first line (the LOCUS line) of a GenBank file says if the record is DNA or protein - and Biopython uses that information. If this failed I would expect this error on trying to save as GenBank format:
"ValueError: Need a Nucleotide or Protein alphabet"
Can you show us the start of your GenBank records? e.g. at the Linux terminal prompt try:
Code:$ head fileA.gbk
Last edited by maubp; 08-29-2016, 01:53 AM. Reason: Markdown doesn't work here, need to use forum's markup
Comment
-
This is the head:
$ head fileA.gbk
LOCUS DG_Contig_04
DEFINITION seqnum=1;seqlen=498669;seqhdr="DG_final_Contig_04";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.95;transl_table=11;uses_sd=1
FEATURES Location/Qualifiers
gene complement(3..413)
/ID="LPDG_0036"
/locus_tag="LPDG_0036"
/note="ID=1_1;partial=10;start_type=ATG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;score=21.71;cscore=5.19;sscore=16.52;rscore=10.95;uscore=2.29;tscore=3.94"
CDS complement(3..413)
/ID="LPDG_0036"
/locus_tag="LPDG_0036"
$head fileB.gbk
LOCUS DG_Contig_05
DEFINITION seqnum=1;seqlen=45471;seqhdr="DG_final_Contig_05";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.03;transl_table=11;uses_sd=1
FEATURES Location/Qualifiers
gene 3..176
/ID="LPDG_0520"
/locus_tag="LPDG_0520"
/note="ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=1.94;cscore=-0.73;sscore=2.68;rscore=0.00;uscore=3.18;tscore=0.00"
CDS 3..176
/ID="LPDG_0520"
/locus_tag="LPDG_0520"
Comment
-
You need to use the forums's [ code ] and [ /code ] tags (without spaces) in order to preserve the white space - but this confirms you don't have proper GenBank files.
Biopython should be giving you a warning, but the workaround is to tell the parser these are DNA sequences. Does this work for you with two files?
Code:from Bio.Alphabet import generic_dna from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank", generic_dna) recordB = SeqIO.read("fileB.gbk", "genbank", generic_dna) combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank")
Comment
-
Thanks for emailing the files. Hmm. Where did this "GenBank" files come from? That would be useful for us to know (e.g. for other people using the same tool), but right now my suggestion for setting the alphabet does not work.
Workaround (tested):
Code:from Bio.Alphabet import generic_dna from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank") recordA.seq.alphabet = generic_dna recordB = SeqIO.read("fileB.gbk", "genbank") recordB.seq.alphabet = generic_dna combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank")
Comment
-
Version which can be used on multiple files. In order to keep the interface simple this just takes input filenames and writes the record to stdout, e.g.
Code:$ python merge_gbk.py fileA.gbk fileB.gbk > combined.gbk Loading fileA.gbk Loading fileB.gbk Merging into one record
Code:$ python merge_gbk.py input_folder/*.gbk > output_folder/combined.gbk
Code:#!/usr/bin/env python # Copyright 2016 Peter Cock, James Hutton Institute # This example Biopython script is released for free re-use under # the MIT license https://opensource.org/licenses/MIT # For backgound see http://seqanswers.com/forums/showthread.php?t=28924 import sys from Bio.Alphabet import generic_dna from Bio import SeqIO if len(sys.argv) < 2: print("Usage: Takes as arguments filenames for input DNA GenBank files,") print("outputs merged record to stdout using 50bp runs of N as linker.") print("") print("python merge_gbk.py input*.gbk > combined.gbk") # argv[0] = python script # argv[1] = first record # argv[2] onwards = subsequence records filename = sys.argv[1] sys.stderr.write("Loading %s\n" % filename) combined_record = SeqIO.read(filename, "genbank") # Forcing DNA alphabet in case of improper input files combined_record.seq.alphabet = generic_dna for filename in sys.argv[2:]: sys.stderr.write("Loading %s\n" % filename) record = SeqIO.read(filename, "genbank") record.seq.alphabet = generic_dna combined_record = combined_record + ("N" * 50) + record sys.stderr.write("Merging into one record\n") count = SeqIO.write(combined_record, sys.stdout, "genbank") assert count == 1, count
Comment
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 11:49 AM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
Today, 11:49 AM
|
||
Started by seqadmin, Yesterday, 08:47 AM
|
0 responses
16 views
0 likes
|
Last Post
by seqadmin
Yesterday, 08:47 AM
|
||
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
61 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
60 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
Comment