![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
How to merge two vcf files. | cardillox | Bioinformatics | 3 | 11-27-2019 10:25 PM |
Merge two gff3 files? | BobFreemanMA | Bioinformatics | 3 | 02-05-2013 05:53 AM |
Merge Fasq Files | yagoubali | Illumina/Solexa | 0 | 12-13-2012 06:45 PM |
1. embl/gbk to FASTA conversion; 2. 16s RNA to be found in a embl/gbk file | ashuchawla | Bioinformatics | 3 | 05-16-2012 08:00 AM |
Can we merge 2 csfasta files ? | tdm | SOLiD | 9 | 12-10-2010 10:10 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: Singapore Join Date: Apr 2013
Posts: 7
|
![]()
Hi all, this might sound stupid but I would like to know how to combine multiple genbank files into one continuous file. I tried using <cat> to combine 4 gbk files from 4 chromosomes but when I try to view it on Artemis, only the first can be displayed. I found a thread where people suggested UGENE but I don't know why the software only put the annotation from 4 files side by side, but did not shift the nucleotide and amino sequences accordingly and it all appeared to be all NNNNN or XXXXX. I would like to view all the 4 chromosomes as one full genome in Artemis, so....if anyone can help me with this.
![]() ![]() |
![]() |
![]() |
![]() |
#2 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Concatenating GenBank files to make a multi-record GenBank file should work fine - the question is how to get Artemis to display it nicely
![]() From memory last time I did this I had to merge the GenBank files into a single dummy record, for which I added an NNNNNNNNN spacer between the records so this break was shown in Artemis visually. I did that with Biopython if that is an interesting solution for you. |
![]() |
![]() |
![]() |
#3 |
Member
Location: Toronto Join Date: Jan 2011
Posts: 30
|
![]()
union from the EMBOSS suite will work, eg.
$ union -sequence FileA.gbk -sformat genbank -outseq FileB.gbk -osformat genbank -feature Y -auto |
![]() |
![]() |
![]() |
#4 |
Member
Location: Sheffield, UK Join Date: Oct 2009
Posts: 17
|
![]() |
![]() |
![]() |
![]() |
#5 | |
Member
Location: Winnipeg Join Date: May 2011
Posts: 18
|
![]() Quote:
Thanks, Justin |
|
![]() |
![]() |
![]() |
#6 |
Member
Location: Winnipeg Join Date: May 2011
Posts: 18
|
![]() |
![]() |
![]() |
![]() |
#7 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Something like this, where you should replace "N"*50 with your desired spacer (this would be 50 N characters), and generalise for as many records as you have:
Code:
from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank") recordB = SeqIO.read("fileB.gbk", "genbank") combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank") |
![]() |
![]() |
![]() |
#8 |
Member
Location: Paris Join Date: Feb 2014
Posts: 36
|
![]()
Hi,
the python script doesn't work on my genbank files. "ValueError: Need a Nucleotide or Protein alphabet" Anyway.. If I have several genbank files can I use the function: Code:
in_genbank = open(sys.argv[2], 'r') out_genbank = open(sys.argv[3], 'w') Code:
recordA = SeqIO.read("fileA.gbk", "genbank") recordB = SeqIO.read("fileB.gbk", "genbank") SeqIO.write(combined_record, "combined.gbk", "genbank") Code:
$python gbk_merge.py *.gbk out.gbk |
![]() |
![]() |
![]() |
#9 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Normally the first line (the LOCUS line) of a GenBank file says if the record is DNA or protein - and Biopython uses that information. If this failed I would expect this error on trying to save as GenBank format:
"ValueError: Need a Nucleotide or Protein alphabet" Can you show us the start of your GenBank records? e.g. at the Linux terminal prompt try: Code:
$ head fileA.gbk Last edited by maubp; 08-29-2016 at 02:53 AM. Reason: Markdown doesn't work here, need to use forum's markup |
![]() |
![]() |
![]() |
#10 |
Member
Location: Paris Join Date: Feb 2014
Posts: 36
|
![]()
This is the head:
$ head fileA.gbk LOCUS DG_Contig_04 DEFINITION seqnum=1;seqlen=498669;seqhdr="DG_final_Contig_04";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.95;transl_table=11;uses_sd=1 FEATURES Location/Qualifiers gene complement(3..413) /ID="LPDG_0036" /locus_tag="LPDG_0036" /note="ID=1_1;partial=10;start_type=ATG;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;score=21.71;cscore=5.19;sscore=16.52;rscore=10.95;uscore=2.29;tscore=3.94" CDS complement(3..413) /ID="LPDG_0036" /locus_tag="LPDG_0036" $head fileB.gbk LOCUS DG_Contig_05 DEFINITION seqnum=1;seqlen=45471;seqhdr="DG_final_Contig_05";version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=46.03;transl_table=11;uses_sd=1 FEATURES Location/Qualifiers gene 3..176 /ID="LPDG_0520" /locus_tag="LPDG_0520" /note="ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=1.94;cscore=-0.73;sscore=2.68;rscore=0.00;uscore=3.18;tscore=0.00" CDS 3..176 /ID="LPDG_0520" /locus_tag="LPDG_0520" |
![]() |
![]() |
![]() |
#11 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
You need to use the forums's [ code ] and [ /code ] tags (without spaces) in order to preserve the white space - but this confirms you don't have proper GenBank files.
Biopython should be giving you a warning, but the workaround is to tell the parser these are DNA sequences. Does this work for you with two files? Code:
from Bio.Alphabet import generic_dna from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank", generic_dna) recordB = SeqIO.read("fileB.gbk", "genbank", generic_dna) combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank") |
![]() |
![]() |
![]() |
#12 |
Member
Location: Paris Join Date: Feb 2014
Posts: 36
|
![]()
now I receive this error:
Code:
ValueError: Specified alphabet DNAAlphabet() clashes with that determined from the file, Alphabet() |
![]() |
![]() |
![]() |
#13 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Can you email me the two input files so I can test this locally (on GoogleMail - I'll message you via the forum)? Also please confirm which version of Biopython you have.
|
![]() |
![]() |
![]() |
#14 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Thanks for emailing the files. Hmm. Where did this "GenBank" files come from? That would be useful for us to know (e.g. for other people using the same tool), but right now my suggestion for setting the alphabet does not work.
Workaround (tested): Code:
from Bio.Alphabet import generic_dna from Bio import SeqIO recordA = SeqIO.read("fileA.gbk", "genbank") recordA.seq.alphabet = generic_dna recordB = SeqIO.read("fileB.gbk", "genbank") recordB.seq.alphabet = generic_dna combined_record = recordA + ("N" * 50) + recordB SeqIO.write(combined_record, "combined.gbk", "genbank") |
![]() |
![]() |
![]() |
#15 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
Version which can be used on multiple files. In order to keep the interface simple this just takes input filenames and writes the record to stdout, e.g.
Code:
$ python merge_gbk.py fileA.gbk fileB.gbk > combined.gbk Loading fileA.gbk Loading fileB.gbk Merging into one record Code:
$ python merge_gbk.py input_folder/*.gbk > output_folder/combined.gbk Code:
#!/usr/bin/env python # Copyright 2016 Peter Cock, James Hutton Institute # This example Biopython script is released for free re-use under # the MIT license https://opensource.org/licenses/MIT # For backgound see http://seqanswers.com/forums/showthread.php?t=28924 import sys from Bio.Alphabet import generic_dna from Bio import SeqIO if len(sys.argv) < 2: print("Usage: Takes as arguments filenames for input DNA GenBank files,") print("outputs merged record to stdout using 50bp runs of N as linker.") print("") print("python merge_gbk.py input*.gbk > combined.gbk") # argv[0] = python script # argv[1] = first record # argv[2] onwards = subsequence records filename = sys.argv[1] sys.stderr.write("Loading %s\n" % filename) combined_record = SeqIO.read(filename, "genbank") # Forcing DNA alphabet in case of improper input files combined_record.seq.alphabet = generic_dna for filename in sys.argv[2:]: sys.stderr.write("Loading %s\n" % filename) record = SeqIO.read(filename, "genbank") record.seq.alphabet = generic_dna combined_record = combined_record + ("N" * 50) + record sys.stderr.write("Merging into one record\n") count = SeqIO.write(combined_record, sys.stdout, "genbank") assert count == 1, count Last edited by maubp; 08-29-2016 at 03:56 AM. Reason: Copyright statement for script |
![]() |
![]() |
![]() |
Tags |
annotation, artemis, cat, genbank files, merge |
Thread Tools | |
|
|