SEQanswers

02-12-2020, 05:22 AM   #1
schwentner
Junior Member
 
Location: Austria

Join Date: Feb 2020
Posts: 1
Concatenating fasta files with TAB separating loci

Hi,

I am having trouble concatenating files so that I can load them into Arlequin. I have a few thousand fasta (or nexus) files, each representing a single gene. However, not all individuals are present in all files. I want to concatenate these in such a way that missing genes are filled with Ns (or ? or -) and that genes are separated by a TAB. I have found many tools, but none of them allows TAB separation of loci. I would be grateful for any ideas.

Examples:

Input 1
Neo001_0 CGTAAAAATTTGTTCCGAAACATA
Neo001_1 CGTAAAAATTTGTTCCGAAACATA
Neo003_0 CGTAAAAATTTGTTCCGAAACATA
Neo003_1 CGTAAAAATTTGTTCCGAAACATA
Neo004_0 CGTAAAAATTTGTTCCGAAACATA
Neo004_1 CGTAAAAATTTGTTCCGAAACATA


Input 2
Neo001_0 CGATGTATTTGGTATCCTAC
Neo001_1 CGATGTATTTGGTATCCTAC
Neo002_0 CGATGTATTTGGTATCCTAC
Neo002_1 CGATGTATTTGGTATCCTAC
Neo003_0 CGATGTATTTGGTATCCTAC
Neo003_1 CGATGTATTTGGTATCCTAC
Neo004_0 CGATGTATTTGGTATCCTAC
Neo004_1 CGATGTATTTGGTATCCTAC




Expected output:
Neo001_0 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo001_1 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo002_0 NNNNNNNNNNNNNNNNNNNNNNNN CGATGTATTTGGTATCCTAC
Neo002_1 NNNNNNNNNNNNNNNNNNNNNNNN CGATGTATTTGGTATCCTAC
Neo003_0 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo003_1 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo004_0 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo004_1 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC


Many thanks,
Martin
02-12-2020, 12:16 PM   #2
ATϟGC
Member
 
Location: Canada

Join Date: Jun 2013
Posts: 52

The concat command of SeqKit [https://bioinf.shenwei.me/seqkit/usage/#concat] will concatenate sequences with matching IDs from two fasta files.

I converted your inputs into fasta format, and the following command concatenated the sequences with matching IDs.

$ seqkit concat input1.txt input2.txt > output.txt

However, it did not add the Ns, and there is no delimiter between the two sequences. Maybe you can get it to add a delimiter.
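
For anyone repeating the conversion step mentioned above, here is a minimal sketch of how the two-column examples from post #1 could be turned into fasta and back with awk. The file names (input1.txt, input1.fasta, roundtrip.txt) are only illustrative, and the sketch assumes one ID and one sequence per line, separated by whitespace.

```shell
# Illustrative two-column input (hypothetical file name; contents follow the examples in post #1)
printf '%s\n' 'Neo001_0 CGTAAAAATTTGTTCCGAAACATA' 'Neo001_1 CGTAAAAATTTGTTCCGAAACATA' > input1.txt

# Two-column text -> fasta: the ID becomes the header line, the sequence the next line
awk 'NF >= 2 { print ">" $1; print $2 }' input1.txt > input1.fasta

# fasta -> two-column text: wrapped sequence lines are joined,
# and only the first word of each header is kept as the ID
awk '
    /^>/ { if (id != "") print id, seq; id = substr($1, 2); seq = ""; next }
         { seq = seq $0 }
    END  { if (id != "") print id, seq }
' input1.fasta > roundtrip.txt
```

The second awk call is the direction needed if the real data are fasta files and a line-based tool should be used afterwards.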
02-13-2020, 09:45 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,173

Sometimes the old solutions are the best solutions: the good ol' Unix join command.

Code:
join -e 'NNNNNNNNNNNNNNNNNNNNNNNN' -1 1 -2 1 -a 1 -a 2 -o '0,1.2,2.2' Input1.txt Input2.txt
Assumptions
  • Your input files as shown are named Input1.txt and Input2.txt
  • The input files have both been sorted prior to running the join command
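
A minimal sketch of the sorting assumption, plus one possible way to get the TAB delimiter asked for in post #1. join writes a single space between output fields by default, so the sketch simply pipes the result through tr. The shortened sample files, the *.sorted.txt names, and joined.tsv are illustrative only, and the approach assumes that neither IDs nor sequences contain spaces.

```shell
# Illustrative inputs (shortened versions of the examples in post #1)
printf '%s\n' 'Neo003_0 CGTAAAAATTTGTTCCGAAACATA' 'Neo001_0 CGTAAAAATTTGTTCCGAAACATA' > Input1.txt
printf '%s\n' 'Neo001_0 CGATGTATTTGGTATCCTAC' 'Neo003_0 CGATGTATTTGGTATCCTAC' 'Neo002_0 CGATGTATTTGGTATCCTAC' > Input2.txt

# join expects both files sorted on the join field; LC_ALL=C keeps sort and join
# on the same collation order
LC_ALL=C sort -k1,1 Input1.txt > Input1.sorted.txt
LC_ALL=C sort -k1,1 Input2.txt > Input2.sorted.txt

# Build a filler of 24 Ns (the length of locus 1) without typing it by hand
pad=$(printf '%24s' '' | tr ' ' 'N')

# Outer join on the ID column, then turn the space delimiter into TABs
LC_ALL=C join -e "$pad" -a 1 -a 2 -o '0,1.2,2.2' \
    Input1.sorted.txt Input2.sorted.txt | tr ' ' '\t' > joined.tsv
```

One caveat: the -e filler has a single fixed length and is used for whichever file lacks the ID, so it only pads correctly when the missing locus happens to have that length (here, locus 1 with 24 bases).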

Read the man page for join to learn what each of the command-line parameters does.
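
Since join works on two files at a time and post #1 mentions a few thousand loci, here is a hedged awk sketch of the same outer-join idea for any number of files. It is a different technique from the join command, not a replacement tested on real data. It assumes two whitespace-separated columns (ID, sequence) per file, pads each missing locus with Ns to the longest sequence seen in that file, and writes TAB-separated columns in the order the files are given. The locus*.txt names, their contents, and concatenated.tsv are made up for illustration.

```shell
# Illustrative per-locus files (hypothetical names and contents)
printf '%s\n' 'Neo001_0 CGTAAAAATTTGTTCCGAAACATA' 'Neo003_0 CGTAAAAATTTGTTCCGAAACATA' > locus1.txt
printf '%s\n' 'Neo001_0 CGATGTATTTGGTATCCTAC' 'Neo002_0 CGATGTATTTGGTATCCTAC' > locus2.txt
printf '%s\n' 'Neo002_0 ACGT' 'Neo003_0 ACGT' > locus3.txt

awk '
    FNR == 1 { nfiles++ }                 # first line of each file starts a new locus column
    NF >= 2 {
        seq[$1, nfiles] = $2              # sequence of this individual at this locus
        ids[$1] = 1                       # every individual seen in any file
        if (length($2) > len[nfiles]) len[nfiles] = length($2)
    }
    END {
        for (id in ids) {
            line = id
            for (f = 1; f <= nfiles; f++) {
                if ((id, f) in seq) {
                    s = seq[id, f]
                } else {                  # locus missing: pad with Ns to that locus length
                    s = sprintf("%" len[f] "s", "")
                    gsub(/ /, "N", s)
                }
                line = line "\t" s
            }
            print line
        }
    }
' locus1.txt locus2.txt locus3.txt | LC_ALL=C sort > concatenated.tsv
```

Everything is held in memory until the END block, and the input files do not need to be sorted; the final sort only puts the individuals in a stable order. Completely empty files are skipped and do not produce a column.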
Tags
concatenated genes, fasta, nexus

All times are GMT -8.

