Hi,
I have a problem concatenating files so that I can load them into Arlequin. I have a few thousand FASTA (or NEXUS) files, each representing a single gene. However, not all individuals are present in all files. I want to concatenate them so that missing genes are filled with Ns, ? or -, and so that genes are separated by a TAB. I have found many tools, but none of them allows TAB separation of loci. I would be grateful for any ideas.
Examples:
Input 1
Neo001_0 CGTAAAAATTTGTTCCGAAACATA
Neo001_1 CGTAAAAATTTGTTCCGAAACATA
Neo003_0 CGTAAAAATTTGTTCCGAAACATA
Neo003_1 CGTAAAAATTTGTTCCGAAACATA
Neo004_0 CGTAAAAATTTGTTCCGAAACATA
Neo004_1 CGTAAAAATTTGTTCCGAAACATA
Input 2
Neo001_0 CGATGTATTTGGTATCCTAC
Neo001_1 CGATGTATTTGGTATCCTAC
Neo002_0 CGATGTATTTGGTATCCTAC
Neo002_1 CGATGTATTTGGTATCCTAC
Neo003_0 CGATGTATTTGGTATCCTAC
Neo003_1 CGATGTATTTGGTATCCTAC
Neo004_0 CGATGTATTTGGTATCCTAC
Neo004_1 CGATGTATTTGGTATCCTAC
Expected output:
Neo001_0 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo001_1 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo002_0 NNNNNNNNNNNNNNNNNNNNNNNN CGATGTATTTGGTATCCTAC
Neo002_1 NNNNNNNNNNNNNNNNNNNNNNNN CGATGTATTTGGTATCCTAC
Neo003_0 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo003_1 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo004_0 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
Neo004_1 CGTAAAAATTTGTTCCGAAACATA CGATGTATTTGGTATCCTAC
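To make the intended merge concrete, here is a minimal Python sketch of the logic, not a finished tool. It assumes each gene has already been read into a dict mapping individual name to sequence (reading the FASTA or NEXUS files is left out), that all sequences within one gene have the same length, and that a TAB also goes between the name and the first gene. The function name `concatenate_loci` and its parameters are made up for illustration.

```python
def concatenate_loci(loci, missing="N", sep="\t"):
    """Merge per-gene dicts (name -> sequence) into one line per individual.

    Individuals absent from a gene get a run of `missing` characters
    as long as that gene's sequences.
    """
    # Every individual seen in any gene, sorted for a stable order.
    names = sorted({name for locus in loci for name in locus})

    # Assumption: sequences within a gene are aligned (equal length),
    # so the first sequence gives the padding length for that gene.
    lengths = [len(next(iter(locus.values()), "")) for locus in loci]

    rows = []
    for name in names:
        fields = [
            locus.get(name, missing * length)
            for locus, length in zip(loci, lengths)
        ]
        rows.append(name + sep + sep.join(fields))
    return rows


# The two example inputs from above, written out as dicts.
gene1 = {
    name: "CGTAAAAATTTGTTCCGAAACATA"
    for name in ["Neo001_0", "Neo001_1", "Neo003_0",
                 "Neo003_1", "Neo004_0", "Neo004_1"]
}
gene2 = {
    name: "CGATGTATTTGGTATCCTAC"
    for name in ["Neo001_0", "Neo001_1", "Neo002_0", "Neo002_1",
                 "Neo003_0", "Neo003_1", "Neo004_0", "Neo004_1"]
}

rows = concatenate_loci([gene1, gene2])
for row in rows:
    print(row)
```

With a few thousand genes, the list passed in would hold one dict per file, in the order the loci should appear in each line.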
Many thanks,
Martin