Seqanswers Leaderboard Ad

**SNPsaurus** · 05-12-2016, 03:57 PM

Can you just split on " " and print the second element of the split array? Or split on '|' and take the last element. It just depends on how standard the formatting is of the header.
@header_split = split(" ",$line);
$changed_line = $header_split[1];

**kmcarr** · 05-13-2016, 06:53 AM

Originally posted by Katty1 View Post

I did a script in Perl that breaks several sequences of a multifasta file, but I need remove a part of string of header.

For example:

input file:

Code:

>gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

Output file:

Code:

>Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

Katty,

I would caution you that what you plan to do is potentially problematic. The generally accepted format for FASTA file deflines is that the first word after the ">" represents the unique identifier for the sequence. The "first word" is defined as everything up to the first "whitespace" which may be a space or tab character. Everything that comes after that is optional description text. If you also have included in your analysis:

Code:

>gi|873551602|emb|LN868939.1| Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...

Which you also edit to:

Code:

>Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...

You have two sequences as part of your analysis which share the same ID, "Nocardia".

**GenoMax** · 05-13-2016, 07:11 AM

One alternative would be to change all spaces to "_" so that you have a long string (that should stay unique) for each fasta header. It would be cumbersome but would at least avoid the problem @kmcarr pointed out.

Topics	Statistics	Last Post
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, Yesterday, 07:17 AM	0 responses 11 views 0 likes	Last Post by seqadmin Yesterday, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 20 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, 04-29-2024, 10:49 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-29-2024, 10:49 AM

Seqanswers Leaderboard Ad

Announcement

How do I remove a part of the header of a fasta file in perl?

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News