SEQanswers

Old 05-12-2016, 04:38 PM   #1
Katty1
Junior Member
 
Location: New York

Join Date: May 2016
Posts: 1
How do I remove a part of the header of a fasta file in perl?

I wrote a Perl script that splits a multi-FASTA file into one file per sequence, but I also need to remove part of the header string.

For example:

input file:
>gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

Output file:
>Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

I've tried everything, but I can't come up with a solution to this.
Can someone help me?

My script:

#!/usr/bin/perl

use strict;
use warnings;
use IO::File;

my $file = "\nFILE: perl $0 <Fasta>"."\n";
print $file and exit unless($ARGV[0]);
my $input = IO::File->new("$ARGV[0]");
my $output;
while(my $line = $input->getline){
    chomp($line);
    if($line =~ /^>/){
        $line =~ s/^>//;
        $output = IO::File->new("> $line.fa");
        print $output ">".$line."\n";
    }else{
        print $output $line."\n";
    }
}
close($input);
close($output);
Old 05-12-2016, 04:57 PM   #2
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

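Can you just split on whitespace and print the second element of the split array? Pass a limit of 2 so the whole description stays in one piece; without the limit, the second element is only the first word of the description ("Nocardia"). Or split on '|' and take the last element; the pipe has to be escaped (split /\|/, $line) because split treats its first argument as a pattern. It just depends on how standard the formatting of the header is.
my @header_split = split(" ", $line, 2);
my $changed_line = $header_split[1];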
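
For anyone who wants to try the split idea outside the original script, here is a minimal sketch. The helper name strip_id is made up for illustration, and it assumes the ID and the description are separated by whitespace, as in the example header from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): drop the first
# whitespace-delimited word (the ID) and keep the rest of the defline.
sub strip_id {
    my ($defline) = @_;
    (my $text = $defline) =~ s/^>//;    # remove the leading ">" if present
    # Limit of 2: the description stays in one piece, spaces included
    my ($id, $description) = split " ", $text, 2;
    # If there is no description, fall back to the ID itself
    return $description // $id // '';
}

my $header = ">gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1";
print ">", strip_id($header), "\n";
```

In the script from the first post, the same split could be applied to $line right after the leading ">" is removed and before the output file is opened.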
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 05-13-2016, 07:53 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by Katty1 View Post
I wrote a Perl script that splits a multi-FASTA file into one file per sequence, but I also need to remove part of the header string.

For example:

input file:
Code:
>gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT
Output file:
Code:
>Nocardia farcinica genome assembly NCTC11134, chromosome : 1
CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT
Katty,

I would caution you that what you plan to do is potentially problematic. The generally accepted convention for FASTA deflines is that the first word after the ">" is the unique identifier for the sequence. The "first word" is everything up to the first whitespace, which may be a space or tab character. Everything after that is optional description text. Suppose your analysis also includes:

Code:
>gi|873551602|emb|LN868939.1| Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...
and you also edit it to:

Code:
>Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...
You now have two sequences in your analysis that share the same ID, "Nocardia".
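
A quick way to see the collision is to pull out the first word of each edited defline and count repeats. This is only a sketch: first_word_id and duplicate_ids are hypothetical helper names, and the two deflines are the edited headers shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: the ID is the first word after ">",
# i.e. everything up to the first whitespace.
sub first_word_id {
    my ($defline) = @_;
    my ($id) = $defline =~ /^>?\s*(\S+)/;
    return $id // '';
}

# Return the IDs that occur more than once in a list of deflines.
sub duplicate_ids {
    my @deflines = @_;
    my %count;
    $count{ first_word_id($_) }++ for @deflines;
    my @dups = grep { $count{$_} > 1 } sort keys %count;
    return @dups;
}

my @edited = (
    ">Nocardia farcinica genome assembly NCTC11134, chromosome : 1",
    ">Nocardia farcinica genome assembly NCTC11134, plasmid : 2",
);
print "Duplicate ID: $_\n" for duplicate_ids(@edited);
```

With the original deflines the first words differ (one starts with gi|983431797, the other with gi|873551602), so the same check reports nothing.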
Old 05-13-2016, 08:11 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,088
Default

One alternative would be to change all spaces to "_" so that you have a long string (that should stay unique) for each fasta header. It would be cumbersome but would at least avoid the problem @kmcarr pointed out.
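
A rough sketch of that idea, assuming the gi|...| prefix is dropped first and the underscores are applied to what remains (the helper name underscore_header is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: drop the original ID, then turn the description
# into a single word by replacing whitespace with underscores.
sub underscore_header {
    my ($defline) = @_;
    (my $text = $defline) =~ s/^>//;                  # remove leading ">"
    my (undef, $description) = split " ", $text, 2;   # discard the ID
    $description //= $text;        # no description: keep the ID itself
    $description =~ s/\s+$//;      # trim trailing whitespace
    $description =~ s/\s+/_/g;     # each run of whitespace becomes "_"
    return ">" . $description;
}

print underscore_header(">gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1"), "\n";
print underscore_header(">gi|873551602|emb|LN868939.1| Nocardia farcinica genome assembly NCTC11134, plasmid : 2"), "\n";
```

The two example headers then end in chromosome_:_1 and plasmid_:_2, so their first words differ. The comma and colon are left in place here; whether downstream tools accept them in an ID is something to check for your own pipeline.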