#1 |
Junior Member
Location: New York Join Date: May 2016
Posts: 1
I did a script in Perl that breaks several sequences of a multifasta file, but I need remove a part of string of header.
For example: input file: >gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1 CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT Output file: >Nocardia farcinica genome assembly NCTC11134, chromosome : 1 CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT I've done everything, but I can't think of a solution to this. Can someone help me? My script: #!/usr/bin/perl use strict; use warnings; use IO::File; my $file = "\nFILE: perl $0 <Fasta>"."\n"; print $file and exit unless($ARGV[0]); my $input = IO::File->new("$ARGV[0]"), my $output; while(my $line = $input->getline){ chomp($line); if($line =~ /^>/){ $line =~ s/^>//; $output = IO::File->new("> $line.fa"); print $output ">".$line."\n"; }else{ print $output $line."\n"; } } close($input); close($output); |
#2 |
Registered Vendor
Location: Eugene, OR Join Date: May 2013
Posts: 521
Can you just split on whitespace and keep everything after the first element? Or split on '|' and take the last element. It depends on how standardized the header formatting is.
my @header_split = split(" ", $line, 2);  # a limit of 2 keeps the rest of the header in one piece
my $changed_line = $header_split[1];
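As a rough illustration of the split approach, here is a minimal sketch. The helper name `split_defline` is hypothetical (it is not from the original script), and the sketch assumes the identifier never contains whitespace, which holds for the example header in this thread but may not hold for every file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the original script):
# split a FASTA defline into its identifier and its description at the
# first run of whitespace.
sub split_defline {
    my ($defline) = @_;
    $defline =~ s/^>//;    # drop the leading ">" if it is still there
    # A limit of 2 keeps the whole description, spaces included, in one field.
    my ($id, $description) = split ' ', $defline, 2;
    return ($id, defined $description ? $description : '');
}

# Example header taken from the question.
my $header = '>gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1';
my ($id, $description) = split_defline($header);
print ">$description\n";
```

In the original loop, the returned description could be used to build both the output file name and the new header line, though that is a design choice the thread does not settle.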
#3 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
I would caution you that what you plan to do is potentially problematic. The generally accepted format for FASTA file deflines is that the first word after the ">" represents the unique identifier for the sequence. The "first word" is defined as everything up to the first whitespace, which may be a space or tab character. Everything that comes after that is optional description text.

Suppose you have also included this record in your analysis:

Code:
>gi|873551602|emb|LN868939.1| Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...

Your change would turn it into:

Code:
>Nocardia farcinica genome assembly NCTC11134, plasmid : 2
GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...

Both the chromosome and the plasmid records would then have "Nocardia" as their first word, so the identifier would no longer be unique.
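One way to see the identifier problem in practice is to check the rewritten deflines for repeated first words. The following is only a sketch: `duplicate_ids` is a hypothetical helper, and the two deflines are the stripped chromosome and plasmid headers from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): return the
# identifiers, i.e. the first word of each defline, that occur more than once.
sub duplicate_ids {
    my @deflines = @_;
    my %count;
    for my $defline (@deflines) {
        (my $text = $defline) =~ s/^>//;    # work on a copy without the ">"
        my ($id) = split ' ', $text;        # first whitespace-delimited word
        next unless defined $id;
        $count{$id}++;
    }
    my @dups   = grep { $count{$_} > 1 } keys %count;
    my @sorted = sort @dups;
    return @sorted;
}

# The two headers as they would look after the identifier is removed.
my @rewritten = (
    '>Nocardia farcinica genome assembly NCTC11134, chromosome : 1',
    '>Nocardia farcinica genome assembly NCTC11134, plasmid : 2',
);
print "Duplicate identifier: $_\n" for duplicate_ids(@rewritten);
```

Running the same check on the original headers should report nothing, since their `gi|...|` identifiers differ.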
#4 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
One alternative would be to change all spaces to "_" so that you have a long string (that should stay unique) for each fasta header. It would be cumbersome but would at least avoid the problem @kmcarr pointed out.
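A minimal sketch of the underscore idea follows. The helper name `underscore_defline` is hypothetical, and the sketch assumes it is applied to the description text that remains after the `gi|...|` part has been stripped; whether to strip that part first is a choice the thread leaves open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): collapse every run
# of whitespace in a defline into a single underscore, so the whole header
# becomes one "word" and can serve as the identifier.
sub underscore_defline {
    my ($defline) = @_;
    $defline =~ s/^>//;       # drop the leading ">" if present
    $defline =~ s/\s+/_/g;    # spaces and tabs become underscores
    return $defline;
}

for my $header (
    '>Nocardia farcinica genome assembly NCTC11134, chromosome : 1',
    '>Nocardia farcinica genome assembly NCTC11134, plasmid : 2',
) {
    print '>', underscore_defline($header), "\n";
}
```

The two resulting identifiers differ only near the end ("chromosome_:_1" versus "plasmid_:_2"), so they stay unique here, but any tool that truncates long identifiers could still merge them.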