  • Katty1
    Junior Member
    • May 2016
    • 1

    How do I remove part of the header of a FASTA file in Perl?

    I wrote a Perl script that splits a multi-FASTA file into one file per sequence, but I also need to remove part of the header string.

    For example:

    input file:
    >gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1
    CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
    CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

    Output file:
    >Nocardia farcinica genome assembly NCTC11134, chromosome : 1
    CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
    CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT

    I've tried everything I can think of, but I can't find a solution.
    Can someone help me?

    My script:

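    #!/usr/bin/perl

    use strict;
    use warnings;
    use IO::File;

    # Print a usage message and stop if no input file was given
    my $usage = "\nUSAGE: perl $0 <Fasta>\n";
    print $usage and exit unless $ARGV[0];

    my $input = IO::File->new($ARGV[0], "r") or die "Cannot open $ARGV[0]: $!";
    my $output;
    while (my $line = $input->getline) {
        chomp($line);
        if ($line =~ /^>/) {
            # New record: close the previous file, then open one named after the header
            $line =~ s/^>//;
            $output->close if $output;
            $output = IO::File->new("$line.fa", "w") or die "Cannot open $line.fa: $!";
            print $output ">".$line."\n";
        } elsif ($output) {
            # Sequence line: append it to the current record's file
            print $output $line."\n";
        }
    }
    $input->close;
    $output->close if $output;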
  • SNPsaurus
    Registered Vendor
    • May 2013
    • 525

    #2
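    Can you just split on " " with a limit of 2 and print the second element of the split array? Without the limit, the second element would be only the first word of the description ("Nocardia"). Or split on '|' and take the last element; the pipe has to be escaped as /\|/ because split treats its pattern as a regular expression. It just depends on how standard the formatting of the header is.
    my @header_split = split(" ", $line, 2);
    my $changed_line = $header_split[1];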
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
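
The split approach suggested above could be wrapped in a small helper. This is only a minimal sketch, assuming the defline follows the `>ID description` layout of the example in the question; `strip_id` is a hypothetical name and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the first word (the ID) of a FASTA defline
# and keep only the description text that follows it.
sub strip_id {
    my ($defline) = @_;
    $defline =~ s/^>//;    # tolerate a leading ">"
    # A limit of 2 keeps the whole description together in the second field.
    my (undef, $description) = split(" ", $defline, 2);
    # Assumption: a defline with no description is returned unchanged.
    return defined $description ? $description : $defline;
}

my $header = ">gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1";
print ">", strip_id($header), "\n";
```

In the script from the first post, the helper would be applied to `$line` before the header is printed, for example `print $output ">", strip_id($line), "\n";`.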


    • kmcarr
      Senior Member
      • May 2008
      • 1181

      #3
      Originally posted by Katty1
      I wrote a Perl script that splits a multi-FASTA file into one file per sequence, but I also need to remove part of the header string.

      For example:

      input file:
      Code:
      >gi|983431797|ref|NZ_LN868938.1| Nocardia farcinica genome assembly NCTC11134, chromosome : 1
      CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
      CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT
      Output file:
      Code:
      >Nocardia farcinica genome assembly NCTC11134, chromosome : 1
      CTGACTGGGAGTACGAAGGCCGCCTGCACAAGACAACGGGGCAGCGAACCTTCTTCTGCACCGGCACGGA
      CGACGCCGAGATGCCTCGACCTGGAGAACCTCGGCCGCGGCGAACCGCTCGCCCATGTCCGCGCCGAGTT
      Katty,

      I would caution you that what you plan to do is potentially problematic. The generally accepted format for FASTA file deflines is that the first word after the ">" represents the unique identifier for the sequence. The "first word" is defined as everything up to the first "whitespace" which may be a space or tab character. Everything that comes after that is optional description text. If you also have included in your analysis:

      Code:
      >gi|873551602|emb|LN868939.1| Nocardia farcinica genome assembly NCTC11134, plasmid : 2
      GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
      GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...
      Which you also edit to:

      Code:
      >Nocardia farcinica genome assembly NCTC11134, plasmid : 2
      GGCTTTGTGCCCGCCGAAAAAAGGTTGCCTATGTCCAAGCCTGCATTTACCGAAATCGACCGAATGACGG
      GCGGAGGGCGGAGTAATCGCACCCGCCCACCGGTCAACTTCCTTCTTCACACCGAGGAAGGAAACTCGAG...
      You have two sequences as part of your analysis which share the same ID, "Nocardia".
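
One way to catch this before it causes trouble is to check the edited deflines for repeated IDs. The following is a rough sketch, assuming the ID is the first whitespace-delimited word as described above; `duplicate_ids` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the IDs (first word of each defline)
# that occur more than once in a list of deflines.
sub duplicate_ids {
    my @deflines = @_;
    my (%count, @order);
    for my $defline (@deflines) {
        (my $text = $defline) =~ s/^>//;
        my ($id) = split(" ", $text);    # first word, up to a space or tab
        next unless defined $id;
        push @order, $id unless $count{$id}++;    # remember first-seen order
    }
    return grep { $count{$_} > 1 } @order;
}

my @edited = (
    ">Nocardia farcinica genome assembly NCTC11134, chromosome : 1",
    ">Nocardia farcinica genome assembly NCTC11134, plasmid : 2",
);
print "Duplicate ID: $_\n" for duplicate_ids(@edited);
```

For the two edited headers shown above, this reports "Nocardia" as a duplicate ID.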


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        One alternative would be to change all spaces to "_" so that you have a long string (that should stay unique) for each fasta header. It would be cumbersome but would at least avoid the problem @kmcarr pointed out.
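
The underscore idea could look roughly like this. It is a sketch only: `underscore_header` is a hypothetical name, and collapsing each run of whitespace into a single "_" (rather than replacing every space one for one) is an assumption made here to keep the IDs tidy.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the description into a single-word ID by
# replacing each run of whitespace with one underscore.
sub underscore_header {
    my ($description) = @_;
    $description =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
    $description =~ s/\s+/_/g;         # runs of whitespace become "_"
    return $description;
}

print ">", underscore_header("Nocardia farcinica genome assembly NCTC11134, chromosome : 1"), "\n";
```

This prints `>Nocardia_farcinica_genome_assembly_NCTC11134,_chromosome_:_1`, so the chromosome and plasmid records keep distinct first words.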
