Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • andreitudor
    Member
    • Feb 2011
    • 21

    Extract sequence from multi fasta file with PERL

    Hello,

    I want to make a script that will extract a sequence from a multifasta file. I have made a file with 19 genomes from NCBI and I want to be able to extract a genome that I need from it. I have played a little with Bio::SeqIO and learned how to create a object that has the hole file in it and then right the whole information in another file.
    What i do not know how to do is how to select only the fasta sequence from the genome that I want and write it onto a file.
    I think I have to create a third object that would contain only that genome, but I still don't know how to read that part from my sequence object...

    Thanks in advance.
  • boetsie
    Senior Member
    • Feb 2010
    • 245

    #2
    Hi,

    I've attached a script which can do this. If i understand it correctly you have a file like;

    >chr1
    AGCTGATGATAGT...
    >chr2
    ACAAAATAGTCGAT....
    >chr3
    ....

    And your perl script would be something like;

    perl extractSequence.pl genomefile.fa chr1

    where 'chr1' corresponds to a sequence named chr1 (indicated by chr1)?

    Say you have a more complicated file like;

    >chr1_coverage1000_length100
    AGATGTATGTTAGA

    You can do something like;

    perl extractSequence.pl genomefile.fa chr1_.

    which will extract all the sequences containing the header chr1_

    To store the results, do;

    perl extractSequence.pl genomefile.fa chr1 > filename.txt

    If this is what you want, you can use my script.

    Boetsie
    Attached Files

    Comment

    • maasha
      Senior Member
      • Apr 2009
      • 153

      #3
      This can be done with Biopieces (www.biopieces.org) like this:

      Code:
      read_fasta -i ncbi_genomes.fna | grab -p my_genome | write_fasta -o my_genome.fna -x

      Cheers


      Martin

      Comment

      • NM_010117
        Member
        • Oct 2010
        • 14

        #4
        This script might be a little easier to wrap your head around (there's less code anyway) and does the same thing as boetsie's:

        Code:
        #!/usr/bin/perl -w
        use strict;
        my $file = shift;
        my $pattern = shift;
        my $data = `cat $file`;
        my ($fa) = $data =~ /(>$pattern[^>]+)/s;
        print $fa;
        Mind you boetsie's implementation has the benefit of not having to read in the entire file every time (i.e. her implementation "slurps" in the file to avoid needing to use a large amount of memory). However, this of course isn't an issue with small files (e.g. bacterial genomes).

        Comment

        • ssully
          Member
          • Aug 2010
          • 48

          #5
          another way (if you're already set up for local BLAST searches) -- use 'fastacmd' to extract whatever records you need from a fasta db you've created using 'formatdb'

          (this refers to the old NCBI BLAST -- the newer NCBI 'blast+' uses different but analogous toolset:

          http://www.ncbi.nlm.nih.gov/staff/ta.../pc_setup.html)

          Comment

          • apc2010
            Junior Member
            • Jan 2010
            • 1

            #6
            If you need sequences extracted from a multi-FASTA and are open to using a pre-existing tool, I would also suggest either the faSomeRecords or faOneRecord command line utilities from UCSC.

            They have versions of this tool for OSX and Linux. Here is a link to the executable downloads:



            The difference between the two: faOneRecord takes the sequence name to extract from the command line, faSomeRecords reads in a file of 1 or more sequence names to extract from the multi-FASTA.

            Usage:
            Code:
            ================================================================
            ========   faOneRecord   ====================================
            ================================================================
            faOneRecord - Extract a single record from a .FA file
            usage:
               faOneRecord in.fa recordName
            
            ================================================================
            ========   faSomeRecords   ====================================
            ================================================================
            faSomeRecords - Extract multiple fa records
            usage:
               faSomeRecords in.fa listFile out.fa
            options:
               -exclude - output sequences not in the list file.

            Comment

            • kmcarr
              Senior Member
              • May 2008
              • 1181

              #7
              There was a previous thread which explained how to do this with the BLAST tools (old and new)

              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


              The problem with using Bio::SeqIO (or a simple perl script in general) is that there is no indexing of the file. This means that to pull out a specific sequence you have to scan through the file from the beginning until you find the sequence you are interested in. If your input file isn't too huge and you only need to do it for a small number of sequences or very infrequently that would work. But if the input file is ginormous (e.g. if your 19 genomes from NCBI are all big, mammalian genomes) and you want to repeatedly extract individual sequences then an indexed method like the BLAST tools is advisable.
              Last edited by kmcarr; 02-16-2011, 01:02 PM.

              Comment

              • krobison
                Senior Member
                • Nov 2007
                • 734

                #8
                There's also Bio:B::Fasta module for Perl. One additional advantage to this tool is that you can easily get a specific range of a sequence as a subsequence.

                I'm guilty of frequently doing this as a one-liner -- slow on a large database and a pain for a long list of sequences (or if you get the regex wrong), but sometimes it's easier than remembering a more specific tool.

                Code:
                perl -n -e '$on=(/^>(idA|idB|idC)/) if (/^>/); print $_ if ($on);' sequences.fa

                Comment

                • andreitudor
                  Member
                  • Feb 2011
                  • 21

                  #9
                  Hey guys,

                  Thanks for your answers. I do not want to use the tools provided from blast+ because I want to do it on my own. I looked into BIO:: DB:Fasta and I decided to use it, since it has a function that indexes your fasta file and then you can search by ID.
                  Now I have one problem, when I create the indexed file BIO:: DB:Fasta gives me this error :

                  indexing was interrupted, so unlinking reference.fna.index at /usr/share/perl5/Bio/DB/Fasta.pm line 1053.
                  Was not able to open files!!
                  Full error is


                  ------------- EXCEPTION: Bio::Root::Exception -------------
                  MSG: Each line of the fasta entry must be the same length except the last.
                  Line above #46357 '
                  ..' is 69 != 71 chars.
                  STACK: Error::throw
                  STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:368
                  STACK: Bio:B::Fasta::calculate_offsets /usr/share/perl5/Bio/DB/Fasta.pm:770
                  STACK: Bio:B::Fasta::index_file /usr/share/perl5/Bio/DB/Fasta.pm:680
                  STACK: Bio:B::Fasta::new /usr/share/perl5/Bio/DB/Fasta.pm:491
                  STACK: extract.pl:22
                  -----------------------------------------------------------

                  From this I understand that the lines in my fasta file are not equal. The only problem is that the line that is not the same length is the ID line. What can I do about this??

                  Andrei

                  Comment

                  • krobison
                    Senior Member
                    • Nov 2007
                    • 734

                    #10
                    Check the line lengths; the program is complaining that there is a sequence line of different length; it could well be some stray spaces; take a look at line #46357 with a text editor

                    This little program will find the numbers for all your "illegal" lines
                    [CODE]
                    #!/usr/bin/perl
                    use strict;
                    ## find & print line number for all "illegal" FASTA lines
                    my %lengths=();
                    my $lastId=undef;
                    my $lineCount=0;
                    my $len=undef;
                    while ($_ = <>)
                    {
                    $lineCount++;
                    chomp;
                    if (/^>(\S+)/)
                    {
                    $lastId=$1;
                    if (defined $len) # last line of entry allowed to be ragged
                    {
                    delete $lengths{$len}->{$lineCount-1};
                    }
                    }
                    else
                    {
                    $len=length($_);
                    $lengths{$len}={} unless (ref $lengths{$len});
                    $lengths{$len}->{$lineCount}=$lastId;
                    }
                    }

                    my @lengths=sort {scalar(keys %{$lengths{$b}})<=>scalar(keys %{$lengths{$a}})} keys %lengths;
                    my $modalLength=shift(@lengths);
                    print "# modalLength=$modalLength ",scalar(keys %{$lengths{$modalLength}}),"\n";
                    foreach my $len(@lengths)
                    {
                    foreach my $lineCount(sort {$a<=>$b} keys %{$lengths{$len}})
                    {
                    print join("\t",$len,$lineCount,$lengths{$len}->{$lineCount}),"\n";
                    }
                    }
                    [/CODE

                    Comment

                    • mghita
                      Member
                      • Aug 2011
                      • 10

                      #11
                      Hi

                      Hi,

                      I might sound dumb to people around here , but how do I install/open/use the command? I tried faSomeRecords in.fa listFile out.fa with my files and it says "-bash: faSomeRecords: command not found", I added it to my path too. I'm new in the area and have no clue

                      Madalina



                      Originally posted by apc2010 View Post
                      If you need sequences extracted from a multi-FASTA and are open to using a pre-existing tool, I would also suggest either the faSomeRecords or faOneRecord command line utilities from UCSC.

                      They have versions of this tool for OSX and Linux. Here is a link to the executable downloads:



                      The difference between the two: faOneRecord takes the sequence name to extract from the command line, faSomeRecords reads in a file of 1 or more sequence names to extract from the multi-FASTA.

                      Usage:
                      Code:
                      ================================================================
                      ========   faOneRecord   ====================================
                      ================================================================
                      faOneRecord - Extract a single record from a .FA file
                      usage:
                         faOneRecord in.fa recordName
                      
                      ================================================================
                      ========   faSomeRecords   ====================================
                      ================================================================
                      faSomeRecords - Extract multiple fa records
                      usage:
                         faSomeRecords in.fa listFile out.fa
                      options:
                         -exclude - output sequences not in the list file.

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        Originally posted by mghita View Post
                        Hi,

                        I might sound dumb to people around here , but how do I install/open/use the command? I tried faSomeRecords in.fa listFile out.fa with my files and it says "-bash: faSomeRecords: command not found", I added it to my path too. I'm new in the area and have no clue

                        Madalina
                        Madalina,

                        I will assume that you have downloaded the correct version of the compiled program for your operating system and saved it in a folder (or home dir).

                        You may need to make this file executable by adding the correct unix permission: chmod u+x /path_to/faSomeRecords (if you are in the same directory where this program is then use chmod u+x ./faSomeRecords).

                        Now you should be able to execute the file by following the example form earlier post.
                        e.g. /path_to/faSomeRecords your_data_file

                        Comment

                        • mghita
                          Member
                          • Aug 2011
                          • 10

                          #13
                          I have added the program to my path and I set the permission right, but now I have another issue:
                          "You need the Rosetta software to run faSomeRecords. The Rosetta installer is in Optional Installs on your Mac OS X installation disc."

                          and I don't have Rosetta installed, or the CD for installation, so I don't know how to handle this problem. Any suggestions?


                          Thanks,
                          Madalina


                          Originally posted by GenoMax View Post
                          Madalina,

                          I will assume that you have downloaded the correct version of the compiled program for your operating system and saved it in a folder (or home dir).

                          You may need to make this file executable by adding the correct unix permission: chmod u+x /path_to/faSomeRecords (if you are in the same directory where this program is then use chmod u+x ./faSomeRecords).

                          Now you should be able to execute the file by following the example form earlier post.
                          e.g. /path_to/faSomeRecords your_data_file

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Originally posted by mghita View Post
                            I have added the program to my path and I set the permission right, but now I have another issue:
                            "You need the Rosetta software to run faSomeRecords. The Rosetta installer is in Optional Installs on your Mac OS X installation disc."

                            and I don't have Rosetta installed, or the CD for installation, so I don't know how to handle this problem. Any suggestions?


                            Thanks,
                            Madalina
                            Madalina,

                            If you are connected to the internet you should automatically be offered the option to download rosetta and install it.

                            Do you have a PowerPC- or an intel-based Mac? What OS are you running?

                            Comment

                            • mghita
                              Member
                              • Aug 2011
                              • 10

                              #15
                              Originally posted by GenoMax View Post
                              Madalina,

                              If you are connected to the internet you should automatically be offered the option to download rosetta and install it.

                              Do you have a PowerPC- or an intel-based Mac? What OS are you running?


                              I have Mac OS X 10.6.8, 3.06 GHz. I just get that message in bash, I don't get any install option. I tried to download it, but it doesn't work.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...