Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with FASTA parsing code.

    Hi guys!

    I'm working on a project where I am reading in values from a text file and then using them as search terms in a FASTA file. Ideally every time the program finds encounters the search term in the header section of the FASTA file, it should copy the header section of the FASTA file, along with the associated sequence into an output file. I've included examples of the kinds of data I am working with, along with the code I'm currently working on so you can get a better sense of what I am talking about.

    Currently the problem I've been having with my code is that when I run it, it only outputs the results of the search conducted with the final term in the text file. It's that the program ran through the entire text file without performing any searches on the file to be searched, before stopping on the last item in the text file and performing a search. I should warn you that I just started programming in perl recently, so it is quite possible that my code has a few glaring errors that I've missed.

    #Data from the text file I am using for search terms. Note these sequence ID names represent a subset of the sequences found in the FASTA file I am searching through. The actual text file is much larger, containing 1000s of terms.
    BF01013B2E03.f1
    BF01010B1A11.f1
    BF01028B1E03.f1
    BF01029A2C12.f1
    BF01028B2D11.f1

    #Data from input FASTA file. I am searching through this file using th +e items from the text file shown above. The actual input FASTA file contains 1000s more items
    > BF01013B2E03.f1 735 1 735 ABI CAATCCAAGAACATTTTGAAGAAAAAATCTCTTAAAAAAAAGAAATCAAAACAAGTGATCAAAAATGAAATGAATGGTCA
    > BF01010B1A11.f1 782 1 782 ABI AACGGACNANNCGGCAACCAGGAGGCCTTCCAAGCTGAACTGGGAGAGTGGATCAAGAAG
    > BF01028B2D11.f1 674 1 674 ABI CCAGCACNNNNTNAGATATTAGCCTAGCCTCTATGTCGTATTTGTATTTCNNCTAGTTTTTCATCCGACTTTTTTGGATC

    #output file note the output file has only one sequence in it. This is clearly not what I want. Something is wrong, I just can't put my finger on it. > BF01028B2D11.f1 674 1 674 ABI CCAGCACNNNNTNAGATATTAGCCTAGCCTCTATGTCGTATTTGTATTTCNNCTAGTTTTTCATCCGACTTTTTTGGATC #perl program
    <code>
    #!/usr/bin/perl
    use strict;
    use File::Basename;
    my $database;
    my $data = shift(@ARGV);
    my $input_file = shift(@ARGV);
    my $infile;
    my $match; my $output_file;
    my $ESTs_W_SNPs;
    $output_file = "ESTsWsnpsANDgoodCoverage.fasta";
    open ($database,'<', $data) or die "Cannot open $data\n";
    while ($match = <$database>)
    {
    chomp $match;
    open (IN, $input_file) || die "Can't open input file. Please provide a valid input filename.\n";
    open ($ESTs_W_SNPs, '>', $output_file) or die "Cannot write to $output_file\n";
    my ($seq, $prevhead) = (0, "", '');
    while(<IN>)
    {
    my $line = $_;
    $line =~ s/\r\n/\n/;
    chomp $line;
    $seq.= uc($line) if(eof(IN));
    if (/\>(.*)/ || eof(IN))
    {
    my $head=$1;
    printf $ESTs_W_SNPs ">$prevhead\n$seq\n" if($prevhead =~ /$ma +tch/); $prevhead = $head; $seq='';
    }
    else
    {
    $seq.=$line;
    }
    }
    close (IN);
    close ($ESTs_W_SNPs);
    }

  • #2
    You are re-opening the output file every time you grab a new input from your database file (the outermost while loop) which basically reinitializes the file. You need to move the open/close statements for the output file outside your main program loop.

    Also, you approach is not the most efficient. You are scanning through your entire FASTA file for each record you want to save. A better approach is to read in your search list and store it in a hash. Then go through your FASTA file just once, comparing each record to your list (hash). If the ID is in your hash then write the sequence to your output file. If the ID is not in your list then move on to the next record.

    Comment


    • #3
      So, uh, wouldn't you just do
      Code:
      grep -F -f file1 -A 1 file2
      where file1 is your search terms and file2 is the full FASTA file?

      See the man page for GNU grep for more info.

      Comment


      • #4
        Originally posted by chalex View Post
        So, uh, wouldn't you just do
        Code:
        grep -F -f file1 -A 1 file2
        where file1 is your search terms and file2 is the full FASTA file?

        See the man page for GNU grep for more info.
        You're method only copies the first line of sequence after the definition line. There is no way to tell how many lines of sequence are follow a given definition line so you need some way to examine the lines following your matched definition.

        Comment


        • #5
          Originally posted by kmcarr View Post
          You're method only copies the first line of sequence after the definition line. There is no way to tell how many lines of sequence are follow a given definition line so you need some way to examine the lines following your matched definition.
          Funny, I was just working on this problem today. I quality filtered some Illumina PE data (reads 1 and 2 in separate files) and after that they didn't have exactly matched reads in them anymore. I needed to produce new files for each read containing only those reads with a partner in the other quality filtered file. Eventually I figured out that grep does work for this if you tell it to pull only the 4 lines after the read ID match:

          Code:
               my $lineone = `grep -n $readID $read_file_name]`;
               my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
               $linenum = $linenum +3;
               my @printone = `head -$linenum $read_file_name | tail -4`;

          Comment


          • #6
            tools do exist

            Is there a reason not to use the pattern of,
            indexing the sequence fasta file with whatever blast like tool
            you are use to, then feeding the set of identifiers to the command
            that extracts the fasta records from the indexed sequence?
            especially if you may want to partition out other subset of sequence.
            something like ...

            classic ncbi blast
            Code:
            formatdb  -p F -o T -i sequence.fasta -n blastdb
            fastacmd  -d blastdb  -i accession.list  -o isolated.sequences
            wublast
            Code:
            xdformat -n -I -o blastdb sequence.fasta  
            xdget -n -f  blastdb  accession.list > isolated.sequences
            new blast+ (have never used yet)
            Code:
            makeblastdb -in sequence.fasta -dbtype nucl  -hash_index  - parse_seqids  -out blastdb
            blastdbcmd  -entry_batch accession.list -db blastdb -dbtype nucl
            nor I have been using the newer tools, bowtie, bwa ...
            long enough to to know if they have a similar mode
            but it would not surprise me.

            Comment


            • #7
              pyfasta http://pypi.python.org/pypi/pyfasta/

              or flatten the fasta so that each sequence only occupies a single line then use grep?
              Last edited by Seqasaurus; 03-28-2011, 02:39 PM. Reason: link for pyfasta

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              23 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              24 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              21 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              52 views
              0 likes
              Last Post seqadmin  
              Working...
              X