Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • clarissaboschi
    Member
    • Apr 2010
    • 63

    How to extract the longest sequence from a fasta file

    I have a fasta file with many sequences, an I would like to get just the longest one (does not matter the id). How can I do that?

    I would like to have as output a fasta file with only the longest sequence.

    thanks!
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    You can make a read length histogram with BBMap's readlength.sh (or reformat.sh lhist), then run:

    Code:
    reformat.sh in=file.fa out=longest.fa minlength=X

    Comment

    • pmiguel
      Senior Member
      • Aug 2008
      • 2328

      #3
      Must. Not. Write. Perl. One. Liner.

      --
      Phillip

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Originally posted by clarissaboschi View Post
        I have a fasta file with many sequences, an I would like to get just the longest one (does not matter the id). How can I do that?
        What about the case where there is more than one sequence which match the longest length? Save all? Only the first or last?

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          Originally posted by pmiguel View Post
          Must. Not. Write. Perl. One. Liner.

          --
          Phillip
          Go right ahead

          Originally posted by kmcarr
          What about the case where there is more than one sequence which match the longest length? Save all? Only the first or last?
          These are interesting questions that make me wonder whether the original problem is worth solving. I'm guessing the OP wants the main chromosome from a perfect bacterial assembly, but perhaps the Clarissa should chime in on what she's trying to do.

          Comment

          • Biocomputronics
            Junior Member
            • Sep 2017
            • 6

            #6
            This sort of task is nearly trivial in Python, it is well worth your while to learn it and the important packages such as BioPython which make jokes like seen in this thread all the more funny.

            So here is a quick and dirty way. A little thought could make it more elegant and/or generalizable. This would just parse through your multifasta file and spit back the longest sequence.

            Code:
            from Bio import SeqIO
            
            myList = []
            
            for seq_record in SeqIO.parse("test.fa", "fasta"):
                myList.append([seq_record.id, str(seq_record.seq), len(seq_record)])
            
            myList.sort(key=lambda x: x[2])
            
            print("the longest sequence is:")
            print(">", myList[-1][0], sep='')
            print(myList[-1][1])

            Comment

            • clarissaboschi
              Member
              • Apr 2010
              • 63

              #7
              Thanks for the suggestions.
              I have a list of protein ids, and I need to retrieve the respective nucleotide sequence. I am using eutils to retrieve them based on these ids, but for some ids, I have many sequences for one id. So for this specific case, I am selecting the longest sequence.

              Comment

              • Richard Finney
                Senior Member
                • Feb 2009
                • 701

                #8
                /*
                how to compile: gcc -Wall -O2 -o longestseq longestseq.c
                example: ./longestseq example.fa

                */

                #include <stdlib.h>
                #include <string.h>
                #include <stdio.h>

                char s[100000]; // input buffer
                int main(int argc, char *argv[])
                {
                FILE *fp;
                int i;
                size_t cur_spot;
                size_t last_spot;
                size_t best_spot;
                size_t cursize;
                size_t bestsofarsize;

                if (argc != 2) { fprintf(stderr,"Error: need inputfilefasta. This program finds first longest fasta sequence(just one, no same length "ties"). Usage: ./longestseq example.fa\n"); exit(0); }
                fp = fopen(argv[1],"r");
                best_spot = cur_spot = last_spot = ftell(fp);
                cursize = bestsofarsize = 0;
                while (fgets(s,99999,fp))
                {
                for (i=0;s[i];i++) { if ((s[i]=='\r')||(s[i]=='\n')) {s[i] = (char)0;break;}} //handle dos or unix
                if (s[0]=='>')
                {
                last_spot = cur_spot;
                cursize = 0;
                continue;
                }
                cursize += strlen(s);
                if (cursize > bestsofarsize)
                {
                bestsofarsize = cursize;
                best_spot = last_spot;
                }
                cur_spot = ftell(fp);
                }

                fseek(fp,best_spot,SEEK_SET);
                fgets(s,99999,fp);
                printf("%s",s);
                while (fgets(s,99999,fp))
                {
                if (s[0] == '>') break;
                printf("%s",s);
                }
                fclose(fp);
                return 0;
                }


                [FONT="Courier New"][SIZE="1"]____

                Comment

                Latest Articles

                Collapse

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-05-2026, 10:09 AM
                0 responses
                14 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-04-2026, 08:59 AM
                0 responses
                24 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                31 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 11:40 AM
                0 responses
                23 views
                0 reactions
                Last Post SEQadmin2  
                Working...