Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Fad2012
    Member
    • Sep 2012
    • 62

    How to join the line wraps in FASTA file

    Hello

    I have got a multi fasta file contains around 10000 sequences, and they look like this:
    >r23.1
    GAAGTGGTTGTTCGTGGTGGTACTGGTTTGTCCCCCCTTCTATTGAACTTGGTTTTGTGCGTCTA
    AGTTACG
    >r35.1
    GTGGCCATTTTACCAAATCATGCCTCACCTGGTGAGAGTATTGTAATTGATGGTAAAGAGGTTGA
    AATGCTT
    I want to remove the wraps and join the sequences (linearize the sequences) so they would look like this:
    >r23.1
    GAAGTGGTTGTTCGTGGTGGTACTGGTTTGTCCCCCCTTCTATTGAACTTGGTTTTGTGCGTCTAAGTTACG
    >r35.1
    GTGGCCATTTTACCAAATCATGCCTCACCTGGTGAGAGTATTGTAATTGATGGTAAAGAGGTTGAAATGCTT
    Is there any simple command line in mac OS X terminal can do this job, or maybe any simple perl code?

    many thanks
  • maasha
    Senior Member
    • Apr 2009
    • 153

    #2
    Wrapped FASTA entries are a pain ).

    You can install Biopieces (www.biopieces.org) on your Mac and run:

    Code:
    read_fasta -i in.fna | write_fasta -o out.fna -x
    and Biopieces can help you in a lot of other ways.

    Comment

    • Richard Finney
      Senior Member
      • Feb 2009
      • 701

      #3
      just use awk ...
      cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
      Last edited by Richard Finney; 02-18-2013, 01:11 PM.

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Originally posted by maasha View Post
        Wrapped FASTA entries are a pain ).
        Only if you're dealing with short reads. In general wrapping FASTA makes sense - consider long genes, or whole genomes/chromosomes.

        Comment

        • maasha
          Senior Member
          • Apr 2009
          • 153

          #5
          Originally posted by maubp View Post
          Only if you're dealing with short reads. In general wrapping FASTA makes sense - consider long genes, or whole genomes/chromosomes.
          Since there is no standard on how to wrap FASTA lines - like what length and what delimiter to use - wrapped FASTA entries are not readily indexed and random access is difficult. The argument of human readability is entirely lost when it comes to full genomes or other long sequences. Hence wrapped FASTA entries are stupid.

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            60 and 80 character wrapping is common, and the delimiter is the new line character (OS specific as normal with plain text). As long as the line wrapping is consistent, they are readily indexed for random access - see for example faidx from Heng Li.

            Comment

            • maasha
              Senior Member
              • Apr 2009
              • 153

              #7
              Assumptions on wrapping is dangerous ... But there is no need to wrap. So dont!

              Comment

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Originally posted by maasha View Post
                Assumptions on wrapping is dangerous ...
                Agreed, which is why code should spot data which breaks any such assumptions, and gives nice clear error messages.
                But there is no need to wrap. So dont!
                FASTA line wrapping is needed for ease of use in text editors - lots of people still put together gene sets by hand, for instance as input to building a tree, etc. FASTA is a flexible human readable plain text file format, therefore line wrapping is normal and to be expected - even if it does cause trouble.

                Look on the bright side, you could have been given sequences as a Word document with colour coding instead

                Comment

                • kmcarr
                  Senior Member
                  • May 2008
                  • 1181

                  #9
                  Originally posted by maasha View Post
                  Since there is no standard on how to wrap FASTA lines - like what length and what delimiter to use - wrapped FASTA entries are not readily indexed and random access is difficult. The argument of human readability is entirely lost when it comes to full genomes or other long sequences. Hence wrapped FASTA entries are stupid.
                  Nonsense! People have been easily managing wrapped FASTAs for a good long time now; there is nothing very challenging about it. The argument of random access is lost since that was not a design objective of the format.

                  Comment

                  • maasha
                    Senior Member
                    • Apr 2009
                    • 153

                    #10
                    Originally posted by maubp View Post
                    Agreed, which is why code should spot data which breaks any such assumptions, and gives nice clear error messages.

                    FASTA line wrapping is needed for ease of use in text editors - lots of people still put together gene sets by hand, for instance as input to building a tree, etc. FASTA is a flexible human readable plain text file format, therefore line wrapping is normal and to be expected - even if it does cause trouble.

                    Look on the bright side, you could have been given sequences as a Word document with colour coding instead
                    Using a text editor on long wrapped sequences makes it impossible to use the editors search function to locate a subsequence. Also, you cannot jump to a specific position - I would say that these are more argument against wrapping. Editing huge sequences with a text editor is probably not wise (I do it on rare occasions).

                    Word docs - oh dear - I have been so fortunate not to worry about those for quite a while

                    Comment

                    • Richard Finney
                      Senior Member
                      • Feb 2009
                      • 701

                      #11
                      Actually, wrapping might make it *easier* to find a sub sequence; you can find a sequence that previously wrapped. A wrapped fasta piped to grep would find the subsequence being sought. Of course a custom search or wrap aware search could find it, but using the canonical Unix command line utilities is often a good solution. In practice, you might want the label and sequence ... or do you want all sequences only ... or just that it exists ... or do just want a random one? The best solution ... a custom program or a cobbled together "cat/awk/grep" job all depends on your requirements.

                      Hacking a fasta with vi,emacs or (ugh) Word is for small files and is probably fast enough for those little one-offs.
                      Last edited by Richard Finney; 02-19-2013, 05:18 AM.

                      Comment

                      • Fad2012
                        Member
                        • Sep 2012
                        • 62

                        #12
                        Thanks for this very informative argument!

                        Richard Finney, the code you gave me has perfectly worked, many thanks . i would appreciate a very small explanation on how does this code work, and what would be the best way for me to start learning how to make up such codes like this?

                        Yours

                        Comment

                        • Richard Finney
                          Senior Member
                          • Feb 2009
                          • 701

                          #13
                          sure.

                          Read the documentation for the commands like awk,sed,grep,sort,find,uniq and check out "commandline fu" at http://www.commandlinefu.com/commands/browse (commandlinefu.com) for inspirations. The rip an mp3 from Youtube is quite useful ... oh and so is the sorting a hexdump to troubleshoot a system panic when you hack the inline assembly just before a system call while optimizing your terrabytes of cancer data cruncher. Check it out.

                          The code with comments and formatted "nicer" is this ...
                          Code:
                          ### The start { and then end } is code that gets executed for every line.
                          {
                                  #### '#' means comment for rest of line
                                  #### note that printf and print behave differently, print has implicit linefeed
                          
                                  #### so below this to end squiggle gets executed for every line ...
                              if (substr($0,1,1)==">")  ## if it's a ">" fasta header line; first char is '>'
                              {
                                  if (p) #### p is just a counter, it's implicitly initialized to zero
                                  {   # so p is not zero ...
                                      print "\n";   ### output a linefeed for the PREVIOUS fasta record
                                  }
                                  print $0   ## output fasta header WITH linefeed
                              }
                              else
                                  printf("%s",$0);    #output line without linefeed
                              p++;  # we're not in kansas anymore nor are we on the first line (p!=0)
                          }
                          
                          END
                          {   ####  this "END" section only gets executed at the end of processing
                              print "\n"     ## output linefeed , don't leave the last line without a closing linefeed
                          }

                          Comment

                          • Fad2012
                            Member
                            • Sep 2012
                            • 62

                            #14
                            Thanks very much, Richard and Charlie

                            Comment

                            • maubp
                              Peter (Biopython etc)
                              • Jul 2009
                              • 1544

                              #15
                              Honestly learning QBASIC for working with simple bioinformatics scripting seems a rather idiosyncratic choice - it would be far more useful to learn Perl, Python or Ruby which are still in active use, and also cross-platform. Picking whichever your local experts use is a practical decision.

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 06:09 AM
                              0 responses
                              15 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              34 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              39 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              47 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...