Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to join the line wraps in FASTA file

    Hello

    I have got a multi fasta file contains around 10000 sequences, and they look like this:
    >r23.1
    GAAGTGGTTGTTCGTGGTGGTACTGGTTTGTCCCCCCTTCTATTGAACTTGGTTTTGTGCGTCTA
    AGTTACG
    >r35.1
    GTGGCCATTTTACCAAATCATGCCTCACCTGGTGAGAGTATTGTAATTGATGGTAAAGAGGTTGA
    AATGCTT
    I want to remove the wraps and join the sequences (linearize the sequences) so they would look like this:
    >r23.1
    GAAGTGGTTGTTCGTGGTGGTACTGGTTTGTCCCCCCTTCTATTGAACTTGGTTTTGTGCGTCTAAGTTACG
    >r35.1
    GTGGCCATTTTACCAAATCATGCCTCACCTGGTGAGAGTATTGTAATTGATGGTAAAGAGGTTGAAATGCTT
    Is there any simple command line in mac OS X terminal can do this job, or maybe any simple perl code?

    many thanks

  • #2
    Wrapped FASTA entries are a pain ).

    You can install Biopieces (www.biopieces.org) on your Mac and run:

    Code:
    read_fasta -i in.fna | write_fasta -o out.fna -x
    and Biopieces can help you in a lot of other ways.

    Comment


    • #3
      just use awk ...
      cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
      Last edited by Richard Finney; 02-18-2013, 01:11 PM.

      Comment


      • #4
        Originally posted by maasha View Post
        Wrapped FASTA entries are a pain ).
        Only if you're dealing with short reads. In general wrapping FASTA makes sense - consider long genes, or whole genomes/chromosomes.

        Comment


        • #5
          Originally posted by maubp View Post
          Only if you're dealing with short reads. In general wrapping FASTA makes sense - consider long genes, or whole genomes/chromosomes.
          Since there is no standard on how to wrap FASTA lines - like what length and what delimiter to use - wrapped FASTA entries are not readily indexed and random access is difficult. The argument of human readability is entirely lost when it comes to full genomes or other long sequences. Hence wrapped FASTA entries are stupid.

          Comment


          • #6
            60 and 80 character wrapping is common, and the delimiter is the new line character (OS specific as normal with plain text). As long as the line wrapping is consistent, they are readily indexed for random access - see for example faidx from Heng Li.

            Comment


            • #7
              Assumptions on wrapping is dangerous ... But there is no need to wrap. So dont!

              Comment


              • #8
                Originally posted by maasha View Post
                Assumptions on wrapping is dangerous ...
                Agreed, which is why code should spot data which breaks any such assumptions, and gives nice clear error messages.
                But there is no need to wrap. So dont!
                FASTA line wrapping is needed for ease of use in text editors - lots of people still put together gene sets by hand, for instance as input to building a tree, etc. FASTA is a flexible human readable plain text file format, therefore line wrapping is normal and to be expected - even if it does cause trouble.

                Look on the bright side, you could have been given sequences as a Word document with colour coding instead

                Comment


                • #9
                  Originally posted by maasha View Post
                  Since there is no standard on how to wrap FASTA lines - like what length and what delimiter to use - wrapped FASTA entries are not readily indexed and random access is difficult. The argument of human readability is entirely lost when it comes to full genomes or other long sequences. Hence wrapped FASTA entries are stupid.
                  Nonsense! People have been easily managing wrapped FASTAs for a good long time now; there is nothing very challenging about it. The argument of random access is lost since that was not a design objective of the format.

                  Comment


                  • #10
                    Originally posted by maubp View Post
                    Agreed, which is why code should spot data which breaks any such assumptions, and gives nice clear error messages.

                    FASTA line wrapping is needed for ease of use in text editors - lots of people still put together gene sets by hand, for instance as input to building a tree, etc. FASTA is a flexible human readable plain text file format, therefore line wrapping is normal and to be expected - even if it does cause trouble.

                    Look on the bright side, you could have been given sequences as a Word document with colour coding instead
                    Using a text editor on long wrapped sequences makes it impossible to use the editors search function to locate a subsequence. Also, you cannot jump to a specific position - I would say that these are more argument against wrapping. Editing huge sequences with a text editor is probably not wise (I do it on rare occasions).

                    Word docs - oh dear - I have been so fortunate not to worry about those for quite a while

                    Comment


                    • #11
                      Actually, wrapping might make it *easier* to find a sub sequence; you can find a sequence that previously wrapped. A wrapped fasta piped to grep would find the subsequence being sought. Of course a custom search or wrap aware search could find it, but using the canonical Unix command line utilities is often a good solution. In practice, you might want the label and sequence ... or do you want all sequences only ... or just that it exists ... or do just want a random one? The best solution ... a custom program or a cobbled together "cat/awk/grep" job all depends on your requirements.

                      Hacking a fasta with vi,emacs or (ugh) Word is for small files and is probably fast enough for those little one-offs.
                      Last edited by Richard Finney; 02-19-2013, 05:18 AM.

                      Comment


                      • #12
                        Thanks for this very informative argument!

                        Richard Finney, the code you gave me has perfectly worked, many thanks . i would appreciate a very small explanation on how does this code work, and what would be the best way for me to start learning how to make up such codes like this?

                        Yours

                        Comment


                        • #13
                          sure.

                          Read the documentation for the commands like awk,sed,grep,sort,find,uniq and check out "commandline fu" at http://www.commandlinefu.com/commands/browse (commandlinefu.com) for inspirations. The rip an mp3 from Youtube is quite useful ... oh and so is the sorting a hexdump to troubleshoot a system panic when you hack the inline assembly just before a system call while optimizing your terrabytes of cancer data cruncher. Check it out.

                          The code with comments and formatted "nicer" is this ...
                          Code:
                          ### The start { and then end } is code that gets executed for every line.
                          {
                                  #### '#' means comment for rest of line
                                  #### note that printf and print behave differently, print has implicit linefeed
                          
                                  #### so below this to end squiggle gets executed for every line ...
                              if (substr($0,1,1)==">")  ## if it's a ">" fasta header line; first char is '>'
                              {
                                  if (p) #### p is just a counter, it's implicitly initialized to zero
                                  {   # so p is not zero ...
                                      print "\n";   ### output a linefeed for the PREVIOUS fasta record
                                  }
                                  print $0   ## output fasta header WITH linefeed
                              }
                              else
                                  printf("%s",$0);    #output line without linefeed
                              p++;  # we're not in kansas anymore nor are we on the first line (p!=0)
                          }
                          
                          END
                          {   ####  this "END" section only gets executed at the end of processing
                              print "\n"     ## output linefeed , don't leave the last line without a closing linefeed
                          }

                          Comment


                          • #14
                            Thanks very much, Richard and Charlie

                            Comment


                            • #15
                              Honestly learning QBASIC for working with simple bioinformatics scripting seems a rather idiosyncratic choice - it would be far more useful to learn Perl, Python or Ruby which are still in active use, and also cross-platform. Picking whichever your local experts use is a practical decision.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Investigating the Gut Microbiome Through Diet and Spatial Biology
                                by seqadmin




                                The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                                02-24-2025, 06:31 AM
                              • seqadmin
                                Quality Control Essentials for Next-Generation Sequencing Workflows
                                by seqadmin




                                Like all molecular biology applications, next-generation sequencing (NGS) workflows require diligent quality control (QC) measures to ensure accurate and reproducible results. Proper QC begins at nucleic acid extraction and continues all the way through to data analysis. This article outlines the key QC steps in an NGS workflow, along with the commonly used tools and techniques.

                                Nucleic Acid Quality Control
                                Preparing for NGS starts with isolating the...
                                02-10-2025, 01:58 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 01:15 PM
                              0 responses
                              2 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-28-2025, 12:58 PM
                              0 responses
                              79 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-24-2025, 02:48 PM
                              0 responses
                              409 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 02-21-2025, 02:46 PM
                              0 responses
                              231 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X