Seqanswers Leaderboard Ad

**maasha** · 02-18-2013, 12:10 PM

Wrapped FASTA entries are a pain

).

You can install Biopieces (www.biopieces.org) on your Mac and run:

Code:

read_fasta -i in.fna | write_fasta -o out.fna -x

and Biopieces can help you in a lot of other ways.

**Richard Finney** · 02-18-2013, 01:06 PM

just use awk ...
cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

**maubp** · 02-19-2013, 03:45 AM

Originally posted by maasha View Post

Wrapped FASTA entries are a pain

).

Only if you're dealing with short reads. In general wrapping FASTA makes sense - consider long genes, or whole genomes/chromosomes.

**maasha** · 02-19-2013, 03:53 AM

Originally posted by maubp View Post

Only if you're dealing with short reads. In general wrapping FASTA makes sense - consider long genes, or whole genomes/chromosomes.

Since there is no standard on how to wrap FASTA lines - like what length and what delimiter to use - wrapped FASTA entries are not readily indexed and random access is difficult. The argument of human readability is entirely lost when it comes to full genomes or other long sequences. Hence wrapped FASTA entries are stupid.

**maubp** · 02-19-2013, 03:58 AM

60 and 80 character wrapping is common, and the delimiter is the new line character (OS specific as normal with plain text). As long as the line wrapping is consistent, they are readily indexed for random access - see for example faidx from Heng Li.

**maasha** · 02-19-2013, 04:17 AM

Assumptions on wrapping is dangerous ... But there is no need to wrap. So dont!

**maubp** · 02-19-2013, 04:26 AM

Originally posted by maasha View Post

Assumptions on wrapping is dangerous ...

Agreed, which is why code should spot data which breaks any such assumptions, and gives nice clear error messages.

But there is no need to wrap. So dont!

FASTA line wrapping is needed for ease of use in text editors - lots of people still put together gene sets by hand, for instance as input to building a tree, etc. FASTA is a flexible human readable plain text file format, therefore line wrapping is normal and to be expected - even if it does cause trouble.

Look on the bright side, you could have been given sequences as a Word document with colour coding instead

**kmcarr** · 02-19-2013, 04:27 AM

Originally posted by maasha View Post

Since there is no standard on how to wrap FASTA lines - like what length and what delimiter to use - wrapped FASTA entries are not readily indexed and random access is difficult. The argument of human readability is entirely lost when it comes to full genomes or other long sequences. Hence wrapped FASTA entries are stupid.

Nonsense! People have been easily managing wrapped FASTAs for a good long time now; there is nothing very challenging about it. The argument of random access is lost since that was not a design objective of the format.

**maasha** · 02-19-2013, 04:53 AM

Originally posted by maubp View Post

Agreed, which is why code should spot data which breaks any such assumptions, and gives nice clear error messages.

FASTA line wrapping is needed for ease of use in text editors - lots of people still put together gene sets by hand, for instance as input to building a tree, etc. FASTA is a flexible human readable plain text file format, therefore line wrapping is normal and to be expected - even if it does cause trouble.

Look on the bright side, you could have been given sequences as a Word document with colour coding instead

Using a text editor on long wrapped sequences makes it impossible to use the editors search function to locate a subsequence. Also, you cannot jump to a specific position - I would say that these are more argument against wrapping. Editing huge sequences with a text editor is probably not wise (I do it on rare occasions).

Word docs - oh dear - I have been so fortunate not to worry about those for quite a while

**Richard Finney** · 02-19-2013, 05:04 AM

Actually, wrapping might make it *easier* to find a sub sequence; you can find a sequence that previously wrapped. A wrapped fasta piped to grep would find the subsequence being sought. Of course a custom search or wrap aware search could find it, but using the canonical Unix command line utilities is often a good solution. In practice, you might want the label and sequence ... or do you want all sequences only ... or just that it exists ... or do just want a random one? The best solution ... a custom program or a cobbled together "cat/awk/grep" job all depends on your requirements.

Hacking a fasta with vi,emacs or (ugh) Word is for small files and is probably fast enough for those little one-offs.

**Fad2012** · 02-20-2013, 06:03 AM

Thanks for this very informative argument!

Richard Finney, the code you gave me has perfectly worked, many thanks

. i would appreciate a very small explanation on how does this code work, and what would be the best way for me to start learning how to make up such codes like this?

Yours

**Richard Finney** · 02-20-2013, 09:34 AM

sure.

Read the documentation for the commands like awk,sed,grep,sort,find,uniq and check out "commandline fu" at http://www.commandlinefu.com/commands/browse (commandlinefu.com) for inspirations. The rip an mp3 from Youtube is quite useful ... oh and so is the sorting a hexdump to troubleshoot a system panic when you hack the inline assembly just before a system call while optimizing your terrabytes of cancer data cruncher. Check it out.

The code with comments and formatted "nicer" is this ...

Code:

### The start { and then end } is code that gets executed for every line.
{
        #### '#' means comment for rest of line
        #### note that printf and print behave differently, print has implicit linefeed

        #### so below this to end squiggle gets executed for every line ...
    if (substr($0,1,1)==">")  ## if it's a ">" fasta header line; first char is '>'
    {
        if (p) #### p is just a counter, it's implicitly initialized to zero
        {   # so p is not zero ...
            print "\n";   ### output a linefeed for the PREVIOUS fasta record
        }
        print $0   ## output fasta header WITH linefeed
    }
    else
        printf("%s",$0);    #output line without linefeed
    p++;  # we're not in kansas anymore nor are we on the first line (p!=0)
}

END
{   ####  this "END" section only gets executed at the end of processing
    print "\n"     ## output linefeed , don't leave the last line without a closing linefeed
}

**Fad2012** · 02-21-2013, 03:58 AM

Thanks very much, Richard and Charlie

**maubp** · 02-24-2013, 07:41 AM

Honestly learning QBASIC for working with simple bioinformatics scripting seems a rather idiosyncratic choice - it would be far more useful to learn Perl, Python or Ruby which are still in active use, and also cross-platform. Picking whichever your local experts use is a practical decision.

Topics	Statistics	Last Post
TIGR Systems Offer a Compact Alternative to CRISPR for Gene Editing by seqadmin Started by seqadmin, Today, 01:15 PM	0 responses 2 views 0 likes	Last Post by seqadmin Today, 01:15 PM
Highlights from AGBT 2025 – Part II by seqadmin Started by seqadmin, 02-28-2025, 12:58 PM	0 responses 79 views 0 likes	Last Post by seqadmin 02-28-2025, 12:58 PM
Highlights from AGBT 2025 – Part I by seqadmin Started by seqadmin, 02-24-2025, 02:48 PM	0 responses 409 views 0 likes	Last Post by seqadmin 02-24-2025, 02:48 PM
Selecting the Right AI Model for Bioinformatics Research by seqadmin Started by seqadmin, 02-21-2025, 02:46 PM	0 responses 231 views 0 likes	Last Post by seqadmin 02-21-2025, 02:46 PM

Seqanswers Leaderboard Ad

Announcement

How to join the line wraps in FASTA file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News