SEQanswers

07-31-2012, 03:05 PM   #1
yangjianhunt
Member
 
Location: Houston TX USA

Join Date: Jun 2012
Posts: 13
What tools can convert a sequence file from tabular format to FASTA format?

Dear bioinformatics community,

I have several deep-sequencing files in tab-delimited format. Some of them look like this:
TAGGAACCATTAGCCAACAA 88889
GATTAGGCCCAAATGCAAAG 7799
....

Others look like this:
1 1 3233 223322 TAGGGCCTTAGGAAGCCTAA
1 1 3234 222334 AGGTAACCGATAGAGGTCCA
....

I would like to convert these files to FASTA format. It would be nice if I could use one or more of the non-sequence columns as the FASTA sequence title.

What tools or scripts can I use to achieve this?


These files are pretty big, around 400 MB each, so I cannot use Excel to do the job.

(I don't have programming skills yet. I searched Google and found tab2fasta.pl in the HOMER package, and bioscripts.convert, which doesn't work for me because it requires the first column to be the name and the second to be the sequence. I haven't tried the HOMER package yet. I thought I would get some insights from you guys first.)


Thanks a lot!

Jian
08-01-2012, 01:16 AM   #2
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311

Hi,

This Python script should do it. Say your tab-separated file is tabseq.tsv:

Code:
A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
C       D       3234    222334  AGGTAACCGATAGAGGTCCA
Column 5 holds the sequence; one or more of the other columns can be used as the header.

Code:
python tab2fasta.py tabseq.tsv 5 1 2 4  > tabseq.fa
Output (tabseq.fa) will be:
Code:
>A_B_223322
TAGGGCCTTAGGAAGCCTAA
>C_D_222334
AGGTAACCGATAGAGGTCCA
Here's the code for tab2fasta.py:

Code:
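#!/usr/bin/env python

docstring = """
DESCRIPTION
    Convert tabular to FASTA

USAGE:
    python tab2fasta.py <tab-file> <sequence column> <header column 1> <header column 2> <header column n>  > <outfile>

    Column numbers are 1-based.
"""

import sys

# Need at least: input file, sequence column, and one header column.
if len(sys.argv) < 4:
    sys.exit('\nThree or more arguments required%s' % (docstring))

# Convert the 1-based column numbers from the command line to 0-based indexes.
seqix = int(sys.argv[2]) - 1
headerix = [int(x) - 1 for x in sys.argv[3:]]

with open(sys.argv[1]) as infile:
    for line in infile:
        # Remove only the line ending: strip() would also drop leading tabs
        # and shift the columns when the first field is empty.
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        fields = line.split('\t')
        print('>' + '_'.join([fields[i] for i in headerix]))
        print(fields[seqix])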
I've done minimal testing so make sure it does what you want!
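
One way to do that check is sketched below. It is only a minimal sketch, not part of tab2fasta.py: the function name check_fasta is made up for illustration, and it assumes every sequence sits on a single line (as in the output above) and that the sequence column is given 1-based.

```python
def check_fasta(tab_lines, fasta_lines, seq_col=5):
    """Compare tab-separated input rows with the FASTA records made from them.

    Returns a list of problem descriptions; an empty list means every
    sequence in the input appears, in order, in the FASTA output.
    Assumes one sequence line per FASTA record.
    """
    # Sequences we expect, taken from the (1-based) sequence column.
    expected = [row.rstrip("\r\n").split("\t")[seq_col - 1]
                for row in tab_lines if row.strip()]
    # Sequences actually present: every non-blank line that is not a header.
    found = [line.rstrip("\r\n") for line in fasta_lines
             if line.strip() and not line.startswith(">")]

    problems = []
    if len(expected) != len(found):
        problems.append("record count differs: %d input rows, %d FASTA sequences"
                        % (len(expected), len(found)))
    for n, (want, got) in enumerate(zip(expected, found), start=1):
        if want != got:
            problems.append("record %d: expected %s, found %s" % (n, want, got))
    return problems
```

Calling check_fasta(open("tabseq.tsv"), open("tabseq.fa")) should return an empty list when the conversion went as intended. It reads both files into memory, so with files of several hundred megabytes it is probably better to run it on an excerpt, for example the first few thousand lines.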

Good luck
Dario
08-01-2012, 06:40 AM   #3
essvee
Member
 
Location: Guelph

Join Date: Apr 2011
Posts: 11

Or, if your file is tabseq.tsv:
Code:
A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
C       D       3234    222334  AGGTAACCGATAGAGGTCCA
you can use awk to do this easily:
Code:
awk '{print ">"$1"_"$2"_"$3"_"$4"\n"$5}' tabseq.tsv > seqs.fa
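$1, $2, etc. refer to the columns by number, and you can rearrange them in any order you like. Note that awk splits on any whitespace by default; if a field can contain spaces, add -F'\t' so that only tabs separate the columns. For the other format, for example: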
Code:
TAGGAACCATTAGCCAACAA  88889
GATTAGGCCCAAATGCAAAG  7799
you could do:
Code:
awk '{print ">"$2"\n"$1}' tabseq.tsv > seqs.fa
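
One caveat with the two-column format: the count is not necessarily unique, so two sequences with the same count would get the same header, and some downstream tools expect unique FASTA headers. A minimal Python sketch of one workaround follows; the function name two_column_to_fasta and the "seq" prefix are made up for illustration, and it assumes the sequence is in column 1 and the count in column 2.

```python
def two_column_to_fasta(lines, prefix="seq"):
    """Yield FASTA lines for rows of the form 'sequence<whitespace>count'.

    The input line number is added to each header so that headers stay
    unique even when two sequences share the same count.
    """
    for n, line in enumerate(lines, start=1):
        fields = line.split()  # split on any whitespace, like awk's default
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        seq, count = fields[0], fields[1]
        yield ">%s%d_%s" % (prefix, n, count)
        yield seq
```

Because it is a generator that handles one line at a time, it can be fed an open file directly and the results written out line by line, so a file of a few hundred megabytes never has to fit in memory.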
08-01-2012, 06:41 AM   #4
yangjianhunt
Member
 
Location: Houston TX USA

Join Date: Jun 2012
Posts: 13
You are a tremendous help!

Hi Dario,

I cannot thank you enough.
I will test the code and modify it if necessary.
Yesterday I was watching the MIT open course on beginner programming; they use Python as the example language. It's going to take at least a month to learn programming that way. I'd like to learn it, but I want to get the immediate problem solved first!

Regards,
Jian
08-01-2012, 06:46 AM   #5
yangjianhunt
Member
 
Location: Houston TX USA

Join Date: Jun 2012
Posts: 13
Thanks, essvee!

Wow, the awk solution is so simple and elegant!
I will try these as well.

I've used awk a few times, but only through Google. I never tried to fully understand the awk language. It's great for parsing!

Thank you thank you thank you!

Jian
03-26-2014, 02:48 PM   #6
musta1234
Member
 
Location: Earth

Join Date: Jun 2013
Posts: 10

This is awesome... thanks for the awk and Python scripts.

Mustapha