  • yangjianhunt
    Member
    • Jun 2012
    • 14

    What tools can convert a sequence file from tabular format to FASTA format?

    Dear bioinformatics community,

    I have several deep-sequencing files in tab-delimited format, for example in this tabular format:
    TAGGAACCATTAGCCAACAA 88889
    GATTAGGCCCAAATGCAAAG 7799
    ....

    or in this tabular format:
    1 1 3233 223322 TAGGGCCTTAGGAAGCCTAA
    1 1 3234 222334 AGGTAACCGATAGAGGTCCA
    ....

    I would like to convert these files to FASTA format. It would be nice if I could use one or more of the non-sequence columns as the FASTA sequence title.

    What tools or scripts can I use to achieve this?


    These files are pretty big, around 400 MB each, so I cannot use Excel to do the job.

    (I don't have programming skills yet. I searched Google and found tab2fasta.pl within the HOMER package, and bioscripts.convert (but this doesn't work because it requires the 1st column to be the name and the 2nd to be the sequence). I haven't tried the HOMER package yet. I thought I would get some insights from you guys first.)


    Thanks a lot!

    Jian
  • dariober
    Senior Member
    • May 2010
    • 311

    #2
    Hi,

    This Python script should do it. Say your tab-separated file is tabseq.tsv:

    Code:
    A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
    C       D       3234    222334  AGGTAACCGATAGAGGTCCA
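    Column 5 is the sequence; one or more of the other columns can be used as the header. On the command line, the first number after the file name is the sequence column and the remaining numbers are the header columns, in the order they should appear: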

    Code:
    python tab2fasta.py tabseq.tsv 5 1 2 4  > tabseq.fa
    Output (tabseq.fa) will be:
    Code:
    >A_B_223322
    TAGGGCCTTAGGAAGCCTAA
    >C_D_222334
    AGGTAACCGATAGAGGTCCA
    Here's the code for tab2fasta.py:

    Code:
    #!/usr/bin/env python
    
    docstring= """
    DESCRIPTION
        Convert tabular to FASTA
    
    USAGE:
        python tab2fasta.py <tab-file> <sequence column> <header column 1> <header column 2> <header column n>  > <outfile>
    """
    
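    import sys

    # Need the input file, the sequence column and at least one header column
    if len(sys.argv) < 4:
        sys.exit('\nThree or more arguments required%s' %(docstring))

    # Column numbers on the command line are 1-based; convert to 0-based indexes
    seqix= int(sys.argv[2]) - 1
    headerix= [(int(x) - 1) for x in sys.argv[3:]]

    # Read line by line, so large files do not have to fit in memory
    with open(sys.argv[1]) as infile:
        for line in infile:
            # Remove only the line ending, so empty first/last columns are kept
            line= line.rstrip('\r\n').split('\t')
            header= '>' + '_'.join([line[i] for i in headerix])
            print(header)
            print(line[seqix])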
    I've done minimal testing so make sure it does what you want!
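
    One way to act on the "make sure it does what you want" advice is a quick comparison of the input and the output. The sketch below is only an illustration: the function name check_conversion is made up here, and it assumes one FASTA record (a header line plus a single sequence line) per input row, which is what tab2fasta.py writes.

```python
def check_conversion(tab_lines, fasta_lines, seq_col):
    """Compare tab-delimited rows with the FASTA lines produced from them.

    Returns the number of records checked, or raises ValueError on the
    first problem. seq_col is 1-based, as on the tab2fasta.py command line.
    Assumes each record is exactly two lines: header, then sequence.
    """
    fasta_iter = iter(fasta_lines)
    n = 0
    for n, row in enumerate(tab_lines, start=1):
        fields = row.rstrip('\r\n').split('\t')
        header = next(fasta_iter, None)
        seq = next(fasta_iter, None)
        if header is None or seq is None:
            raise ValueError('FASTA ends before input row %d' % n)
        if not header.startswith('>'):
            raise ValueError('Row %d: expected a header line, got %r' % (n, header))
        if seq.rstrip('\r\n') != fields[seq_col - 1]:
            raise ValueError('Row %d: sequence mismatch' % n)
    if next(fasta_iter, None) is not None:
        raise ValueError('FASTA has more lines than expected')
    return n
```

    Both arguments can be open file objects, so the files are read line by line: open tabseq.tsv and tabseq.fa with open() and call check_conversion(tab_file, fasta_file, 5).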

    Good luck
    Dario

    • essvee
      Member
      • Apr 2011
      • 11

      #3
      Or, if your file is tabseq.tsv:
      Code:
      A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
      C       D       3234    222334  AGGTAACCGATAGAGGTCCA
      you can use awk to do this easily:
      Code:
      awk '{print ">"$1"_"$2"_"$3"_"$4"\n"$5}' tabseq.tsv > seqs.fa
      $1, $2, etc. refer to the columns by number (awk splits each line on whitespace by default, which covers tabs), and you can arrange them in whichever order you'd like. For example, for the other format:
      Code:
      TAGGAACCATTAGCCAACAA  88889
      GATTAGGCCCAAATGCAAAG  7799
      you could do:
      Code:
      awk '{print ">"$2"\n"$1}' tabseq.tsv > seqs.fa
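
      For the two-column format, using the count alone as the title gives duplicate FASTA headers whenever two sequences share the same count, which some downstream tools may not accept. Below is a minimal Python sketch of the same whitespace-splitting idea that adds the row number to keep headers unique. The function name counts_to_fasta and the seq<row>_<count> naming are assumptions made for illustration, not part of either script above.

```python
def counts_to_fasta(lines):
    """Yield FASTA lines for rows of the form '<sequence> <count>'.

    Fields are split on any run of whitespace, like awk's default.
    The header is 'seq<row number>_<count>' so it stays unique even
    when two sequences have the same count (naming is an assumption).
    """
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        seq, count = fields[0], fields[1]
        yield '>seq%d_%s' % (n, count)
        yield seq


if __name__ == '__main__':
    example = ['TAGGAACCATTAGCCAACAA 88889\n', 'GATTAGGCCCAAATGCAAAG 7799\n']
    for out_line in counts_to_fasta(example):
        print(out_line)
```

      Passing an open file object instead of the example list processes the file one line at a time, so file size should not be an issue.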

      • yangjianhunt
        Member
        • Jun 2012
        • 14

        #4
        You are a tremendous help!

        Hi Dario,

        I cannot thank you enough.
        I will test the code and modify it if necessary.
        Yesterday I was watching the MIT open course on beginner programming; they use Python as the example language. It's going to take at least a month to learn programming from it. I'd like to learn it, but I want to get the immediate problem solved!

        Regards,
        Jian

        • yangjianhunt
          Member
          • Jun 2012
          • 14

          #5
          Thanks, essvee!

          Wow, the awk solution is so simple and elegant!
          I will try these as well.

          I've used awk a few times, but only through Google. I never tried to fully understand the awk language. It's great for parsing!

          Thank you thank you thank you!

          Jian

          • musta1234
            Member
            • Jun 2013
            • 10

            #6
            This is awesome... thanks for the awk and Python scripts!

            Mustapha
