  • yangjianhunt
    Member
    • Jun 2012
    • 14

    What tools can convert sequence file from tabular format to fasta format?

    Dear bioinformatics community,

    I have several deepseq files in tab-delimited format, for example in this tabular format:
    TAGGAACCATTAGCCAACAA 88889
    GATTAGGCCCAAATGCAAAG 7799
    ....

    or in this tabular format:
    1 1 3233 223322 TAGGGCCTTAGGAAGCCTAA
    1 1 3234 222334 AGGTAACCGATAGAGGTCCA
    ....

    I would like to convert these files to FASTA format. It would be nice if I could use one or more of the non-sequence columns as the FASTA sequence title.

    What tools or scripts can I use to achieve this?


    - These files are pretty big, around 400 MB, so I cannot use Excel to do the job.

    (I don't have programming skills yet. I searched Google and found tab2fasta.pl within the HOMER package, and bioscripts.convert (but this doesn't work because it requires the 1st column to be the name and the 2nd to be the sequence). I haven't tried the HOMER package yet. I thought I would get some insights from you guys first.)


    Thanks a lot!

    Jian
  • dariober
    Senior Member
    • May 2010
    • 311

    #2
    Hi,

    This Python script should do it. Say your tab-separated file is tabseq.tsv:

    Code:
    A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
    C       D       3234    222334  AGGTAACCGATAGAGGTCCA
    Column 5 is the sequence; one or more of the other columns are used as the header.

    Code:
    python tab2fasta.py tabseq.tsv 5 1 2 4  > tabseq.fa
    Output (tabseq.fa) will be:
    Code:
    >A_B_223322
    TAGGGCCTTAGGAAGCCTAA
    >C_D_222334
    AGGTAACCGATAGAGGTCCA
    Here's the code for tab2fasta.py:

    Code:
    #!/usr/bin/env python
    
    docstring= """
    DESCRIPTION
        Convert tabular to FASTA
    
    USAGE:
        python tab2fasta.py <tab-file> <sequence column> <header column 1> <header column 2> <header column n>  > <outfile>
    """
    
    import sys
    if len(sys.argv) < 4:
        sys.exit('\nThree or more arguments required%s' %(docstring))
    
    # Column numbers on the command line are 1-based; convert to 0-based indexes
    seqix= int(sys.argv[2]) - 1
    headerix= [(int(x) - 1) for x in sys.argv[3:]]
    
    with open(sys.argv[1]) as infile:
        for line in infile:
            # Strip only the line ending so empty leading/trailing fields keep their position
            line= line.rstrip('\r\n').split('\t')
            if line == ['']:
                continue  # skip blank lines
            header= '>' + '_'.join([line[i] for i in headerix])
            print(header)
            print(line[seqix])
    I've done minimal testing so make sure it does what you want!

    Good luck
    Dario
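
    One way to act on the "make sure it does what you want" advice is to scan the output before using it. The sketch below is not part of the script above; check_fasta is a hypothetical helper, and it assumes two-line records (one header, one sequence) and a nucleotide alphabet of A, C, G, T and N.

```python
def check_fasta(lines, alphabet="ACGTN"):
    """Return (n_records, problems) for an iterable of FASTA lines.

    Assumes two-line records: a '>' header followed by one sequence line.
    """
    n_records = 0
    problems = []
    expect_header = True
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if expect_header:
            if not line.startswith(">") or len(line) == 1:
                problems.append((lineno, "expected a non-empty '>' header"))
            else:
                n_records += 1
        else:
            # Flag empty sequences and any character outside the assumed alphabet
            if not line or set(line.upper()) - set(alphabet):
                problems.append((lineno, "empty or non-nucleotide sequence"))
        expect_header = not expect_header
    return n_records, problems

# Example use on a real file (file name taken from the post above):
#     with open("tabseq.fa") as fh:
#         n_records, problems = check_fasta(fh)
```

    The record count should match the number of lines in the input table, and the problems list should be empty. Because the file is read line by line, this should cope with a 400 MB file without loading it all into memory.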


    • essvee
      Member
      • Apr 2011
      • 11

      #3
      or if your file is tabseq.tsv:
      Code:
      A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
      C       D       3234    222334  AGGTAACCGATAGAGGTCCA
      you can use awk to do this easily:
      Code:
      awk '{print ">"$1"_"$2"_"$3"_"$4"\n"$5}' tabseq.tsv > seqs.fa
      $1, $2, etc. refer to the columns by number, and you can rearrange them in whichever order you like. For example, for the other format:
      Code:
      TAGGAACCATTAGCCAACAA  88889
      GATTAGGCCCAAATGCAAAG  7799
      you could do:
      Code:
      awk '{print ">"$2"\n"$1}' tabseq.tsv > seqs.fa
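
      One thing to watch with the two-column format: if the second column is a read count, different sequences can share the same count, so the FASTA titles may not be unique. A minimal Python sketch that adds the row number to each title is shown below. The function name seq_count_to_fasta and the ">seqN_xCOUNT" naming scheme are assumptions for illustration, not an established convention.

```python
def seq_count_to_fasta(lines, prefix="seq"):
    """Yield FASTA lines from rows of 'sequence <whitespace> count'.

    The input row number is added to each title so that titles stay unique.
    """
    for n, line in enumerate(lines, start=1):
        fields = line.split()  # splits on tabs or spaces, like awk's default
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        seq, count = fields[0], fields[1]
        yield ">%s%d_x%s" % (prefix, n, count)
        yield seq

# Example use on a real file (file names are placeholders):
#     with open("tabseq.tsv") as src, open("seqs.fa", "w") as out:
#         for fasta_line in seq_count_to_fasta(src):
#             out.write(fasta_line + "\n")
```

      For the two example rows above, this would give the titles >seq1_x88889 and >seq2_x7799.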


      • yangjianhunt
        Member
        • Jun 2012
        • 14

        #4
        You are a tremendous help!

        Hi Dario,

        I cannot thank you enough.
        I will test the code and modify it if necessary.
        Yesterday I was watching the MIT open course on beginner programming; they use Python as the example language. It's going to take at least a month to learn programming from it. I'd like to learn it, but I want to get the immediate problem solved!

        Regards,
        Jian


        • yangjianhunt
          Member
          • Jun 2012
          • 14

          #5
          Thanks, essvee!

          Wow, the awk solution is so simple and elegant!
          I will try these as well.

          I've used awk a few times, but only through Google. I never tried to fully understand the awk language. It's great for parsing!

          Thank you thank you thank you!

          Jian


          • musta1234
            Member
            • Jun 2013
            • 10

            #6
            This is awesome.... thanks for the awk and python scripts

            Mustapha
