  • What tools can convert sequence file from tabular format to fasta format?

    Dear bioinformatics community,

    I have several deepseq files in tab-delimited format. Some are in this tabular format:
    TAGGAACCATTAGCCAACAA 88889
    GATTAGGCCCAAATGCAAAG 7799
    ....

    or in this tabular format:
    1 1 3233 223322 TAGGGCCTTAGGAAGCCTAA
    1 1 3234 222334 AGGTAACCGATAGAGGTCCA
    ....

    I would like to convert these files to FASTA format. It would be nice if I could use one or more of the non-sequence columns as the FASTA sequence title.

    What tools or scripts can I use to achieve this?


    These files are pretty big, around 400 MB, so I cannot use Excel to do the job.

    (I don't have programming skills yet. I searched Google and found tab2fasta.pl within the HOMER package, and bioscripts.convert, but the latter doesn't work because it requires the 1st column to be the name and the 2nd to be the sequence. I haven't tried the HOMER package yet. I thought I would get some insights from you guys first.)


    Thanks a lot!

    Jian

  • #2
    Hi,

    This Python script should do it. Say your tab-separated file is tabseq.tsv:

    Code:
    A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
    C       D       3234    222334  AGGTAACCGATAGAGGTCCA
    Column 5 is the sequence; one or more of the other columns are used as the header.

    Code:
    python tab2fasta.py tabseq.tsv 5 1 2 4  > tabseq.fa
    Output (tabseq.fa) will be:
    Code:
    >A_B_223322
    TAGGGCCTTAGGAAGCCTAA
    >C_D_222334
    AGGTAACCGATAGAGGTCCA
    Here's the code for tab2fasta.py:

    Code:
    #!/usr/bin/env python

    docstring = """
    DESCRIPTION
        Convert tabular to FASTA

    USAGE:
        python tab2fasta.py <tab-file> <sequence column> <header column 1> <header column 2> <header column n>  > <outfile>
    """

    import sys

    if len(sys.argv) < 4:
        sys.exit('\nThree or more arguments required%s' % docstring)

    # Column numbers on the command line are 1-based; convert to 0-based indexes.
    seqix = int(sys.argv[2]) - 1
    headerix = [int(x) - 1 for x in sys.argv[3:]]

    with open(sys.argv[1]) as infile:
        for line in infile:
            # Strip only the line ending so empty first/last columns keep their positions.
            fields = line.rstrip('\r\n').split('\t')
            if fields == ['']:
                continue  # skip blank lines
            print('>' + '_'.join(fields[i] for i in headerix))
            print(fields[seqix])
    I've done minimal testing so make sure it does what you want!
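One simple way to check a conversion like this is to confirm that the FASTA output has one header per input row. The following is only a minimal sketch: `count_records` is a hypothetical helper name, not part of the script above, and the in-memory strings stand in for the real files.

```python
import io

def count_records(tsv_lines, fasta_lines):
    """Return (number of non-empty TSV rows, number of FASTA header lines)."""
    n_rows = sum(1 for line in tsv_lines if line.strip())
    n_headers = sum(1 for line in fasta_lines if line.startswith(">"))
    return n_rows, n_headers

# Small in-memory stand-ins; for real data pass open("tabseq.tsv") and open("tabseq.fa").
tsv = io.StringIO(
    "A\tB\t3233\t223322\tTAGGGCCTTAGGAAGCCTAA\n"
    "C\tD\t3234\t222334\tAGGTAACCGATAGAGGTCCA\n"
)
fasta = io.StringIO(
    ">A_B_223322\nTAGGGCCTTAGGAAGCCTAA\n"
    ">C_D_222334\nAGGTAACCGATAGAGGTCCA\n"
)
rows, headers = count_records(tsv, fasta)
print(rows, headers)  # the two counts should match
```

Both inputs are read line by line, so the check should also be usable on files of several hundred megabytes.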

    Good luck
    Dario



    • #3
      Or, if your file is tabseq.tsv:
      Code:
      A       B       3233    223322  TAGGGCCTTAGGAAGCCTAA
      C       D       3234    222334  AGGTAACCGATAGAGGTCCA
      you can use awk to do this easily:
      Code:
      awk '{print ">"$1"_"$2"_"$3"_"$4"\n"$5}' tabseq.tsv > seqs.fa
      $1, $2, etc. refer to the columns by number; you can rearrange them in whichever order you'd like. For example, for the other format:
      Code:
      TAGGAACCATTAGCCAACAA  88889
      GATTAGGCCCAAATGCAAAG  7799
      you could do:
      Code:
      awk '{print ">"$2"\n"$1}' tabseq.tsv > seqs.fa
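For the two-column format, using only the count as the header can give the same title to different sequences whenever two rows share a count. A rough Python sketch of one workaround is to add a running row number to each header. The function name `two_column_to_fasta` and the `seqN_countM` header style are illustrative assumptions, not something from the posts above.

```python
def two_column_to_fasta(lines):
    """Yield FASTA lines from rows of 'sequence<whitespace>count'."""
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        seq, count = fields[0], fields[1]
        # Hypothetical header layout: row number plus the count column.
        yield ">seq%d_count%s" % (n, count)
        yield seq

rows = ["TAGGAACCATTAGCCAACAA\t88889", "GATTAGGCCCAAATGCAAAG\t7799"]
print("\n".join(two_column_to_fasta(rows)))
```

For a real file, the list could be replaced with an open file handle, which is read one line at a time.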



      • #4
        You are a tremendous help!

        Hi Dario,

        I cannot thank you enough.
        I will test the code and modify it if necessary.
        Yesterday I was watching the MIT OpenCourseWare course on beginner programming, which uses Python as the example language. It's going to take at least a month to learn programming that way. I'd like to learn it, but I want to get the immediate problem solved!

        Regards,
        Jian



        • #5
          Thanks, essvee!

          Wow, the awk solution is so simple and elegant!
          I will try these as well.

          I've used awk a few times, but only by copying commands found through Google. I never tried to fully understand the awk language. It's great for parsing!

          Thank you thank you thank you!

          Jian



          • #6
            This is awesome... thanks for the awk and Python scripts.

            Mustapha
