  • Combine FASTA files in a specific order based on sequence ID

    Hi all,

    I read these forums often, but this is my first post.

    I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).

    What I want to do is combine two multi fasta files in a specific order based on the sequence IDs.

    For example:

    file 1

    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....

    file 2

    >seq1_probe1
    CTTTGTCCTTGTCCTTGGTGGCGG....
    >seq1_probe2
    ATTTCTTCTCATCCTCCTCCTCCTA....
    >seq2_probe1
    ACTAAAAACTCGTTGAAGAAATCC....
    >seq2_probe2
    AGGATATAACACACAGCCATCACC....

    I need the combined file to look like:

    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq1_probe1
    CTTTGTCCTTGTCCTTGGTGGCGG....
    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq1_probe2
    ATTTCTTCTCATCCTCCTCCTCCTA....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....
    >seq2_probe1
    ACTAAAAACTCGTTGAAGAAATCC....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....
    >seq2_probe2
    AGGATATAACACACAGCCATCACC....

    Note that only part of each sequence ID in file 2 (the prefix, e.g. seq1) matches an ID in file 1.

    I'd prefer to use Perl, as that is the language I'm learning, but any solution will suffice.

    Thanks for reading.

  • #2
    I am on my phone and can't type anything elegant (and I don't know Perl), but you can get the job done with basic Linux tools. Look up how to print every other line of a file with sed (search for "sed one-liners" if you can't find it easily) and write the headers and sequences to separate files. Then use the paste command to join those files column-wise, followed by the tr command to convert the tabs to newline characters, and you will get what you want. It is ugly, but you should be able to figure it out quickly. Use the lines you posted above as test files so you don't waste time practicing on large files.

    • #3
      Here's what I had in mind. Save this in a script, give yourself permission to execute it, and then run it as: ./script file1 file2 output

      Code:
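      #!/bin/bash
      # Assumes 2-line FASTA records and exactly two probes per sequence in file 2.
      
      file_1=$1
      file_2=$2
      output=$3
      
      # file 1: headers (odd lines) and sequences (even lines)
      sed -n '1,${p;n}' "$file_1" > temp1
      sed -n '1,${n;p}' "$file_1" > temp2
      # file 2: 1st to 4th line of every group of four lines
      # (probe1 header, probe1 sequence, probe2 header, probe2 sequence)
      sed -n '1,${p;n;n;n}' "$file_2" > temp3
      sed -n '1,${n;p;n;n}' "$file_2" > temp4
      sed -n '1,${n;n;p;n}' "$file_2" > temp5
      sed -n '1,${n;n;n;p}' "$file_2" > temp6
      # Join the columns row by row, then turn every tab into a newline
      paste temp1 temp2 temp3 temp4 temp1 temp2 temp5 temp6 | tr '\t' '\n' > "$output"
      
      rm temp1 temp2 temp3 temp4 temp5 temp6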
      This is quite inefficient with large files but should introduce some basic commands. You can make it a lot faster by running all of the sed commands in parallel and then waiting for them to complete before pasting the results together:

      Code:
      #!/bin/bash
      # Same as above, but the six sed commands run in the background in parallel.
      
      file_1=$1
      file_2=$2
      output=$3
      
      sed -n '1,${p;n}' "$file_1" > temp1 &
      pid1=$!
      sed -n '1,${n;p}' "$file_1" > temp2 &
      pid2=$!
      sed -n '1,${p;n;n;n}' "$file_2" > temp3 &
      pid3=$!
      sed -n '1,${n;p;n;n}' "$file_2" > temp4 &
      pid4=$!
      sed -n '1,${n;n;p;n}' "$file_2" > temp5 &
      pid5=$!
      sed -n '1,${n;n;n;p}' "$file_2" > temp6 &
      pid6=$!
      
      # Block until every background sed has finished writing its temp file
      wait $pid1 $pid2 $pid3 $pid4 $pid5 $pid6
      
      paste temp1 temp2 temp3 temp4 temp1 temp2 temp5 temp6 | tr '\t' '\n' > "$output"
      
      rm temp1 temp2 temp3 temp4 temp5 temp6
      But obviously with Perl you can read in both files and just output the lines in the order you desire, so definitely figure that out too. Still, it is nice to be able to get things done with Linux commands while you learn to do them in a much better fashion with a scripting language, so understanding how this works would also be useful.
      Last edited by Heisman; 07-08-2013, 09:44 PM.
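
The "read both files and output the lines in the order you desire" idea can be sketched without temporary files. This is only a rough sketch in awk rather than Perl, and it rests on assumptions that are not stated in the thread: every record is exactly two lines (header, sequence), and each probe ID is its parent ID followed by a `_probeN` suffix. The function name `combine_by_id` is made up for illustration.

```shell
#!/bin/sh
# combine_by_id PARENTS PROBES
# Prints every probe record preceded by a copy of its parent record.
# Assumptions: 2-line FASTA records; probe ID = parent ID + "_probe<number>".
combine_by_id() {
    awk '
        # First file: store each parent sequence under its ID (header minus ">").
        NR == FNR {
            if (/^>/) { id = substr($1, 2) } else { seq[id] = $0 }
            next
        }
        # Second file, header line: strip the probe suffix to get the parent ID
        # and print the parent record first (skipped if no parent is known).
        /^>/ {
            parent = substr($1, 2)
            sub(/_probe[0-9]+$/, "", parent)
            if (parent in seq) {
                print ">" parent
                print seq[parent]
            }
        }
        # Second file: print every probe line (header and sequence) unchanged.
        { print }
    ' "$1" "$2"
}
```

It would be run as `combine_by_id file1 file2 > output`. Unlike the sed/paste version, it does not depend on there being exactly two probes per sequence, but it does hold all of file 1 in memory.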

      • #4
        Assuming your files are called 1.fa and 2.fa, this hack will work:

        Code:
        samtools faidx 2.fa
        awk '{id=substr($1,2); getline; for (i=1;i<3;i++){print ">"id; print; system("samtools faidx 2.fa "id"_probe"i)}}' 1.fa
        awk is pretty powerful for this kind of thing.
        Last edited by martinghunt; 07-09-2013, 12:23 PM. Reason: didn't need the samtools faidx 1.fa command

        • #5
          As a one-off solution:

          Code:
          sed -e '$!N;s/\n/\t/' file1 > col1
          sed -e '$!N;s/\n/\t/' file2 | sed -e '$!N;s/\n/\t/' > col2
          paste col1 col2 | fmt -5

          Output:
          >seq1
          TTTGGATTACAAAGTTATTTAAATCACATGT....
          >seq1_probe1
          CTTTGTCCTTGTCCTTGGTGGCGG....
          >seq1_probe2
          ATTTCTTCTCATCCTCCTCCTCCTA....
          >seq2
          GCCGTGCCATTTCAATTACAAATACATAATA....
          >seq2_probe1
          ACTAAAAACTCGTTGAAGAAATCC....
          >seq2_probe2
          AGGATATAACACACAGCCATCACC....
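
Note that the output above lists each parent sequence once, whereas the original request repeats the parent before every probe. A possible variant of the same column-joining idea that repeats it is sketched below. It assumes 2-line FASTA records and exactly two probes per parent, listed in the same order as the parents; the function name `interleave_fixed` is hypothetical, and `paste - -` is used here in place of the sed `N` trick to join consecutive lines.

```shell
#!/bin/sh
# interleave_fixed PARENTS PROBES
# Assumptions: 2-line records, exactly two probes per parent, same order in both files.
interleave_fixed() {
    tmpdir=$(mktemp -d) || return 1
    # One row per parent: header<TAB>sequence
    paste - - < "$1" > "$tmpdir/col1"
    # One row per parent: probe1 header, probe1 seq, probe2 header, probe2 seq
    paste - - - - < "$2" > "$tmpdir/col2"
    # Six tab-separated fields per row; emit the parent before each probe
    paste "$tmpdir/col1" "$tmpdir/col2" |
        awk -F'\t' '{
            print $1; print $2; print $3; print $4   # parent, then probe 1
            print $1; print $2; print $5; print $6   # parent again, then probe 2
        }'
    rm -rf "$tmpdir"
}
```

Usage would be `interleave_fixed file1 file2 > output`. If a parent has more or fewer than two probes, the rows no longer line up and the output will be wrong, so this only suits files with a strictly regular layout.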

          • #6
            How can I convert
            >1...>2....>3...>10000 to >1

            and
            >1..>2..>3....>10000 for b.fasta to >2, and
            the same for all 5 of my samples?

            • #7
              Originally posted by huma Asif
              How can I convert
              >1...>2....>3...>10000 to >1

              and
              >1..>2..>3....>10000 for b.fasta to >2, and
              the same for all 5 of my samples?

              I do not understand the question. Can you explain further?

              • #8
                I agree with @HMorrison -- the question needs to be stated better. That said, 'fastx_renamer' (from the FASTX-Toolkit) will rename the sequence identifiers in FASTA files.

                • #9
                  This is the parent thread with "some" additional information: http://seqanswers.com/forums/showthread.php?t=46474

                  • #10
                    I created a FASTA file from a VCF file using target intervals, so the FASTA file now has the same number of headers as there are coordinates in the BED file.

                    What I want to do is concatenate (cat) all of these sequences.
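
If the goal is to merge every record of one FASTA file into a single sequence under one new header (so that `>1 ... >10000` becomes just `>1`), something like the following awk sketch might do it. That reading of the question is an assumption, since it was never fully clarified in the thread, and the function name `merge_records` is made up for illustration.

```shell
#!/bin/sh
# merge_records NEW_ID FASTA
# Drops every header in FASTA and joins all sequence lines into one record
# named NEW_ID. Assumption: this is what "cat all these sequences" means.
merge_records() {
    awk -v id="$1" '
        BEGIN { print ">" id }        # single new header
        !/^>/ { printf "%s", $0 }     # skip old headers, append sequence lines
        END   { printf "\n" }         # terminate the one long sequence line
    ' "$2"
}
```

For example, `merge_records 1 a.fasta > a.merged.fasta` and `merge_records 2 b.fasta > b.merged.fasta` (a.fasta and the output names are hypothetical). The merged sequence is written as one long line, and the intervals are joined in file order with nothing between them.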
