  • Fasta File Editing

    I have a file with text as:

    >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
    >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
    >NSVNPDVSQHSPERHFHTSEGTLC

    I need to change it by adding numbers and moving the amino acid sequence to the next line, basically into FASTA format, as follows:
    >1
    APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
    >2
    ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
    >3
    NSVNPDVSQHSPERHFHTSEGTLC

  • #2
    Code:
    cat foo | sed 's/>//' | awk '{idx+=1;printf(">%i\n%s\n",idx,$0)}'
    or
    Code:
    cat foo | awk '{idx+=1;$1=substr($1,2,length($1));printf(">%i\n%s\n",idx,$1)}'
    or
    Code:
    cat foo | awk '{idx+=1;sub(/>/,sprintf(">%i\n",idx),$1);print $1}'
    among many other possibilities. You'll find that familiarizing yourself with the command line will come in useful.


    • #3
      Also try jEdit.

      Regex and BeanShell can sort your problem out.


      • #4
        This should work
        Code:
        $ perl -p -i.bak -e '$c+=1; s/>/>$c\n/g' your_file


        • #5
          Thanks GenoMax. The output is as follows:

          >1

          >2
          >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
          >3

          >4
          >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
          >5

          The order of the sequences is right, but it's introducing blank sequences at >1, >3 and >5.

          Could you please look into it?


          • #6
            What OS are you doing this on? Did you edit/open this file on a PC/Mac?

            NOTE: Before you edit or change a file, it is important to make a backup copy (especially if you spent a day or two getting it). I have added a cp command below that preserves an original copy should you need to go back to it.

            Try the following before you use the perl command again. It converts the file from Windows to Unix line endings, though I am not certain that is the issue. The perl command saved a backup of the original with a .bak extension and modified the original in place, so you can't use the original now; copy the .bak file back to the original name before you try this:

            Code:
            $ cp your_file.bak your_file.ORIG
            $ cp your_file.bak your_file
            $ awk '{ sub(/\r$/,""); print }' your_file
            Last edited by GenoMax; 08-11-2014, 04:25 PM. Reason: Added notes about keeping an original backup copy
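
            Note that the awk command above prints the converted text to standard output and leaves your_file itself unchanged, so the result has to be redirected to a new file to be kept. A minimal sketch of checking for Windows line endings before and after the conversion, assuming a POSIX shell and an awk that understands \r in regular expressions; count_cr_lines and strip_cr are hypothetical helper names, not part of any tool mentioned in this thread:

```shell
# count_cr_lines (hypothetical helper): print how many input lines end in a carriage return
count_cr_lines() {
  awk '/\r$/ { c++ } END { print c + 0 }'
}

# strip_cr (hypothetical helper): remove one trailing carriage return from every line
strip_cr() {
  awk '{ sub(/\r$/, ""); print }'
}

# Small in-line sample with Windows (CRLF) line endings
printf '>AAA\r\n>BBB\r\n' | count_cr_lines             # lines still ending in CR
printf '>AAA\r\n>BBB\r\n' | strip_cr | count_cr_lines  # lines ending in CR after stripping
```

            With a real file this could be run as strip_cr < your_file > your_file.unix, which keeps the input untouched and writes the converted copy under a new (illustrative) name.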


            • #7
              Code:
              sed 's/>//' inputFile | awk '{print ">"NR"\n"$0}'


              • #8
                GenoMax - that didn't do anything. The .bak file has no numbers assigned and when I ran the awk command that was suggested it didn't make any changes or add numbers to the output file.

                Thanks rnaeye. Sequence #5 in the original file spans two lines, and the code turns its second line into sequence #6 in the output. I probably need to change the number of characters per line in the original file. Please advise.

                The following are the input and output files:

                INPUT-
                >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                >NSVNPDVSQHSPERHFHTSEGTLC
                >AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                >VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                >MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                OUTPUT-
                >1
                APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                >2
                ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                >3
                NSVNPDVSQHSPERHFHTSEGTLC
                >4
                AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                >5
                VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                >6
                KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                >7
                MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG


                • #9
                  Thanks dpryan - the third command works, but it skips a number after any sequence that spans two lines. For example, sequence #5 spans two lines, so the output goes from >5 to >7, skipping >6. This shows it better:

                  >4
                  AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                  >5
                  VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                  KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                  >7
                  MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                  I can live with it for now. I'll follow your advice and try to familiarize myself with the command line. Could you please fix the bug in the third command and let me know?


                  • #10
                    Thanks all - I still have the issue with numbering the sequences in order. I removed the line break inside the two-line sequence and now have this output file:

                    >1
                    APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                    >2
                    ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                    >3
                    NSVNPDVSQHSPERHFHTSEGTLC
                    >4
                    AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                    >5
                    VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHYKEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                    >7
                    MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                    Please help me fix the numbering so that the sequences are numbered in order.


                    • #11
                      That's less a bug than a feature request, but in any case it's pretty trivial to add support for multi-line entries:

                      Code:
                      cat foo | awk '{if(substr($1,1,1)==">"){idx+=1;sub(/>/,sprintf(">%i\n",idx),$1);}print $1}'
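
                      The same idea can be taken one step further so that wrapped sequences also end up on a single line, which is what was done by hand earlier in the thread. This is only a sketch under stated assumptions: every record starts with a line beginning with ">", the sequence itself follows the ">" on that same line, and number_fasta is a hypothetical helper name.

```shell
# number_fasta (hypothetical helper): read the ">SEQUENCE" format on stdin,
# write numbered FASTA with each sequence joined onto one line
number_fasta() {
  awk '
    { sub(/\r$/, "") }                    # tolerate Windows line endings
    /^>/ {                                # a line starting with ">" begins a new record
      if (n) printf "\n"                  # close the previous sequence line
      n++                                 # the counter advances on header lines only
      printf ">%d\n%s", n, substr($0, 2)  # numbered header, then the sequence text
      next
    }
    { printf "%s", $0 }                   # continuation line: append to the current sequence
    END { if (n) printf "\n" }            # close the last sequence line
  '
}

# Second record is wrapped over two lines in this small sample
printf '>AAA\n>BBB\nCCC\n>DDD\n' | number_fasta
```

                      Because the counter only advances on lines that start with ">", continuation lines no longer consume a number, and because they are appended rather than printed on their own line, each record comes out as exactly two lines.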


                      • #12
                        Finally, it all looks good!

                        >1
                        APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                        >2
                        ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                        >3
                        NSVNPDVSQHSPERHFHTSEGTLC
                        >4
                        AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                        >5
                        VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHYKEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                        >6
                        MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG
                        >7
                        MADPDEVIPTVRDVSDAPFVGSDGSNVILNEDSFGGGDNGLEEFRGEGSMGK

                        Thank you all for your time!

                        Comment


                        • #13
                          A more concise version:

                          Code:
                          cat input |  awk '/^>/{$1=">"++n"\n"substr($1,2)}1'
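
                          For readers new to awk, the one-liner can be unpacked into an expanded form. This is a sketch of what it appears to do, assuming sequence lines contain no whitespace (assigning to $1 makes awk rebuild the line from its fields); renumber is a hypothetical function name.

```shell
# renumber (hypothetical helper): expanded form of the concise one-liner
renumber() {
  awk '
    /^>/ {                             # only lines that start with ">"
      n++                              # count header lines (the ++n in the one-liner)
      $1 = ">" n "\n" substr($1, 2)    # field 1 becomes ">n", a newline, then the sequence
    }
    { print }                          # print every line; the trailing 1 does the same
  '
}

# Second record has a continuation line, which is printed unchanged
printf '>AAA\n>BBB\nCCC\n' | renumber
```

                          Continuation lines do not match /^>/, so they are printed as they are and do not advance the counter.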
