  • Fasta File Editing

    I have a file with text as:

    >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
    >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
    >NSVNPDVSQHSPERHFHTSEGTLC

    I need to change it by adding numbers and moving each amino acid sequence to the next line, basically into FASTA format, as follows:
    >1
    APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
    >2
    ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
    >3
    NSVNPDVSQHSPERHFHTSEGTLC

  • #2
    Code:
    cat foo | sed 's/>//' | awk '{idx+=1;printf(">%i\n%s\n",idx,$0)}'
    or
    Code:
    cat foo | awk '{idx+=1;$1=substr($1,2,length($1));printf(">%i\n%s\n",idx,$1)}'
    or
    Code:
    cat foo | awk '{idx+=1;sub(/>/,sprintf(">%i\n",idx),$1);print $1}'
    among many other possibilities. You'll find that familiarizing yourself with the command line will come in useful.


    • #3
      Also try jEdit.

      Its regex search-and-replace and BeanShell scripting can sort your problem out.


      • #4
        This should work
        Code:
        $ perl -p -i.bak -e '$c+=1; s/>/>$c\n/g' your_file


        • #5
          Thanks GenoMax. The output is as follows:

          >1

          >2
          >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
          >3

          >4
          >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
          >5

          The order of the sequences is right, but it's introducing blank sequences at >1, >3 and >5.

          Could you please look into it?


          • #6
            What OS are you doing this on? Did you edit/open this file on a PC/Mac?

            NOTE: Before you edit/change a file it is important to make a backup copy (especially if you spent a day or two getting it). I have added a cp command below that preserves an original copy should you need to go back to it.

            Try the following first before you use the perl command (it converts the file from Windows to Unix format, if that is the issue, though I am not certain). The perl command saved the original as a backup with the .bak extension and edited your_file in place, so you can't use your_file as it is now. Copy the .bak file back to the original name before you try this:

            Code:
            $ cp your_file.bak your_file.ORIG
            $ cp your_file.bak your_file
            $ awk '{ sub(/\r$/,""); print }' your_file
            Last edited by GenoMax; 08-11-2014, 04:25 PM. Reason: Added notes about keeping an original backup copy
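One thing to keep in mind with the awk command above: it only prints the converted text to the terminal and leaves your_file untouched, so the result has to be redirected into a new file. A minimal sketch, assuming the file really has Windows (CRLF) line endings; strip_cr and the .unix suffix are hypothetical names used only for illustration:

```shell
# Hypothetical helper: remove a trailing carriage return from every line.
# Reads the named file(s), or standard input when no file is given,
# and writes the cleaned text to standard output.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}

# Usage (file names are placeholders):
#   strip_cr your_file > your_file.unix
```

Writing to a separate file keeps the input intact, which fits the advice above about always holding on to an original copy.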


            • #7
              Code:
              sed 's/>//' inputFile | awk '{print ">"NR"\n"$0}'


              • #8
                GenoMax - that didn't do anything. The .bak file has no numbers assigned and when I ran the awk command that was suggested it didn't make any changes or add numbers to the output file.

                Thanks rnaeye. The original file has a sequence #5 that spans two lines. The code treats the second line of that sequence as sequence #6 in the output. I probably need to change the number of characters per line in the original file. Please advise.

                The following are the input and output files:

                INPUT-
                >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                >NSVNPDVSQHSPERHFHTSEGTLC
                >AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                >VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                >MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                OUTPUT-
                >1
                APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                >2
                ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                >3
                NSVNPDVSQHSPERHFHTSEGTLC
                >4
                AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                >5
                VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                >6
                KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                >7
                MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG


                • #9
                  Thanks dpryan - the third command works, but it skips a number after any sequence that spans two lines. For example, sequence #5 has two lines, so the output goes from >5 straight to >7, skipping >6. This shows it better:

                  >4
                  AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                  >5
                  VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                  KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                  >7
                  MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                  I can live with it for now. I'll follow your advice and try to familiarize myself with the command line. Could you please fix the bug in the third command and let me know?


                  • #10
                    Thanks all - I still have the issue with numbering sequences in order, however. I removed the line break inside the two-line sequence, and the output file is now:

                    >1
                    APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                    >2
                    ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                    >3
                    NSVNPDVSQHSPERHFHTSEGTLC
                    >4
                    AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                    >5
                    VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHYKEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                    >7
                    MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                    Please help me fix the issue of numbering sequences in order.......


                    • #11
                      That's less a bug than a feature request, but in any case it's pretty trivial to add support for multi-line entries:

                      Code:
                      cat foo | awk '{if(substr($1,1,1)==">"){idx+=1;sub(/>/,sprintf(">%i\n",idx),$1);}print $1}'
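The command above numbers only the lines that start with ">" and prints continuation lines as they are, so a wrapped sequence stays wrapped. If each sequence should also end up on a single line, the continuation lines can be appended to the current record before printing. A rough sketch of that idea, not a replacement for the command above; number_fasta is a hypothetical name, and it assumes the input looks like the file in this thread (sequence text on the same line as ">"):

```shell
# Hypothetical helper: number each ">" record and join wrapped sequence lines.
# Reads the named file(s), or standard input when no file is given.
number_fasta() {
    awk '
        /^>/ {
            if (n > 0) print seq    # flush the previous record
            n++
            print ">" n             # new header: just the running number
            seq = substr($0, 2)     # sequence text that followed the ">"
            next
        }
        { seq = seq $0 }            # continuation line: append to current record
        END { if (n > 0) print seq }
    ' "$@"
}

# Usage (file names are placeholders):
#   number_fasta foo > foo.numbered.fasta
```

Because the counter only advances on lines beginning with ">", the numbering stays consecutive no matter how many lines a sequence is wrapped over.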


                      • #12
                        Finally.......it all looks good !

                        >1
                        APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                        >2
                        ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                        >3
                        NSVNPDVSQHSPERHFHTSEGTLC
                        >4
                        AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                        >5
                        VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHYKEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                        >6
                        MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG
                        >7
                        MADPDEVIPTVRDVSDAPFVGSDGSNVILNEDSFGGGDNGLEEFRGEGSMGK

                        Thank You all for your time !


                        • #13
                          A more concise version:

                          Code:
                          cat input | awk '/^>/{$1=">"++n"\n"substr($1,2)}1'
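For anyone new to awk, the one-liner packs two idioms together: a pattern (/^>/) that restricts the action to header lines, and a trailing 1, which is an always-true pattern whose default action is to print the line. Here is the same logic spelled out as a sketch; renumber_headers is a hypothetical name. It assumes the lines contain no whitespace, since assigning to $1 makes awk rebuild the line from its fields:

```shell
# Hypothetical helper: expanded form of the concise one-liner.
# Reads the named file(s), or standard input when no file is given.
renumber_headers() {
    awk '
        /^>/ {
            n++                              # count header lines only
            $1 = ">" n "\n" substr($1, 2)    # ">N", a newline, then the text after ">"
        }
        { print }                            # same effect as the trailing pattern "1"
    ' "$@"
}

# Usage (file names are placeholders):
#   renumber_headers input > output.fasta
```

Lines that do not start with ">" pass through unchanged, so wrapped sequences keep their line breaks and the numbers stay consecutive.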
