Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Perl script for annotating DEGseq results

    Hi all,

    My output files from DEGseq and edgeR all contain differentially expressed GeneID's from a gff file and I would like a script that returns the "product=" annotation of those GeneID's from the gff file.

    Anyone have a script for this already?

  • #2
    Since gff files can vary, can you post a few lines?

    Comment


    • #3
      Sorry I didn't add this before. It's actually a gff file that I've extracted only the CDS from using a perl script. First three lines:

      SL254 RefSeq CDS 337 2796 . + 0 ID=SL254:thrA:unknown_transcript_1;Parent=SL254:thrA;locus_tag=SNSL254_A0002;EC_number=2.7.2.4;EC_number=1.1.1.3;note=multifunctional homotetrameric enzyme that catalyzes the phosphorylation of aspartate to form aspartyl-4-phosphate as well as conversion of aspartate semialdehyde to homoserine%3B functions in a number of amino acid biosynthetic pathways;transl_table=11;product=bifunctional aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_002039229.1;db_xref=GI:194446390;db_xref=GeneID:6487056;exon_number=1
      SL254 RefSeq CDS 2801 3727 . + 0 ID=SL254:thrB:unknown_transcript_1;Parent=SL254:thrB;locus_tag=SNSL254_A0003;EC_number=2.7.1.39;note=catalyzes the formation of O-phospho-L-homoserine from L-homoserine in threonine biosynthesis from asparate;transl_table=11;product=homoserine kinase;protein_id=YP_002039230.1;db_xref=GI:194444692;db_xref=GeneID:6486461;exon_number=1
      SL254 RefSeq CDS 3734 5017 . + 0 ID=SL254:thrC:unknown_transcript_1;Parent=SL254:thrC;locus_tag=SNSL254_A0004;EC_number=4.2.3.1;note=catalyzes the formation of L-threonine from O-phospho-L-homoserine;transl_table=11;product=threonine synthase;protein_id=YP_002039231.1;db_xref=GI:194444103;db_xref=GeneID:6484714;exon_number=1

      Comment


      • #4
        It's not perl, but this worked for me:

        cut -f9 your_file.gff | awk -F\; '{print $(NF-4)}'

        Comment


        • #5
          That's great! Could you explain how that works? Also how to write the results to a file and possibly make the output 2 columns, one column would be "GeneID" and one column would be "product"?

          Comment


          • #6
            This will get the GeneID and the product fields and output them as a tab-separated file.

            cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file

            cut -f9 takes the 9th field in your_file.gff.
            "|" pipes (sends) the output to the next command.
            Under awk, the -F\; option tells it the input has ";" as a field separator. OFS="\t" makes the output a tab delimited file. print $(NF-1) prints the field to the left of the last field separator, which is the one containing the geneID. print $(NF-4) prints the field to the left of the fourth field separator from the end of the line, which is the one containing the product.

            There are a huge number of ways to parse out the data you want under unix or perl. This was just the first one that came to mind since I've been scripting in awk all morning...
            Last edited by pbluescript; 06-15-2012, 10:47 AM. Reason: Edited based on GenoMax's suggestion.

            Comment


            • #7
              Small extension as noted below to capture the output in a file.

              Originally posted by pbluescript View Post
              This will get the GeneID and the product fields and output them as a tab-separated file.

              cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file_name

              Comment


              • #8
                Originally posted by GenoMax View Post
                Small extension as noted below to capture the output in a file.
                Thank you! I overlooked that part.

                Comment


                • #9
                  Thank you. I am a microbiologist with a bit of coding experience, I am trying to learn perl on my own, but needed this quick fix because I am sitting on a mound of data.

                  Mahalo,

                  -K

                  Comment


                  • #10
                    Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                    mahalo

                    Comment


                    • #11
                      Originally posted by umnklang View Post
                      Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                      mahalo
                      Isn't between "product=" and ";" what you already asked for?

                      Anyway, getting everything to the right of "product=" is just a variation of the code I already mentioned.

                      Code:
                      cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-4),$(NF-3),$(NF-2),$(NF-1),$(NF-0)}'

                      Comment


                      • #12
                        Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.

                        Comment


                        • #13
                          Originally posted by umnklang View Post
                          Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.
                          Ah... that would be an issue then.

                          Here's a couple ways that worked for me.
                          Sticking with the cut/awk strategy, but changing the FS to "product=". You are essentially using your search term as a delimiter and then getting everything after that.

                          Code:
                          cut -f9 file.gff | awk '{FS="product="} {print "product="$(NF-0)}'
                          This works, but it is ugly.

                          Here's a better way:
                          Code:
                          sed "s/.*product=/product=/" file.gff

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          31 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          32 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X