Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Perl script for annotating DEGseq results

    Hi all,

    My output files from DEGseq and edgeR all contain differentially expressed GeneID's from a gff file and I would like a script that returns the "product=" annotation of those GeneID's from the gff file.

    Anyone have a script for this already?

  • #2
    Since gff files can vary, can you post a few lines?

    Comment


    • #3
      Sorry I didn't add this before. It's actually a gff file that I've extracted only the CDS from using a perl script. First three lines:

      SL254 RefSeq CDS 337 2796 . + 0 ID=SL254:thrA:unknown_transcript_1;Parent=SL254:thrA;locus_tag=SNSL254_A0002;EC_number=2.7.2.4;EC_number=1.1.1.3;note=multifunctional homotetrameric enzyme that catalyzes the phosphorylation of aspartate to form aspartyl-4-phosphate as well as conversion of aspartate semialdehyde to homoserine%3B functions in a number of amino acid biosynthetic pathways;transl_table=11;product=bifunctional aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_002039229.1;db_xref=GI:194446390;db_xref=GeneID:6487056;exon_number=1
      SL254 RefSeq CDS 2801 3727 . + 0 ID=SL254:thrB:unknown_transcript_1;Parent=SL254:thrB;locus_tag=SNSL254_A0003;EC_number=2.7.1.39;note=catalyzes the formation of O-phospho-L-homoserine from L-homoserine in threonine biosynthesis from asparate;transl_table=11;product=homoserine kinase;protein_id=YP_002039230.1;db_xref=GI:194444692;db_xref=GeneID:6486461;exon_number=1
      SL254 RefSeq CDS 3734 5017 . + 0 ID=SL254:thrC:unknown_transcript_1;Parent=SL254:thrC;locus_tag=SNSL254_A0004;EC_number=4.2.3.1;note=catalyzes the formation of L-threonine from O-phospho-L-homoserine;transl_table=11;product=threonine synthase;protein_id=YP_002039231.1;db_xref=GI:194444103;db_xref=GeneID:6484714;exon_number=1

      Comment


      • #4
        It's not perl, but this worked for me:

        cut -f9 your_file.gff | awk -F\; '{print $(NF-4)}'

        Comment


        • #5
          That's great! Could you explain how that works? Also how to write the results to a file and possibly make the output 2 columns, one column would be "GeneID" and one column would be "product"?

          Comment


          • #6
            This will get the GeneID and the product fields and output them as a tab-separated file.

            cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file

            cut -f9 takes the 9th field in your_file.gff.
            "|" pipes (sends) the output to the next command.
            Under awk, the -F\; option tells it the input has ";" as a field separator. OFS="\t" makes the output a tab delimited file. print $(NF-1) prints the field to the left of the last field separator, which is the one containing the geneID. print $(NF-4) prints the field to the left of the fourth field separator from the end of the line, which is the one containing the product.

            There are a huge number of ways to parse out the data you want under unix or perl. This was just the first one that came to mind since I've been scripting in awk all morning...
            Last edited by pbluescript; 06-15-2012, 10:47 AM. Reason: Edited based on GenoMax's suggestion.

            Comment


            • #7
              Small extension as noted below to capture the output in a file.

              Originally posted by pbluescript View Post
              This will get the GeneID and the product fields and output them as a tab-separated file.

              cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file_name

              Comment


              • #8
                Originally posted by GenoMax View Post
                Small extension as noted below to capture the output in a file.
                Thank you! I overlooked that part.

                Comment


                • #9
                  Thank you. I am a microbiologist with a bit of coding experience, I am trying to learn perl on my own, but needed this quick fix because I am sitting on a mound of data.

                  Mahalo,

                  -K

                  Comment


                  • #10
                    Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                    mahalo

                    Comment


                    • #11
                      Originally posted by umnklang View Post
                      Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                      mahalo
                      Isn't between "product=" and ";" what you already asked for?

                      Anyway, getting everything to the right of "product=" is just a variation of the code I already mentioned.

                      Code:
                      cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-4),$(NF-3),$(NF-2),$(NF-1),$(NF-0)}'

                      Comment


                      • #12
                        Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.

                        Comment


                        • #13
                          Originally posted by umnklang View Post
                          Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.
                          Ah... that would be an issue then.

                          Here's a couple ways that worked for me.
                          Sticking with the cut/awk strategy, but changing the FS to "product=". You are essentially using your search term as a delimiter and then getting everything after that.

                          Code:
                          cut -f9 file.gff | awk '{FS="product="} {print "product="$(NF-0)}'
                          This works, but it is ugly.

                          Here's a better way:
                          Code:
                          sed "s/.*product=/product=/" file.gff

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Advancing Precision Medicine for Rare Diseases in Children
                            by seqadmin




                            Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                            12-16-2024, 07:57 AM
                          • seqadmin
                            Recent Advances in Sequencing Technologies
                            by seqadmin



                            Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                            Long-Read Sequencing
                            Long-read sequencing has seen remarkable advancements,...
                            12-02-2024, 01:49 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 12-17-2024, 10:28 AM
                          0 responses
                          26 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-13-2024, 08:24 AM
                          0 responses
                          43 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-12-2024, 07:41 AM
                          0 responses
                          29 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 12-11-2024, 07:45 AM
                          0 responses
                          42 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X