Seqanswers Leaderboard Ad

**pbluescript** · 06-15-2012, 02:36 AM

Since gff files can vary, can you post a few lines?

**umnklang** · 06-15-2012, 07:19 AM

Sorry I didn't add this before. It's actually a gff file that I've extracted only the CDS from using a perl script. First three lines:

SL254 RefSeq CDS 337 2796 . + 0 ID=SL254:thrA:unknown_transcript_1;Parent=SL254:thrA;locus_tag=SNSL254_A0002;EC_number=2.7.2.4;EC_number=1.1.1.3;note=multifunctional homotetrameric enzyme that catalyzes the phosphorylation of aspartate to form aspartyl-4-phosphate as well as conversion of aspartate semialdehyde to homoserine%3B functions in a number of amino acid biosynthetic pathways;transl_table=11;product=bifunctional aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_002039229.1;db_xref=GI:194446390;db_xref=GeneID:6487056;exon_number=1
SL254 RefSeq CDS 2801 3727 . + 0 ID=SL254:thrB:unknown_transcript_1;Parent=SL254:thrB;locus_tag=SNSL254_A0003;EC_number=2.7.1.39;note=catalyzes the formation of O-phospho-L-homoserine from L-homoserine in threonine biosynthesis from asparate;transl_table=11;product=homoserine kinase;protein_id=YP_002039230.1;db_xref=GI:194444692;db_xref=GeneID:6486461;exon_number=1
SL254 RefSeq CDS 3734 5017 . + 0 ID=SL254:thrC:unknown_transcript_1;Parent=SL254:thrC;locus_tag=SNSL254_A0004;EC_number=4.2.3.1;note=catalyzes the formation of L-threonine from O-phospho-L-homoserine;transl_table=11;product=threonine synthase;protein_id=YP_002039231.1;db_xref=GI:194444103;db_xref=GeneID:6484714;exon_number=1

**pbluescript** · 06-15-2012, 07:36 AM

It's not perl, but this worked for me:

cut -f9 your_file.gff | awk -F\; '{print $(NF-4)}'

**umnklang** · 06-15-2012, 08:27 AM

That's great! Could you explain how that works? Also how to write the results to a file and possibly make the output 2 columns, one column would be "GeneID" and one column would be "product"?

**pbluescript** · 06-15-2012, 08:50 AM

This will get the GeneID and the product fields and output them as a tab-separated file.

cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file

cut -f9 takes the 9th field in your_file.gff.
"|" pipes (sends) the output to the next command.
Under awk, the -F\; option tells it the input has ";" as a field separator. OFS="\t" makes the output a tab delimited file. print $(NF-1) prints the field to the left of the last field separator, which is the one containing the geneID. print $(NF-4) prints the field to the left of the fourth field separator from the end of the line, which is the one containing the product.

There are a huge number of ways to parse out the data you want under unix or perl. This was just the first one that came to mind since I've been scripting in awk all morning...

**GenoMax** · 06-15-2012, 09:02 AM

Small extension as noted below to capture the output in a file.

Originally posted by pbluescript View Post

This will get the GeneID and the product fields and output them as a tab-separated file.

cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file_name

**pbluescript** · 06-15-2012, 09:03 AM

Originally posted by GenoMax View Post

Small extension as noted below to capture the output in a file.

Thank you! I overlooked that part.

**umnklang** · 06-15-2012, 10:45 AM

Thank you. I am a microbiologist with a bit of coding experience, I am trying to learn perl on my own, but needed this quick fix because I am sitting on a mound of data.

Mahalo,

-K

**umnklang** · 06-26-2012, 12:08 PM

Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

mahalo

**pbluescript** · 06-27-2012, 05:22 AM

Originally posted by umnklang View Post

Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

mahalo

Isn't between "product=" and ";" what you already asked for?

Anyway, getting everything to the right of "product=" is just a variation of the code I already mentioned.

Code:

cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-4),$(NF-3),$(NF-2),$(NF-1),$(NF-0)}'

**umnklang** · 06-27-2012, 05:35 AM

Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.

**pbluescript** · 06-27-2012, 06:34 AM

Originally posted by umnklang View Post

Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.

Ah... that would be an issue then.

Here's a couple ways that worked for me.
Sticking with the cut/awk strategy, but changing the FS to "product=". You are essentially using your search term as a delimiter and then getting everything after that.

Code:

cut -f9 file.gff | awk '{FS="product="} {print "product="$(NF-0)}'

This works, but it is ugly.

Here's a better way:

Code:

sed "s/.*product=/product=/" file.gff

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Perl script for annotating DEGseq results

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News