SEQanswers

09-06-2013, 04:45 AM   #1
Shishir
Member
 
Location: Germany

Join Date: Nov 2012
Posts: 22
Parsing the GFF file

Hi all,

I have a gff file like:

gn|nvit|C3905550 assmcg CDS 68 646 . - . asmbl_401
gn|nvit|C3905550 assmcg exon 68 646 . - . asmbl_401
gn|nvit|C3918365 assmcg CDS 42 252 . + . asmbl_443
gn|nvit|C3918365 assmcg CDS 522 705 . + . asmbl_443
gn|nvit|C3918365 assmcg exon 522 705 . + . asmbl_443
gn|nvit|C3930535 assmcg exon 64 888 . - . asmbl_465
gn|nvit|C3930535 assmcg three_prime_utr 64 393 . - . asmbl_465
gn|nvit|C3930535 assmcg CDS 394 699 . - . asmbl_465
gn|nvit|C3930535 assmcg five_prime_utr 700 888 . - . asmbl_465
gn|nvit|C3935122 assmcg exon 4 567 . + . asmbl_476
gn|nvit|C3938828 assmcg CDS 293 745 . + . asmbl_481
gn|nvit|C3938828 assmcg exon 293 745 . + . asmbl_481
gn|nvit|C3942486 assmcg CDS 244 942 . - . asmbl_489
gn|nvit|C3942486 assmcg exon 244 942 . - . asmbl_489
gn|nvit|C3950921 assmcg exon 40 80 . + . asmbl_506
gn|nvit|C3950921 assmcg three_prime_utr 40 80 . + . asmbl_506
gn|nvit|C3950921 assmcg exon 172 253 . + . asmbl_506
gn|nvit|C3950921 assmcg five_prime_utr 172 190 . + . asmbl_506


I want to create a list of all the transcript IDs that have both three_prime_utr and five_prime_utr coordinates, like:
asmbl_465
asmbl_506

I used:
cat final1.gff | perl -ne 's/.*\t(\S+_prime_utr)\t.*transcript_id \"(\S+)\".*/$2\t$1/; print;' | sort -u | perl -ne 'split; print "$_[0]\n" if ($g eq $_[0]); $g = $_[0];' > myutr_list.txt
but it did not work for me.

Thanks!

Last edited by Shishir; 09-06-2013 at 04:48 AM.
09-06-2013, 05:34 AM   #2
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311

Hi- Try this one:

Code:
grep -E 'five_prime_utr|three_prime_utr' final1.gff \
    | cut -f 3,9 \
    | sort -k2,2 -k 1,1 -u \
    | cut -f 2 \
    | uniq -c \
    | awk '{if($1 == 2) print $2}'
1st line: Get the lines with either UTR feature
2nd: Keep only the feature type (column 3) and transcript ID (column 9) columns
3rd: Get unique lines (now each transcript has one line if it has only the 3' UTR or only the 5' UTR, two lines if it has both)
4th: Keep only the transcript ID column
5th: Count how many times each transcript ID occurs
6th: If it occurs twice, the transcript must have both UTRs, so print it
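
The same idea can also be sketched as a single awk pass. This is only an alternative illustration, not the pipeline above: it assumes the file is tab-delimited and that column 9 holds nothing but the transcript ID, as in the sample. The function name utr_both is made up for this example.

```shell
# Sketch: print transcript IDs (column 9) that appear with BOTH UTR
# feature types (column 3). Assumes tab-delimited input where column 9
# is just the ID, as in the sample; utr_both is a hypothetical name.
utr_both() {
    awk -F'\t' '
        $3 == "five_prime_utr"  { five[$9]  = 1 }   # IDs seen with a 5-prime UTR
        $3 == "three_prime_utr" { three[$9] = 1 }   # IDs seen with a 3-prime UTR
        END {
            # Report only the IDs present in both sets
            for (id in five)
                if (id in three) print id
        }
    ' "$@" | sort   # awk array order is unspecified, so sort the result
}
```

It would be called as: utr_both final1.gff > myutr_list.txt
Because the feature type is compared exactly against column 3, a transcript with several five_prime_utr lines but no three_prime_utr line is not reported.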

See if it works...

Dario
09-06-2013, 06:03 AM   #3
Shishir
Member
 
Location: Germany

Join Date: Nov 2012
Posts: 22

Many thanks! It worked for me.

Tags
bash, bioinformatics, gff, linux, perl
