04-14-2013, 04:20 PM   #1
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
print first occurrence of a line

Hi all,

I extracted ORFs from an initial FASTA file and now I want to get the longest ORF for each transcript.

After extracting the sizes of the ORFs with faSize and sorting them by size, the command I normally use is:

perl -ane'print unless $x{$F[0]}++'
This time I have a problem using the perl command.

After having extracted the size and sorted the transcripts I have something like this:

    Singlet_1000_61 3844

    Singlet_2000_73 3508

    Singlet_1000_62 3081

    Singlet_2000_62 3008

    Singlet_3500_48 2973

    Singlet_4000_48 2964

    Singlet_3500_54 2863

What I want is:

    Singlet_1000_61 3844

    Singlet_2000_73 3508

    Singlet_3500_48 2973

The perl command is not working in this case.

Do you have any suggestions on how I can make it work?

Or an awk command?

Thanks for your help
uqfgaiti
04-15-2013, 12:24 AM   #2
Senior Member
Location: .

Join Date: Mar 2011
Posts: 157

Try, they love this sort of thing, but put your thick skin on.
bruce01
04-15-2013, 06:43 AM   #3
Rick Westerman
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Or, as a hint, use a hash to keep track of whether each transcript has already been printed. Hashes are a wonderful data structure to know about.
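
A minimal sketch of that hint in awk (an awk array plays the role of the perl hash). It assumes the transcript name is the ORF ID with its trailing _<number> removed, so Singlet_1000_61 belongs to Singlet_1000. That is a guess from the sample data, not something stated in the thread. The original one-liner keys on the whole first column, so every ORF looks unique and nothing gets filtered. The names orfs, seen and longest_per_transcript are made up for the example.

```shell
# Sample input copied from post #1: ORF ID and length, already sorted by length (descending).
orfs='Singlet_1000_61 3844
Singlet_2000_73 3508
Singlet_1000_62 3081
Singlet_2000_62 3008
Singlet_3500_48 2973
Singlet_4000_48 2964
Singlet_3500_54 2863'

# Assumption: transcript ID = first column minus the trailing "_<number>".
longest_per_transcript=$(printf '%s\n' "$orfs" | awk '{
    key = $1
    sub(/_[0-9]+$/, "", key)   # Singlet_1000_61 -> Singlet_1000
    if (!seen[key]++) print    # print only the first line seen for each transcript
}')

printf '%s\n' "$longest_per_transcript"
```

Because the input is sorted by size, the first line seen for each transcript is its longest ORF, and the sorted order is preserved. Note that this also keeps Singlet_4000_48, which is missing from the wanted output in post #1; under the assumption above it is the only ORF of Singlet_4000, so it is printed.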
westerman
04-16-2013, 04:59 AM   #4
Just a member
Location: Southern EU

Join Date: Nov 2012
Posts: 103

I don't get it. What do the numbers in your input refer to? If the first number is the ID of the transcript and the last one is the length, I would do:

tac list | awk -F "_" '{t[$2]=$0}END{for (i in t)print t[i]}'
If the ID of each transcript is the whole first column, then it is even simpler:

tac list | awk '{t[$1]=$0}END{for (i in t)print t[i]}'
Pipe the output through sort if needed.
Since you already sorted them by size, the script reads the list from the end (tac) and only remembers the last occurrence it sees for each ID, which is the longest ORF.
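
To make this concrete, here is a sketch of the first command run on the sample from post #1, with a sort appended. The order of for (i in t) is unspecified in awk, so the sort step is what restores the ordering by size. Assumptions: tac is available (it ships with GNU coreutils), and the variable names orfs and longest_by_id are only for illustration.

```shell
# Sample input copied from post #1, sorted by length (descending).
orfs='Singlet_1000_61 3844
Singlet_2000_73 3508
Singlet_1000_62 3081
Singlet_2000_62 3008
Singlet_3500_48 2973
Singlet_4000_48 2964
Singlet_3500_54 2863'

# tac reverses the list, so the longest ORF of each ID is read last.
# With -F "_" the second field is the number after "Singlet" (1000, 2000, ...).
# Each new line overwrites t[ID]; at the end only the last one read remains.
# sort -k2,2nr then orders the surviving lines by length, largest first.
longest_by_id=$(printf '%s\n' "$orfs" |
    tac |
    awk -F "_" '{ t[$2] = $0 } END { for (i in t) print t[i] }' |
    sort -k2,2nr)

printf '%s\n' "$longest_by_id"
```

On this sample the result has four lines (Singlet_1000_61, Singlet_2000_73, Singlet_3500_48 and Singlet_4000_48), because Singlet_4000_48 is the only line with ID 4000.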
syfo
