SEQanswers

Old 04-14-2013, 04:20 PM   #1
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default print first occurrence of a line

Hi all,

I extracted ORFs from an initial fasta file and now I want to get the longest ORF for each transcript.

After extracting the sizes of the ORFs with faSize and sorting them by size, the command I normally use is:

Code:
perl -ane'print unless $x{$F[0]}++'
This time I have a problem using the perl command.

After extracting the sizes and sorting the ORFs, I have something like this:

Code:
    Singlet_1000_61 3844
    Singlet_2000_73 3508
    Singlet_1000_62 3081
    Singlet_2000_62 3008
    Singlet_3500_48 2973
    Singlet_4000_48 2964
    Singlet_3500_54 2863

What I want is:

    Singlet_1000_61 3844
    Singlet_2000_73 3508
    Singlet_3500_48 2973
...

The perl command is not working in this case.

Do you have any suggestions on how I can make it work?

Or an awk command?

Thanks for your help.
Old 04-15-2013, 12:24 AM   #2
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Try stackoverflow.com; they love this sort of thing, but put your thick skin on.
Old 04-15-2013, 06:43 AM   #3
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Or, as a hint, use a hash to keep track of whether a transcript has already been printed. Hashes are a wonderful data structure to know about.
Old 04-16-2013, 04:59 AM   #4
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 103
Default

I don't get it. What do the numbers in your input refer to? If the first one is the ID of the transcript and the last one the length, I would do:

Code:
tac list | awk -F "_" '{t[$2]=$0}END{for (i in t)print t[i]}'
If the ID of each transcript is the first column, then it is even simpler:

Code:
tac list | awk '{t[$1]=$0}END{for (i in t)print t[i]}'
Pipe the output through sort if needed.
Since you already sorted them by size, the script reads the list from the end (tac) and keeps only the last occurrence of each ID, which is the largest ORF for that ID.
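
A variant of the same idea, as a sketch under stated assumptions: the key is the first column minus its trailing "_<number>" ORF suffix (rather than the second "_"-separated field), blank lines are skipped, and the result is re-sorted by size because awk does not guarantee the order of a for (k in array) loop. The function name longest_orf_awk is hypothetical, and tac is assumed to be available (it ships with GNU coreutils).

```shell
# Hypothetical helper: keep the largest ORF per transcript, output sorted by size.
# Assumption: column 1 is <transcript>_<ORF number>; input is sorted largest first.
longest_orf_awk() {
    tac "$@" | awk 'NF {
        key = $1
        sub(/_[0-9]+$/, "", key)   # drop the trailing ORF number
        best[key] = $1 " " $2      # after tac, the largest ORF is assigned last and wins
    }
    END { for (k in best) print best[k] }' | sort -k2,2nr
}
```

For the sample input in post #1 this should keep one line each for Singlet_1000, Singlet_2000, Singlet_3500 and Singlet_4000. Note that the output is rebuilt from the first two columns, so any leading whitespace in the input is dropped.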
Tags
awk, bash, fasta file, perl
