Seqanswers Leaderboard Ad

**WhatsOEver** · 11-04-2014, 12:44 AM

If I understand it correct, you just want to get the fasta sequences from your hits, don't you?

In this case, you can use the NCBI toolkit to do it. Have a look at this thread http://seqanswers.com/forums/showthread.php?t=8805 where it is explained.

**rhinoceros** · 11-04-2014, 02:27 AM

I would use tabular blast output and then parse the hit ids and extract the seqs from your custom db with blastdbcmd. I imagine a very simple for loop would suffice..

**bioBob** · 11-04-2014, 05:23 AM

I would echo rhinoceros' answer. Although I would use tgicl and the cdbfasta/cdbyank programs. I would take your query fasta file, convert that to tab format, chunk your file into 200 subfiles using linux split. Convert each of those back to fasta. Run blast in parallel on each query subfile using blast output as tab format for easy parsing. Then push that through the scripts you have. And finally use cdbyank to get the sequences of interest.

**distroboto** · 11-04-2014, 06:03 AM

Hi gang, thanks for the quick reply.

The problem is a bit more complex than just being able to parse out my sequences. I need to be able to keep my query sequence binned with my hit sequences. In the blast output, what I get is my query ID in one column, and the hit ID in the adjacent column. The hits are easy to keep together, but the tab output doesnt make for easy parsing of the 6 IDs in one group.

If I had 5 hits for sure for every fasta sequence in my query file, I would try to write a script that extracts the ID every X lines, and groups them with the adjacent 5 lines of hits, but some fasta sequences in my query files have no hits, or only one or two. Because there are so many genes in this file, it needs to be automated rather than me goin in and cutting and pasting into a new file.

Does anyone know how to parse out the sequences (or IDS that I could then fetch the sequences with) through a method that allows me to group my query and hits together on the same line?

Biobob, Im not sure I understand your response, is the chunking the file into 200 subfiles to save on computing, or does this somehow change the format of the output and Im not following.

Thanks again community!

**bioBob** · 11-04-2014, 06:12 AM

Hi, the chunking would save on compute time as I had thought that was your main issue.

What format do you need as input into your scripts?

query 1, hit 1, hit 2, ...
query 2, hit 1, hit 2, ...

?

Seems like awk would do the trick.

awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blast_hits.tab

**rhinoceros** · 11-04-2014, 06:12 AM

Originally posted by distroboto View Post

Hi gang, thanks for the quick reply.

The problem is a bit more complex than just being able to parse out my sequences. I need to be able to keep my query sequence binned with my hit sequences. In the blast output, what I get is my query ID in one column, and the hit ID in the adjacent column. The hits are easy to keep together, but the tab output doesnt make for easy parsing of the 6 IDs in one group.

Why not? Assuming you have limited the number of hits to your desired max:

Code:

cut -f1 blastOutput | sort -u > queryIDs

for next in $(cat queryIDs)
do
grep -w $next blastOutput | cut -f2 >> $next.hitIds
done

..then onto blastdbcmd with your hitIds.

**distroboto** · 11-06-2014, 06:43 AM

OK, so I figured out how to run this code (as a newbie, that was actually surprisngly hard..and then simple!)

(saved as text file, then ran bash textfile.txt)

I am getting the following error:

grep: brackets ([ ]) not balanced

Can I also confirm that the variables: (this is for me to try to understand the code and learn a bit here) :

BlastOutput means the name of my blastp output file.
-f1 means column one; f2 means column 2

and queryIDs is a temporary variable? given to the values it finds in column one.

In which case this code is saying "hey look in column 1, cut the query ID, sort by query IDs, everytime you encounter a new one, cut column 2 values?

**bioBob** · 11-06-2014, 06:51 AM

I think we are assuming you are at a linux terminal.

If you have a blast table output file named blastOutput, to try to awk command I suggested type it out as
awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blastOutput

If my command works, you should see a bunch of lines scroll past that look like:
query1,hit1,hit2,etc
query2,hit1,hit2,etc

If you want to push that to a file called blastOutput.simplified, do:
awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blastOutput >blastOutput.simplified

Similar comments to what Rinoceros posted.

**distroboto** · 11-06-2014, 06:56 AM

Originally posted by bioBob View Post

Hi, the chunking would save on compute time as I had thought that was your main issue.

What format do you need as input into your scripts?

query 1, hit 1, hit 2, ...
query 2, hit 1, hit 2, ...

?

Seems like awk would do the trick.

awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blast_hits.tab

Hi biobob,

This format would be good once all of my scripts had finished, but before I get there I will need to pull the fastas and analyze them, and as far as I can tell, the format to be able to do that easily is all on one column.

ie:

query1
hit1
hit2
hit3
hit4
query2
hit1
hit2
hit3
hit4
hit5

then pull fastas to same line, THEN run my scripts and get values, THEN use awk so as to be able to get ratio values b/w my hits and querys (the ultimate goal)

thanks so much for all the help- I am learning a ton, and also super impressed and thrilled with this community.

**bioBob** · 11-06-2014, 07:17 AM

Ok, easy enough. replace the "," in the awk with "\n"

Not sure what you mean by pull fastas to the same line but the above should get you to the format you were hoping for. +/- an error in my awk script.

Are you needing the next step to be a single fasta file organized in the same way? Just cat (combine using the linux cat command) the query and subject fasta files together into a single fasta file, then use cdbfasta/cdbyank or the blast tools rhinoceros mentioned on the output of the above awk (cat awk_output | cdbyank combined_fasta.cidx -A >query_hit.fasta) need to look at the flags, I probably wrote the wrong one.

Good luck

**distroboto** · 11-06-2014, 08:33 AM

hi biobob!

ok, now I'm getting somewhere- the awk commands work, but not exactly as I need them to:

its only printing the hit column, but not the query ID. ie my oputput looks like

hit1
hit2
hit3

hit1
hit2
hit3
hit4 etc

ideally I would like:

query1
hit1
hit2
hit3

so far my tweaks to the script only result in me printing out the query numerous times (as many times as I have hits)

.
Im trying to write something that says :

print out first query hit ($1), print out adjacent ($2)
if next line $1 is equal to previous, print out the adjacent ($2)
if not
print out the query2 ($1)m print out adjacent ($2)

at least, this is what my newborn programming brain is telling me think like, I just dont have the entire toolkit yet to script that out.

**bioBob** · 11-06-2014, 08:49 AM

My bad. I should always test code before posting.

Do:
awk '{x[$1]=x[$1]"\n"$2}END{for(j in x) print j,x[j]}' blast_file

You were correct, you could have added an if in there to get it as well.

awk '{if(!x[$1]){x[$1]=$1}else{x[$1]=x[$1]"\n"$2}}END{for(j in x) print x[j]}' blast_file

**rhinoceros** · 11-06-2014, 11:53 AM

Originally posted by distroboto View Post

OK, so I figured out how to run this code (as a newbie, that was actually surprisngly hard..and then simple!)

(saved as text file, then ran bash textfile.txt)

I am getting the following error:

grep: brackets ([ ]) not balanced

Can I also confirm that the variables: (this is for me to try to understand the code and learn a bit here) :

BlastOutput means the name of my blastp output file.
-f1 means column one; f2 means column 2

and queryIDs is a temporary variable? given to the values it finds in column one.

In which case this code is saying "hey look in column 1, cut the query ID, sort by query IDs, everytime you encounter a new one, cut column 2 values?

Hey,

Cut the first column of your assumed tabular blast output file (blastOutput), sort for unique values and print them to a file called queryIDs.

Code:

cut -f1 blastOutput | sort -u > queryIDs

So e.g.

Query1
Query1
Query2
Query3
Query3

Would become:

Query1
Query2
Query3

And then..

Code:

for next in $(cat queryIDs)
do
grep -w $next blastOutput | cut -f2 >> $next.hitIds
done

Loop through the queryIDs file, find exact match(es) from the blastOutput file, take the second column, and print them to a file that has the same name as the query it's looking (+ .hitIds extension).

I don't know why you're getting that grep error.

Code:

grep -w "^$next" ...

Would actually be better so it would look for lines starting with exact matches. You could also put

Code:

#!/bin/bash
IFS=$'\n'

to the top of the file if you want to run it as such, so:

Code:

#!/bin/bash
IFS=$'\n'
cut -f1 blastOutput | sort -u > queryIDs

for next in $(cat queryIDs)
do
grep -w "^$next" blastOutput | cut -f2 >> $next.hitIds
done

And remember, tabular blast output (outfmt 6) is assumed for this to work..

Also,

man grep
man sort
man cut
man whatever

man pages are truly awesome..

**distroboto** · 11-06-2014, 01:26 PM

This is fun! (really though)

So- Biobob- theres some weird happenings with this script- it isnt parsing them out in any order that I can discern. Computers dont do that- so I must be overlooking something.

Its not doing
query1
hit1
hit2
hit3
query2
hit1
hit2

etc
but rather

query20
hit1
query60
hit1
hit2
hit3
query71
hit1
hit2
hit3

I looked at whether the hitIDs were in ascending or descending order, but that doesn't seem to be it. Ah!

The "if" alternative does the same wonky order, but also in the first instance of hit seeing query1 only prints column 1, and not column 2, (so it misses one but not column 1)

I can't tell if its something wonky in my input file, but its the output of the blasp run, ive tried saving it in various types of txt, and also manually checked to see if there were weird invisible characters, but cant find anything. Ahh!

@rhinoceros, many thanks. One of the most interesting things thus far is seeing multiple ways of doing the same thing!!
Your script is cool- prints out a billion files with the file name being the query id, and the contents being the hit id. How would I get it to also print the queryid in the contents of the file? (I can concatenate all of these files later)

Thanks for the help- I'm finding a lot of the learning is coming from things not working- and then seeing how tweaking various things make it work right!

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

FANCY BLAST PARSING NEEDED- extract fasta sequences

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News