Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • FANCY BLAST PARSING NEEDED- extract fasta sequences

    Hi there,

    Im hoping I can get some help from this forum as my programming skills arn't up to pr to tackle this.

    I am comparing one genome (bacterial) against a custom database of closely related strains, to then look at common point mutations/adaptations in homologous proteins to which should be reflective of where the isolates came from (this is to give context, I hope it made sense). In general the rough workflow of this is :

    1. blastp my genome against the custom database and get the top 5 hits
    2. compare parameters such as GC content, Arginine content, hydrophobic residues etc. between an ENTIRE protein in my genome of interest and its top 5 matches.
    3. Calculate which proteins are significantly different from their matches and in what way.


    I have the ability and scripts to do all these steps separately when running on a tester file of just 20 genes. However, my genome has 4000 + genes, and this simply must be automated.

    To do this, it seems critical that I must be able: to:
    a) bin my top 5 hits with the query sequence
    b) output the fasta files in a manner that can allow for downstream analysis and especially computation of these.




    To accomplish this; I think an output file of
    >Query_protein1_full sequence
    agagagagaga
    > Hit1_protein1_full sequence
    agagagag
    > Hit2_protein1_full sequence
    agagagag
    > Hit3_protein1_full sequence
    agagagag
    > Hit4_protein1_full sequence
    agagagag
    > Hit5_protein1_full sequence
    agagagag
    > Query_protein2_full sequence
    agagagag
    > Hit1_protein2_full sequence
    agagagag



    etc etc would work. Does anyone have any ideas on how I could script this out? This is above my programming skills, though I am learning.



    Cheers

  • #2
    If I understand it correct, you just want to get the fasta sequences from your hits, don't you?

    In this case, you can use the NCBI toolkit to do it. Have a look at this thread http://seqanswers.com/forums/showthread.php?t=8805 where it is explained.

    Comment


    • #3
      I would use tabular blast output and then parse the hit ids and extract the seqs from your custom db with blastdbcmd. I imagine a very simple for loop would suffice..
      savetherhino.org

      Comment


      • #4
        I would echo rhinoceros' answer. Although I would use tgicl and the cdbfasta/cdbyank programs. I would take your query fasta file, convert that to tab format, chunk your file into 200 subfiles using linux split. Convert each of those back to fasta. Run blast in parallel on each query subfile using blast output as tab format for easy parsing. Then push that through the scripts you have. And finally use cdbyank to get the sequences of interest.

        Comment


        • #5
          Hi gang, thanks for the quick reply.

          The problem is a bit more complex than just being able to parse out my sequences. I need to be able to keep my query sequence binned with my hit sequences. In the blast output, what I get is my query ID in one column, and the hit ID in the adjacent column. The hits are easy to keep together, but the tab output doesnt make for easy parsing of the 6 IDs in one group.

          If I had 5 hits for sure for every fasta sequence in my query file, I would try to write a script that extracts the ID every X lines, and groups them with the adjacent 5 lines of hits, but some fasta sequences in my query files have no hits, or only one or two. Because there are so many genes in this file, it needs to be automated rather than me goin in and cutting and pasting into a new file.


          Does anyone know how to parse out the sequences (or IDS that I could then fetch the sequences with) through a method that allows me to group my query and hits together on the same line?

          Biobob, Im not sure I understand your response, is the chunking the file into 200 subfiles to save on computing, or does this somehow change the format of the output and Im not following.



          Thanks again community!

          Comment


          • #6
            Hi, the chunking would save on compute time as I had thought that was your main issue.

            What format do you need as input into your scripts?

            query 1, hit 1, hit 2, ...
            query 2, hit 1, hit 2, ...


            ?

            Seems like awk would do the trick.

            awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blast_hits.tab

            Comment


            • #7
              Originally posted by distroboto View Post
              Hi gang, thanks for the quick reply.

              The problem is a bit more complex than just being able to parse out my sequences. I need to be able to keep my query sequence binned with my hit sequences. In the blast output, what I get is my query ID in one column, and the hit ID in the adjacent column. The hits are easy to keep together, but the tab output doesnt make for easy parsing of the 6 IDs in one group.
              Why not? Assuming you have limited the number of hits to your desired max:

              Code:
              cut -f1 blastOutput | sort -u > queryIDs
              
              for next in $(cat queryIDs)
              do
              grep -w $next blastOutput | cut -f2 >> $next.hitIds
              done
              ..then onto blastdbcmd with your hitIds.
              savetherhino.org

              Comment


              • #8
                OK, so I figured out how to run this code (as a newbie, that was actually surprisngly hard..and then simple!)

                (saved as text file, then ran bash textfile.txt)

                I am getting the following error:

                grep: brackets ([ ]) not balanced



                Can I also confirm that the variables: (this is for me to try to understand the code and learn a bit here) :

                BlastOutput means the name of my blastp output file.
                -f1 means column one; f2 means column 2

                and queryIDs is a temporary variable? given to the values it finds in column one.


                In which case this code is saying "hey look in column 1, cut the query ID, sort by query IDs, everytime you encounter a new one, cut column 2 values?
                Last edited by distroboto; 11-06-2014, 06:51 AM.

                Comment


                • #9
                  I think we are assuming you are at a linux terminal.

                  If you have a blast table output file named blastOutput, to try to awk command I suggested type it out as
                  awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blastOutput

                  If my command works, you should see a bunch of lines scroll past that look like:
                  query1,hit1,hit2,etc
                  query2,hit1,hit2,etc

                  If you want to push that to a file called blastOutput.simplified, do:
                  awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blastOutput >blastOutput.simplified

                  Similar comments to what Rinoceros posted.

                  Comment


                  • #10
                    Originally posted by bioBob View Post
                    Hi, the chunking would save on compute time as I had thought that was your main issue.

                    What format do you need as input into your scripts?

                    query 1, hit 1, hit 2, ...
                    query 2, hit 1, hit 2, ...


                    ?

                    Seems like awk would do the trick.

                    awk '{x[$1]=x[$1]","$2}END{for(j in x) print x[j] }' blast_hits.tab

                    Hi biobob,


                    This format would be good once all of my scripts had finished, but before I get there I will need to pull the fastas and analyze them, and as far as I can tell, the format to be able to do that easily is all on one column.

                    ie:

                    query1
                    hit1
                    hit2
                    hit3
                    hit4
                    query2
                    hit1
                    hit2
                    hit3
                    hit4
                    hit5

                    then pull fastas to same line, THEN run my scripts and get values, THEN use awk so as to be able to get ratio values b/w my hits and querys (the ultimate goal)



                    thanks so much for all the help- I am learning a ton, and also super impressed and thrilled with this community.

                    Comment


                    • #11
                      Ok, easy enough. replace the "," in the awk with "\n"

                      Not sure what you mean by pull fastas to the same line but the above should get you to the format you were hoping for. +/- an error in my awk script.

                      Are you needing the next step to be a single fasta file organized in the same way? Just cat (combine using the linux cat command) the query and subject fasta files together into a single fasta file, then use cdbfasta/cdbyank or the blast tools rhinoceros mentioned on the output of the above awk (cat awk_output | cdbyank combined_fasta.cidx -A >query_hit.fasta) need to look at the flags, I probably wrote the wrong one.

                      Good luck

                      Comment


                      • #12
                        hi biobob!

                        ok, now I'm getting somewhere- the awk commands work, but not exactly as I need them to:


                        its only printing the hit column, but not the query ID. ie my oputput looks like

                        hit1
                        hit2
                        hit3


                        hit1
                        hit2
                        hit3
                        hit4 etc




                        ideally I would like:

                        query1
                        hit1
                        hit2
                        hit3



                        so far my tweaks to the script only result in me printing out the query numerous times (as many times as I have hits) .
                        Im trying to write something that says :

                        print out first query hit ($1), print out adjacent ($2)
                        if next line $1 is equal to previous, print out the adjacent ($2)
                        if not
                        print out the query2 ($1)m print out adjacent ($2)



                        at least, this is what my newborn programming brain is telling me think like, I just dont have the entire toolkit yet to script that out.

                        Comment


                        • #13
                          My bad. I should always test code before posting.

                          Do:
                          awk '{x[$1]=x[$1]"\n"$2}END{for(j in x) print j,x[j]}' blast_file

                          You were correct, you could have added an if in there to get it as well.

                          awk '{if(!x[$1]){x[$1]=$1}else{x[$1]=x[$1]"\n"$2}}END{for(j in x) print x[j]}' blast_file

                          Comment


                          • #14
                            Originally posted by distroboto View Post
                            OK, so I figured out how to run this code (as a newbie, that was actually surprisngly hard..and then simple!)

                            (saved as text file, then ran bash textfile.txt)

                            I am getting the following error:

                            grep: brackets ([ ]) not balanced



                            Can I also confirm that the variables: (this is for me to try to understand the code and learn a bit here) :

                            BlastOutput means the name of my blastp output file.
                            -f1 means column one; f2 means column 2

                            and queryIDs is a temporary variable? given to the values it finds in column one.


                            In which case this code is saying "hey look in column 1, cut the query ID, sort by query IDs, everytime you encounter a new one, cut column 2 values?
                            Hey,

                            Cut the first column of your assumed tabular blast output file (blastOutput), sort for unique values and print them to a file called queryIDs.

                            Code:
                            cut -f1 blastOutput | sort -u > queryIDs
                            So e.g.

                            Query1
                            Query1
                            Query2
                            Query3
                            Query3

                            Would become:

                            Query1
                            Query2
                            Query3


                            And then..

                            Code:
                            for next in $(cat queryIDs)
                            do
                            grep -w $next blastOutput | cut -f2 >> $next.hitIds
                            done
                            Loop through the queryIDs file, find exact match(es) from the blastOutput file, take the second column, and print them to a file that has the same name as the query it's looking (+ .hitIds extension).

                            I don't know why you're getting that grep error.

                            Code:
                            grep -w "^$next" ...
                            Would actually be better so it would look for lines starting with exact matches. You could also put

                            Code:
                            #!/bin/bash
                            IFS=$'\n'
                            to the top of the file if you want to run it as such, so:

                            Code:
                            #!/bin/bash
                            IFS=$'\n'
                            cut -f1 blastOutput | sort -u > queryIDs
                            
                            for next in $(cat queryIDs)
                            do
                            grep -w "^$next" blastOutput | cut -f2 >> $next.hitIds
                            done
                            And remember, tabular blast output (outfmt 6) is assumed for this to work..


                            Also,

                            man grep
                            man sort
                            man cut
                            man whatever


                            man pages are truly awesome..
                            Last edited by rhinoceros; 11-06-2014, 12:00 PM.
                            savetherhino.org

                            Comment


                            • #15
                              This is fun! (really though)

                              So- Biobob- theres some weird happenings with this script- it isnt parsing them out in any order that I can discern. Computers dont do that- so I must be overlooking something.

                              Its not doing
                              query1
                              hit1
                              hit2
                              hit3
                              query2
                              hit1
                              hit2


                              etc
                              but rather



                              query20
                              hit1
                              query60
                              hit1
                              hit2
                              hit3
                              query71
                              hit1
                              hit2
                              hit3


                              I looked at whether the hitIDs were in ascending or descending order, but that doesn't seem to be it. Ah!


                              The "if" alternative does the same wonky order, but also in the first instance of hit seeing query1 only prints column 1, and not column 2, (so it misses one but not column 1)

                              I can't tell if its something wonky in my input file, but its the output of the blasp run, ive tried saving it in various types of txt, and also manually checked to see if there were weird invisible characters, but cant find anything. Ahh!



                              @rhinoceros, many thanks. One of the most interesting things thus far is seeing multiple ways of doing the same thing!!
                              Your script is cool- prints out a billion files with the file name being the query id, and the contents being the hit id. How would I get it to also print the queryid in the contents of the file? (I can concatenate all of these files later)


                              Thanks for the help- I'm finding a lot of the learning is coming from things not working- and then seeing how tweaking various things make it work right!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 08:47 AM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              59 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X