Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to blastx on Drosophila melanogaster (taxid:7227) with raw data?

    Here is some background. I am doing a preliminary study on some raw (cleaned) data. I want to find positive hits (I already used blastn to find negative hits).

    As I understand it, blastx takes a nucleotide query (my raw data) and blasts it against a protein database.


    My question is how can I make my protein database? Is there anyway I can get ncbi to give me a protein database based on drosphila melanogaster?

    Here is what I think:

    blastx -query MYRAWCLEANEDASSEMBLY.fasta -db DBNAME -out YOURASSEMBLY_BLASTX2DBNAME -evalue 4E-5
    Again, to repeat myself for clarity. I need to know what and how to obtain/make DBNAME in the query I wrote above.
    Thank you so much for your patience in reading my question.

    Disclaimer: I have read the ncbi cookbook and command api as found here. My more serious work will follow when I do an actual assembly. Again this is just preliminary study stuff. Thank you.
    Last edited by hlyates; 05-08-2015, 06:54 AM.

  • #2
    Answer below only relates to "how do I make protein blast database".

    Go here: http://www.uniprot.org/uniprot/?quer...D+reviewed:yes

    Click download button to download 3246 "reviewed" D. melanogaster proteins in fasta format. Build the blast database.

    OR

    ftp://ftp.ncbi.nlm.nih.gov/genomes/D.../RELEASE_5_48/

    Get the *.faa files from the chromosome directories, cat the files into one and make blast database.

    or

    Get the protein sequence from: ftp://ftp.ensembl.org/pub/release-79....pep.all.fa.gz Build database.

    or

    You may also be able to subset the drosophila data using the blastdb_alias tool: http://www.ncbi.nlm.nih.gov/books/NBK279693/

    Comment


    • #3
      Originally posted by GenoMax View Post
      Answer below only relates to "how do I make protein blast database".

      Go here: http://www.uniprot.org/uniprot/?quer...D+reviewed:yes

      Click download button to download 3246 "reviewed" D. melanogaster proteins in fasta format. Build the blast database.

      OR

      ftp://ftp.ncbi.nlm.nih.gov/genomes/D.../RELEASE_5_48/

      Get the *.faa files from the chromosome directories, cat the files into one and make blast database.

      or

      Get the protein sequence from: ftp://ftp.ensembl.org/pub/release-79....pep.all.fa.gz Build database.

      or

      You may also be able to subset the drosophila data using the blastdb_alias tool: http://www.ncbi.nlm.nih.gov/books/NBK279693/
      In order to further learn, if I wanted to apply this to another species in the future, do all three have a searchable database? I assume so sir. Thank you for your assistance. Also, is there a way to do this with non-redundant proteins (nr)? That is, would it be possible to make something from All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects? I'm sorry if I am asking stupid question, but I learn best when talking to experts.
      Last edited by hlyates; 05-08-2015, 07:29 AM.

      Comment


      • #4
        No question is stupid. As you said one only learns by asking questions.

        blastdb_alias tool should be able to do it from refseq_protein blast index by sub-setting (I think that would be the database to use).

        Let me point out a subtle distinction.

        Links above give you "known" proteins/transcriptome from Drosophila which is non-redundant information. If you were to download protein sequence file from taxonomy browser: http://www.ncbi.nlm.nih.gov/protein?term=txid7227[Organism] then you are likely to get redundant entries.

        Comment


        • #5
          Originally posted by GenoMax View Post
          No question is stupid. As you said one only learns by asking questions.

          blastdb_alias tool should be able to do it from refseq_protein blast index by sub-setting (I think that would be the database to use).

          Let me point out a subtle distinction.

          Links above give you "known" proteins/transcriptome from Drosophila which is non-redundant information. If you were to download protein sequence file from taxonomy browser: http://www.ncbi.nlm.nih.gov/protein?term=txid7227[Organism] then you are likely to get redundant entries.

          Hmmm. This is going in a more complicated direction than I originally thought. Which is okay, but just want to be sure we are on the right track.


          I talked to my colleague is who isn't knowledgeable about how to do this, but does know what I need to do. He stated that I need to do positive filtering with blastx by doing a search on protein database. He guess I can download them by fruit fly (aka taxid?) proteins.

          What would be the easiest way to do this I guess? I thought it would be the nr database blastx option that we see on the ncbi database as seen here, but only for the fruit fly.


          I'll rephrase my ramblings more concisely with the following.
          1. 1. Do the filtering on the protein database for fruit fly (I have no idea how to do this)
          2. 2. Run makeblastdb on the filtered.fasta file?
            Code:
            makeblastdb filteredproteinfruitfly.fasta -dbtype nucl -hash_index -out myfiltered_db

          Comment


          • #6
            Originally posted by hlyates View Post
            I'll rephrase my ramblings more concisely with the following.
            1. Do the filtering on the protein database for fruit fly (I have no idea how to do this)
            This is becoming a case of blind leading the blind (since I don't exactly know what you are trying to achieve).

            I am interpreting this to mean that if your sequence has a hit in the Drosophila protein database (positive filtering, wording from your post) you want to keep it. Is that correct?

            If so this is straightforward. If you want *every* Drosophila melanogaster protein then use the tax browser download (link posted before) or you could use nr (but you may not be able to filter on tax id since NCBI does not build the blast indexes to include that info. I remember going over this a few weeks ago with either you or someone else on the forum).

            Originally posted by hlyates View Post
            • Run makeblastdb on the filtered.fasta file?
              Code:
              makeblastdb filteredproteinfruitfly.fasta -dbtype nucl -hash_index -out myfiltered_db
            Is this the set with hits to Drosophila? Then do what?
            Last edited by GenoMax; 05-08-2015, 03:19 PM.

            Comment


            • #7
              If all you are interested in is sequences that "hit" Drosophila then I would suggest using BBSplit. Even if it is the other way around (i.e. sequences that are not Drosophila) that should still work.

              What is the size distribution of your input fasta?

              Comment


              • #8
                Originally posted by GenoMax View Post
                This is becoming a case of blind leading the blind (since I don't exactly know what you are trying to achieve).

                I am interpreting this to mean that if your sequence has a hit in the Drosophila protein database (positive filtering, wording from your post) you want to keep it. Is that correct?
                That is correct.


                If so this is straightforward. If you want *every* Drosophila melanogaster protein then use the tax browser download (link posted before) or you could use nr (but you may not be able to filter on tax id since NCBI does not build the blast indexes to include that info. I remember going over this a few weeks ago with either you or someone else on the forum).
                This I do understand how to do, I think. I merely download it as a .fasta file, yes? You did teach me that trick and it was pretty cool.

                Is this the set with hits to Drosophila? Then do what?
                Right, that is my question. Once I download the file. Do I use the makedb command first on the downloaded file from the browser? I intend to run blastx on the db file I create. Based on the ncbi documentation, this is the approach I think we must take.

                Again, thanks geno max.

                Comment


                • #9
                  Originally posted by hlyates View Post
                  Right, that is my question. Once I download the file. Do I use the makedb command first on the downloaded file from the browser? I intend to run blastx on the db file I create. Based on the ncbi documentation, this is the approach I think we must take.

                  Again, thanks geno max.
                  Order of operations:

                  1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
                  2. Make blast database using the fasta file.
                  3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
                  4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.

                  Comment


                  • #10
                    Originally posted by GenoMax View Post
                    Order of operations:

                    1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
                    2. Make blast database using the fasta file.
                    3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
                    4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.
                    I follow perfectly well until step #4. Why is this step necessary? Since we are blasting against all fruit fly proteins (step #1-2), then doesn't blastx give us the subset of proteins that contains sequences from our list?

                    Or am I misunderstanding and faFilter will read my output and only report the significant results?

                    In the meantime, I'll read up on faFilter and maybe can answer my own question, but super appreciate the opportunity to converse with a professional. I'm a one dog show at the moment, so to speak.

                    Comment


                    • #11
                      Step 3 will report sequences from your set that hit a Drosophila protein. Some sequences may hit more than one protein record (and vice versa) since you are starting with *every* Drosophila protein based on the tax ID.

                      I included #4 on assumption that you will want to divide your own squence file into two pools. One that is represented in #3 and other that does not. faFilter is an easy utility to do that.

                      Comment


                      • #12
                        Originally posted by GenoMax View Post
                        Step 3 will report sequences from your set that hit a Drosophila protein. Some sequences may hit more than one protein record (and vice versa) since you are starting with *every* Drosophila protein based on the tax ID.

                        I included #4 on assumption that you will want to divide your own squence file into two pools. One that is represented in #3 and other that does not. faFilter is an easy utility to do that.
                        I have a few wrap up questions Geno Max. Then I will go on my own as much as possible. Your expert advice has already optimized my approach so much. I doubt I can thank you enough.

                        1. I think your quote makes sense. Basically, the faFilter will divide sequences that got a hit and others that did not. Yes?

                        2. I have one last question (with a followup) this is related to blastn output. Rather than make a another question, I want to ask you it here. Last night my blastn output finally finished. I used a cut off of 1E-42. I am seeing records like this:

                        Code:
                        ***** No hits found *****
                        
                        
                        
                        Lambda      K        H
                            1.33    0.621     1.12
                        
                        Gapped
                        Lambda      K        H
                            1.28    0.460    0.850
                        
                        Effective search space used: 3617178691534
                        
                        
                        Query= HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT
                        
                        Length=101
                                                                                              Score                                                    E
                        Sequences producing significant alignments:                          (Bits)  Val                                               ue
                        
                        gb|AF479582.1|  Boophilus microplus paramyosin mRNA, complete cds       182   6e                                               -43
                        
                        
                        >gb|AF479582.1| Boophilus microplus paramyosin mRNA, complete cds
                        Length=2922
                        
                         Score =   182 bits (98),  Expect = 6e-43
                         Identities = 100/101 (99%), Gaps = 0/101 (0%)
                         Strand=Plus/Minus
                        
                        Query  1     GTTCGCGTTGGAAGCGGCGCACGCGGGTCAAGTTCTGCTGGCTCAGACCCTCCTGTTCGT  60
                                     |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
                        Sbjct  2512  GTTCGCGTTGGAAGCGGCGCACGCGGGTCAGGTTCTGCTGGCTCAGACCCTCCTGTTCGT  2453
                        
                        Query  61    TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  101
                                     |||||||||||||||||||||||||||||||||||||||||
                        Sbjct  2452  TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  2412
                        
                        
                        
                        Lambda      K        H
                            1.33    0.621     1.12
                        
                        Gapped
                        Lambda      K        H
                            1.28    0.460    0.850
                        
                        Effective search space used: 3617178691534
                        
                        
                        Query= HISEQ:154:C47MGACXX:8:1101:5762:2190 1:N:0:CGATGT
                        
                        Length=101
                        Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?

                        3. As an expert, what would you suggest I use to parse the blastn output blast.txt? I did not use tabular output for blastn. I did some googling and there is bioperl and biopython, but would feel comfortable asking an expert such as yourself if there is better one out there? I just want to obtain my sequence, the species it hits with, and e-value. Or, I can write my own python script, but afraid I would be reinventing the wheel?


                        After this, I plan on reading up on BLOSUM62 again and other things to deepen my understanding. Again, I am very humbled by your help. You are helping me gain an understanding of were my skill set and knowledge should be. I need more experience too. It is my dream to become better.

                        Comment


                        • #13
                          Originally posted by hlyates View Post
                          Code:
                          ***** No hits found *****
                          
                          
                          
                          Lambda      K        H
                              1.33    0.621     1.12
                          
                          Gapped
                          Lambda      K        H
                              1.28    0.460    0.850
                          
                          Effective search space used: 3617178691534
                          
                          
                          Query= HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT
                          
                          Length=101
                                                                                                Score                                                    E
                          Sequences producing significant alignments:                          (Bits)  Val                                               ue
                          
                          gb|AF479582.1|  Boophilus microplus paramyosin mRNA, complete cds       182   6e                                               -43
                          
                          
                          >gb|AF479582.1| Boophilus microplus paramyosin mRNA, complete cds
                          Length=2922
                          
                           Score =   182 bits (98),  Expect = 6e-43
                           Identities = 100/101 (99%), Gaps = 0/101 (0%)
                           Strand=Plus/Minus
                          
                          Query  1     GTTCGCGTTGGAAGCGGCGCACGCGGGTCAAGTTCTGCTGGCTCAGACCCTCCTGTTCGT  60
                                       |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
                          Sbjct  2512  GTTCGCGTTGGAAGCGGCGCACGCGGGTCAGGTTCTGCTGGCTCAGACCCTCCTGTTCGT  2453
                          
                          Query  61    TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  101
                                       |||||||||||||||||||||||||||||||||||||||||
                          Sbjct  2452  TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  2412
                          
                          
                          
                          Lambda      K        H
                              1.33    0.621     1.12
                          
                          Gapped
                          Lambda      K        H
                              1.28    0.460    0.850
                          
                          Effective search space used: 3617178691534
                          
                          
                          Query= HISEQ:154:C47MGACXX:8:1101:5762:2190 1:N:0:CGATGT
                          
                          Length=101
                          Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?
                          In standard text based BLAST outputs like this each report starts with the "Query=" line and ends ends with the "Effective search space…" line.

                          The "No hits found" is referring to the Query immediately prior to "HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT" Look just above the "***** No hits found *****" line and you should see another line that starts with "Query=". That is the query (raw sequence) for which no hit was found.

                          Comment


                          • #14
                            Originally posted by hlyates View Post
                            I think your quote makes sense. Basically, the faFilter will divide sequences that got a hit and others that did not. Yes?
                            That is correct.

                            Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?
                            @kmcarr already addressed this question. See note about e-value below.

                            3. As an expert, what would you suggest I use to parse the blastn output blast.txt? I did not use tabular output for blastn. I did some googling and there is bioperl and biopython, but would feel comfortable asking an expert such as yourself if there is better one out there? I just want to obtain my sequence, the species it hits with, and e-value. Or, I can write my own python script, but afraid I would be reinventing the wheel?
                            If you know python then just use whatever code/module that suits your needs. A google search should bring up plenty of options.

                            By now, hopefully you have read up a bit on the e-value. This comes from Karlin-Altschul statistics (http://www.ncbi.nlm.nih.gov/BLAST/tu...ltschul-1.html) and ultimately depends on the size of the database you are searching against. In general you want to search against the smallest database that makes sense to get most meaningful e-values.

                            Comment


                            • #15
                              Originally posted by kmcarr View Post
                              In standard text based BLAST outputs like this each report starts with the "Query=" line and ends ends with the "Effective search space…" line.

                              The "No hits found" is referring to the Query immediately prior to "HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT" Look just above the "***** No hits found *****" line and you should see another line that starts with "Query=". That is the query (raw sequence) for which no hit was found.
                              Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              33 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              72 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              81 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X