SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
About Drosophila melanogaster keliu.iluke Bioinformatics 1 07-18-2014 03:13 AM
Site with raw reads of Drosophila Melanogaster NGS_New_User Bioinformatics 2 01-21-2014 11:29 AM
The Drosophila melanogaster Genetic Reference Panel adaptivegenome Literature Watch 0 02-10-2012 09:24 AM
RNA-Seq: The developmental transcriptome of Drosophila melanogaster. Newsbot! Literature Watch 0 12-24-2010 02:13 AM
PubMed: Massively parallel resequencing of the isogenic Drosophila melanogaster strai Newsbot! Literature Watch 0 08-20-2009 05:00 AM

Reply
 
Thread Tools
Old 05-08-2015, 06:45 AM   #1
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default How to blastx on Drosophila melanogaster (taxid:7227) with raw data?

Here is some background. I am doing a preliminary study on some raw (cleaned) data. I want to find positive hits (I already used blastn to find negative hits).

As I understand it, blastx takes a nucleotide query (my raw data) and blasts it against a protein database.


My question is how can I make my protein database? Is there anyway I can get ncbi to give me a protein database based on drosphila melanogaster?

Here is what I think:

Quote:
blastx -query MYRAWCLEANEDASSEMBLY.fasta -db DBNAME -out YOURASSEMBLY_BLASTX2DBNAME -evalue 4E-5
Again, to repeat myself for clarity. I need to know what and how to obtain/make DBNAME in the query I wrote above.
Thank you so much for your patience in reading my question.

Disclaimer: I have read the ncbi cookbook and command api as found here. My more serious work will follow when I do an actual assembly. Again this is just preliminary study stuff. Thank you.

Last edited by hlyates; 05-08-2015 at 06:54 AM.
hlyates is offline   Reply With Quote
Old 05-08-2015, 07:15 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Answer below only relates to "how do I make protein blast database".

Go here: http://www.uniprot.org/uniprot/?quer...D+reviewed:yes

Click download button to download 3246 "reviewed" D. melanogaster proteins in fasta format. Build the blast database.

OR

ftp://ftp.ncbi.nlm.nih.gov/genomes/D.../RELEASE_5_48/

Get the *.faa files from the chromosome directories, cat the files into one and make blast database.

or

Get the protein sequence from: ftp://ftp.ensembl.org/pub/release-79....pep.all.fa.gz Build database.

or

You may also be able to subset the drosophila data using the blastdb_alias tool: http://www.ncbi.nlm.nih.gov/books/NBK279693/
GenoMax is offline   Reply With Quote
Old 05-08-2015, 07:20 AM   #3
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by GenoMax View Post
Answer below only relates to "how do I make protein blast database".

Go here: http://www.uniprot.org/uniprot/?quer...D+reviewed:yes

Click download button to download 3246 "reviewed" D. melanogaster proteins in fasta format. Build the blast database.

OR

ftp://ftp.ncbi.nlm.nih.gov/genomes/D.../RELEASE_5_48/

Get the *.faa files from the chromosome directories, cat the files into one and make blast database.

or

Get the protein sequence from: ftp://ftp.ensembl.org/pub/release-79....pep.all.fa.gz Build database.

or

You may also be able to subset the drosophila data using the blastdb_alias tool: http://www.ncbi.nlm.nih.gov/books/NBK279693/
In order to further learn, if I wanted to apply this to another species in the future, do all three have a searchable database? I assume so sir. Thank you for your assistance. Also, is there a way to do this with non-redundant proteins (nr)? That is, would it be possible to make something from All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects? I'm sorry if I am asking stupid question, but I learn best when talking to experts.

Last edited by hlyates; 05-08-2015 at 07:29 AM.
hlyates is offline   Reply With Quote
Old 05-08-2015, 07:36 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

No question is stupid. As you said one only learns by asking questions.

blastdb_alias tool should be able to do it from refseq_protein blast index by sub-setting (I think that would be the database to use).

Let me point out a subtle distinction.

Links above give you "known" proteins/transcriptome from Drosophila which is non-redundant information. If you were to download protein sequence file from taxonomy browser: http://www.ncbi.nlm.nih.gov/protein?term=txid7227[Organism] then you are likely to get redundant entries.
GenoMax is offline   Reply With Quote
Old 05-08-2015, 12:22 PM   #5
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by GenoMax View Post
No question is stupid. As you said one only learns by asking questions.

blastdb_alias tool should be able to do it from refseq_protein blast index by sub-setting (I think that would be the database to use).

Let me point out a subtle distinction.

Links above give you "known" proteins/transcriptome from Drosophila which is non-redundant information. If you were to download protein sequence file from taxonomy browser: http://www.ncbi.nlm.nih.gov/protein?term=txid7227[Organism] then you are likely to get redundant entries.

Hmmm. This is going in a more complicated direction than I originally thought. Which is okay, but just want to be sure we are on the right track.


I talked to my colleague is who isn't knowledgeable about how to do this, but does know what I need to do. He stated that I need to do positive filtering with blastx by doing a search on protein database. He guess I can download them by fruit fly (aka taxid?) proteins.

What would be the easiest way to do this I guess? I thought it would be the nr database blastx option that we see on the ncbi database as seen here, but only for the fruit fly.


I'll rephrase my ramblings more concisely with the following.
  1. 1. Do the filtering on the protein database for fruit fly (I have no idea how to do this)
  2. 2. Run makeblastdb on the filtered.fasta file?
    Code:
    makeblastdb filteredproteinfruitfly.fasta -dbtype nucl -hash_index -out myfiltered_db
hlyates is offline   Reply With Quote
Old 05-08-2015, 02:46 PM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Quote:
Originally Posted by hlyates View Post
I'll rephrase my ramblings more concisely with the following.
  1. Do the filtering on the protein database for fruit fly (I have no idea how to do this)
This is becoming a case of blind leading the blind (since I don't exactly know what you are trying to achieve).

I am interpreting this to mean that if your sequence has a hit in the Drosophila protein database (positive filtering, wording from your post) you want to keep it. Is that correct?

If so this is straightforward. If you want *every* Drosophila melanogaster protein then use the tax browser download (link posted before) or you could use nr (but you may not be able to filter on tax id since NCBI does not build the blast indexes to include that info. I remember going over this a few weeks ago with either you or someone else on the forum).

Quote:
Originally Posted by hlyates View Post
  • Run makeblastdb on the filtered.fasta file?
    Code:
    makeblastdb filteredproteinfruitfly.fasta -dbtype nucl -hash_index -out myfiltered_db
Is this the set with hits to Drosophila? Then do what?

Last edited by GenoMax; 05-08-2015 at 03:19 PM.
GenoMax is offline   Reply With Quote
Old 05-08-2015, 02:57 PM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

If all you are interested in is sequences that "hit" Drosophila then I would suggest using BBSplit. Even if it is the other way around (i.e. sequences that are not Drosophila) that should still work.

What is the size distribution of your input fasta?
GenoMax is offline   Reply With Quote
Old 05-10-2015, 06:09 AM   #8
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by GenoMax View Post
This is becoming a case of blind leading the blind (since I don't exactly know what you are trying to achieve).

I am interpreting this to mean that if your sequence has a hit in the Drosophila protein database (positive filtering, wording from your post) you want to keep it. Is that correct?
That is correct.


Quote:
If so this is straightforward. If you want *every* Drosophila melanogaster protein then use the tax browser download (link posted before) or you could use nr (but you may not be able to filter on tax id since NCBI does not build the blast indexes to include that info. I remember going over this a few weeks ago with either you or someone else on the forum).
This I do understand how to do, I think. I merely download it as a .fasta file, yes? You did teach me that trick and it was pretty cool.

Quote:
Is this the set with hits to Drosophila? Then do what?
Right, that is my question. Once I download the file. Do I use the makedb command first on the downloaded file from the browser? I intend to run blastx on the db file I create. Based on the ncbi documentation, this is the approach I think we must take.

Again, thanks geno max.
hlyates is offline   Reply With Quote
Old 05-10-2015, 07:27 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Quote:
Originally Posted by hlyates View Post
Right, that is my question. Once I download the file. Do I use the makedb command first on the downloaded file from the browser? I intend to run blastx on the db file I create. Based on the ncbi documentation, this is the approach I think we must take.

Again, thanks geno max.
Order of operations:

1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
2. Make blast database using the fasta file.
3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.
GenoMax is offline   Reply With Quote
Old 05-10-2015, 06:03 PM   #10
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by GenoMax View Post
Order of operations:

1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
2. Make blast database using the fasta file.
3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.
I follow perfectly well until step #4. Why is this step necessary? Since we are blasting against all fruit fly proteins (step #1-2), then doesn't blastx give us the subset of proteins that contains sequences from our list?

Or am I misunderstanding and faFilter will read my output and only report the significant results?

In the meantime, I'll read up on faFilter and maybe can answer my own question, but super appreciate the opportunity to converse with a professional. I'm a one dog show at the moment, so to speak.
hlyates is offline   Reply With Quote
Old 05-11-2015, 01:14 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Step 3 will report sequences from your set that hit a Drosophila protein. Some sequences may hit more than one protein record (and vice versa) since you are starting with *every* Drosophila protein based on the tax ID.

I included #4 on assumption that you will want to divide your own squence file into two pools. One that is represented in #3 and other that does not. faFilter is an easy utility to do that.
GenoMax is offline   Reply With Quote
Old 05-11-2015, 06:21 AM   #12
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by GenoMax View Post
Step 3 will report sequences from your set that hit a Drosophila protein. Some sequences may hit more than one protein record (and vice versa) since you are starting with *every* Drosophila protein based on the tax ID.

I included #4 on assumption that you will want to divide your own squence file into two pools. One that is represented in #3 and other that does not. faFilter is an easy utility to do that.
I have a few wrap up questions Geno Max. Then I will go on my own as much as possible. Your expert advice has already optimized my approach so much. I doubt I can thank you enough.

1. I think your quote makes sense. Basically, the faFilter will divide sequences that got a hit and others that did not. Yes?

2. I have one last question (with a followup) this is related to blastn output. Rather than make a another question, I want to ask you it here. Last night my blastn output finally finished. I used a cut off of 1E-42. I am seeing records like this:

Code:
***** No hits found *****



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT

Length=101
                                                                      Score                                                    E
Sequences producing significant alignments:                          (Bits)  Val                                               ue

gb|AF479582.1|  Boophilus microplus paramyosin mRNA, complete cds       182   6e                                               -43


>gb|AF479582.1| Boophilus microplus paramyosin mRNA, complete cds
Length=2922

 Score =   182 bits (98),  Expect = 6e-43
 Identities = 100/101 (99%), Gaps = 0/101 (0%)
 Strand=Plus/Minus

Query  1     GTTCGCGTTGGAAGCGGCGCACGCGGGTCAAGTTCTGCTGGCTCAGACCCTCCTGTTCGT  60
             |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct  2512  GTTCGCGTTGGAAGCGGCGCACGCGGGTCAGGTTCTGCTGGCTCAGACCCTCCTGTTCGT  2453

Query  61    TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  101
             |||||||||||||||||||||||||||||||||||||||||
Sbjct  2452  TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  2412



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5762:2190 1:N:0:CGATGT

Length=101
Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?

3. As an expert, what would you suggest I use to parse the blastn output blast.txt? I did not use tabular output for blastn. I did some googling and there is bioperl and biopython, but would feel comfortable asking an expert such as yourself if there is better one out there? I just want to obtain my sequence, the species it hits with, and e-value. Or, I can write my own python script, but afraid I would be reinventing the wheel?


After this, I plan on reading up on BLOSUM62 again and other things to deepen my understanding. Again, I am very humbled by your help. You are helping me gain an understanding of were my skill set and knowledge should be. I need more experience too. It is my dream to become better.
hlyates is offline   Reply With Quote
Old 05-11-2015, 06:28 AM   #13
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by hlyates View Post
Code:
***** No hits found *****



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT

Length=101
                                                                      Score                                                    E
Sequences producing significant alignments:                          (Bits)  Val                                               ue

gb|AF479582.1|  Boophilus microplus paramyosin mRNA, complete cds       182   6e                                               -43


>gb|AF479582.1| Boophilus microplus paramyosin mRNA, complete cds
Length=2922

 Score =   182 bits (98),  Expect = 6e-43
 Identities = 100/101 (99%), Gaps = 0/101 (0%)
 Strand=Plus/Minus

Query  1     GTTCGCGTTGGAAGCGGCGCACGCGGGTCAAGTTCTGCTGGCTCAGACCCTCCTGTTCGT  60
             |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct  2512  GTTCGCGTTGGAAGCGGCGCACGCGGGTCAGGTTCTGCTGGCTCAGACCCTCCTGTTCGT  2453

Query  61    TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  101
             |||||||||||||||||||||||||||||||||||||||||
Sbjct  2452  TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  2412



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5762:2190 1:N:0:CGATGT

Length=101
Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?
In standard text based BLAST outputs like this each report starts with the "Query=" line and ends ends with the "Effective search space…" line.

The "No hits found" is referring to the Query immediately prior to "HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT" Look just above the "***** No hits found *****" line and you should see another line that starts with "Query=". That is the query (raw sequence) for which no hit was found.
kmcarr is offline   Reply With Quote
Old 05-11-2015, 06:51 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Quote:
Originally Posted by hlyates View Post
I think your quote makes sense. Basically, the faFilter will divide sequences that got a hit and others that did not. Yes?
That is correct.

Quote:
Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?
@kmcarr already addressed this question. See note about e-value below.

Quote:
3. As an expert, what would you suggest I use to parse the blastn output blast.txt? I did not use tabular output for blastn. I did some googling and there is bioperl and biopython, but would feel comfortable asking an expert such as yourself if there is better one out there? I just want to obtain my sequence, the species it hits with, and e-value. Or, I can write my own python script, but afraid I would be reinventing the wheel?
If you know python then just use whatever code/module that suits your needs. A google search should bring up plenty of options.

By now, hopefully you have read up a bit on the e-value. This comes from Karlin-Altschul statistics (http://www.ncbi.nlm.nih.gov/BLAST/tu...ltschul-1.html) and ultimately depends on the size of the database you are searching against. In general you want to search against the smallest database that makes sense to get most meaningful e-values.
GenoMax is offline   Reply With Quote
Old 05-11-2015, 09:14 AM   #15
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by kmcarr View Post
In standard text based BLAST outputs like this each report starts with the "Query=" line and ends ends with the "Effective search space…" line.

The "No hits found" is referring to the Query immediately prior to "HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT" Look just above the "***** No hits found *****" line and you should see another line that starts with "Query=". That is the query (raw sequence) for which no hit was found.
Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.
hlyates is offline   Reply With Quote
Old 05-11-2015, 09:50 AM   #16
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

If you run "blastx -help" on the command line you will get all options for blastx. One of the sections is for output formats. Default format is "0".

Code:
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 
    12 = JSON Seqalign output 
   
   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
               qgi means Query GI
              qacc means Query accesion
           qaccver means Query accesion.version
              qlen means Query sequence length
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
           gapopen means Number of gap openings
              gaps means Total number of gaps
              ppos means Percentage of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
        sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                         (in alphabetical order) 
            stitle means Subject Title
        salltitles means All Subject Title(s), separated by a '<>'
           sstrand means Subject Strand
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
When using a multi-fasta input file, each sequence will produce an output block that will start with "Query=" line and end with the "Effective search space…" line.
GenoMax is offline   Reply With Quote
Old 05-11-2015, 12:07 PM   #17
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by hlyates View Post
Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.
It has been so long ago I don't recall if I ever did read any formal documentation on the Pairwise (default output 0) format or just figured it out through trial and error.

As to parsing these plain text reports I have in the past used both BioPerl Bio::SearchIO and hand rolled code; pluses and minus in each. But as a general recommendation I would say that if you foresee needing to regularly parse BLAST reports I would avoid the default "Pairwise" output format altogether. The problem is that the format wasn't really designed for automated parsing and so parsing code is easily broken. If you plan to do a lot of parsing then the two better choices are tabular or XML. If your needs are simple (e.g. hit ids, evalues, start and end locations) then tabular is the way to go. Output is small and is very simple to parse. If you have more complex needs (e.g. capturing query/target alignments) then have your BLAST job output XML and parse it using the available modules from BioPerl, BioPython, etc. Since XML is such a structured format automated parsing is more robust. Also, since the XML format retains all of the information present in the Pairwise format it is possible to convert the interesting bits of the XML output into human readable form (again using the 'Bio' modules).
kmcarr is offline   Reply With Quote
Old 05-11-2015, 12:13 PM   #18
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Thank you Geno Max and kmcarr. I am thinking I may run my job again and use tabular/xml output. I can then more easily apply a script to it. If I understand you both, it seems biopython can parse tabular quite easily. I might go that route because I just need basic information such as:

hit id
e-value
input sequence (the input sequence that alignment with something in nt database)
target sequence organism id and name

You pros are great and if I knew you in person I would buy you both some drinks.
hlyates is offline   Reply With Quote
Old 05-11-2015, 01:50 PM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Provided you have access to a cluster and if you are going to do this over then I suggest you break-up your original file into multiple smaller ones and run the blast jobs in parallel. It would speed things up significantly.

Last edited by GenoMax; 05-11-2015 at 02:08 PM.
GenoMax is offline   Reply With Quote
Old 05-12-2015, 08:52 AM   #20
hlyates
Member
 
Location: Manhattan, Kansas

Join Date: Mar 2015
Posts: 29
Default

Quote:
Originally Posted by GenoMax View Post
Order of operations:

1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
2. Make blast database using the fasta file.
3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.
Since I am downloading a fasta file of proteins, when I make the database should the -dbtype option be:
  1. 1. nucl? (I think this is nucleotide)
  2. 2. prot (this is for proteins and hence what I should choose)?

I feel stupid for not having thought of that question in advance. Only ran into it as I actually started reading the
Code:
makeblastdb -help
docs.
hlyates is offline   Reply With Quote
Reply

Tags
blastx, noob, terminal, unix

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:24 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO