SEQanswers

01-22-2015, 08:09 AM   #1
vanillasky
Member
 
Location: Europe

Join Date: Mar 2014
Posts: 42
Blastp for identifying genes in contigs

I have used NCBI blastp as a first step to identify predicted ORFs in my assembled contigs. My question is how to filter these results based on % identity of the protein match and e-value. I picked a loose identity cut-off of 20%, but I'm not sure how one goes about selecting the best filtering options.
Any input would be appreciated. Thanks.
01-22-2015, 08:38 PM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

I would output the BLAST results in tabular format, and then filter using a simple script. Tabular output is quite easy to work with. You could even do this in Excel if you really prefer.
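A minimal sketch of such a filtering script, assuming the results were saved in BLAST+ 12-column tabular format (`-outfmt 6` with default columns). The function name `filter_hits`, the example rows, and the threshold values are hypothetical placeholders for illustration, not recommended cut-offs:

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(handle, min_identity, max_evalue):
    """Yield hits (as dicts) whose % identity and e-value pass both thresholds."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present with -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        if hit["pident"] >= min_identity and hit["evalue"] <= max_evalue:
            yield hit

# Made-up example rows standing in for a real results file.
rows = [
    ["orf1", "refA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-50", "300"],
    ["orf2", "refB", "18.5", "150", "120", "2", "1", "150", "5", "154", "1e-20", "90"],
    ["orf3", "refC", "45.0", "60", "33", "0", "1", "60", "10", "69", "0.2", "28"],
]
example = "\n".join("\t".join(r) for r in rows) + "\n"

# Placeholder thresholds: 20% identity (the value mentioned above) and 1e-5.
kept = list(filter_hits(io.StringIO(example), min_identity=20.0, max_evalue=1e-5))
for hit in kept:
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

For a real file, the `io.StringIO(example)` argument would be replaced by an open file handle, for example `open("results.tsv")` (a hypothetical file name).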
01-22-2015, 10:34 PM   #3
vanillasky
Member
 
Location: Europe

Join Date: Mar 2014
Posts: 42
Cut-off values

Thank you for your suggestion. However, my question is a more fundamental one. How does one decide on the cut-off value(s) with which to filter the blastp results? For example, should the filter be based on the greatest % identity for a protein match, or on an e-value?
01-23-2015, 12:37 AM   #4
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372

Check the discussion here.
01-23-2015, 05:31 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

I don't have access to the paper right now (I'm away from the office), but I think reading this would be useful:

Punta and Ofran (2008) The Rough Guide to In Silico Function Prediction,
or How To Use Sequence and Structure Information To Predict Protein
Function. PLoS Comput Biol 4(10): e1000160.
http://dx.doi.org/10.1371/journal.pcbi.1000160
01-23-2015, 06:05 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,992

@vanillasky: Many things informaticians do can be automated/done in high throughput mode but once you get to "annotation" it may be best to slow down and do things the right way (that is why "finishing" a genome takes so long).

You could choose a value ("n") for the % identity cut-off and get a good approximation of the identities of the proteins in your dataset. Afterwards, only a fraction of those identities (how large a fraction depends on how good your BLAST results were) may turn out to be correct on more rigorous examination (for an example, see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2673347/).

In general you can be reasonably sure about the validity of a protein BLAST hit if the e-value is < 10^-3 and sequence identity is >= 25%.
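As a rough sketch, that rule of thumb could be applied to tabular results like this, keeping only the highest-scoring hit per query. It assumes BLAST+ 12-column tabular output (`-outfmt 6`); the function name `best_hits` and the example rows are made up for illustration, and picking the best hit by bit score is one possible choice, not something prescribed above:

```python
def best_hits(lines, min_identity=25.0, max_evalue=1e-3):
    """Return {query: (subject, pident, evalue, bitscore)} for the best passing hit.

    A hit passes when e-value < max_evalue and % identity >= min_identity.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        f = line.split("\t")
        qseqid, sseqid = f[0], f[1]
        pident, evalue, bitscore = float(f[2]), float(f[10]), float(f[11])
        if evalue >= max_evalue or pident < min_identity:
            continue
        # Keep the hit with the highest bit score for each query.
        if qseqid not in best or bitscore > best[qseqid][3]:
            best[qseqid] = (sseqid, pident, evalue, bitscore)
    return best

# Made-up example rows (tab-separated, 12 columns each).
rows = [
    ["orf1", "protA", "62.5", "240", "90", "0", "1", "240", "1", "240", "3e-80", "250"],
    ["orf1", "protB", "31.0", "200", "138", "3", "5", "204", "9", "208", "2e-10", "60"],
    ["orf2", "protC", "22.0", "180", "140", "4", "1", "180", "2", "181", "1e-5", "45"],
    ["orf3", "protD", "40.0", "50", "30", "0", "1", "50", "7", "56", "0.5", "30"],
]
result = best_hits("\t".join(r) for r in rows)
for query, (subject, pident, evalue, bitscore) in result.items():
    print(query, subject, pident, evalue, bitscore)
```

Hits that pass such a filter are still only candidates; as noted above, careful annotation needs closer examination of each match.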

Tags
annotation, blastp, percent identity

All times are GMT -8.