SEQanswers

Old 10-13-2013, 10:36 AM   #1
milo0615
Member
 
Location: Walnut, California

Join Date: Dec 2012
Posts: 39
Default Blast Output Results Analysis

Hello,

I have successfully blasted 2500 merged COGS against 40 different assembly databases. However, I now have 40 large blast results which I need to analyze and select the assembly with the most hits. My questions are the following:

- What would be the best way or the best practice to analyze all of the blast results to check for the assembly with the most hits?

- Is there a free application that would help with the analysis?

Thank you in advance.
Old 10-13-2013, 11:28 AM   #2
atcghelix
Member
 
Location: CA

Join Date: Jul 2013
Posts: 74
Default

When you say you want the assembly with the most hits, do you mean the assembly with the fewest "no hits found" results (i.e. a perfect score is 2500, where every COG has at least one hit), or the assembly with the most hits in total (each of the 2500 COGs can have multiple hits, so a perfect score would be far more than 2500 total hits)?
Old 10-13-2013, 11:50 AM   #3
milo0615
Member
 
Location: Walnut, California

Join Date: Dec 2012
Posts: 39
Default

Quote:
Originally Posted by atcghelix View Post
When you say you want the assembly with the most hits, do you mean the assembly with the fewest "no hits found" results (i.e. a perfect score is 2500, where every COG has at least one hit), or the assembly with the most hits in total (each of the 2500 COGs can have multiple hits, so a perfect score would be far more than 2500 total hits)?
I would say the assembly that had the fewest "no hits found" results. Do you think that would be the better criterion?
Old 10-13-2013, 12:08 PM   #4
atcghelix
Member
 
Location: CA

Join Date: Jul 2013
Posts: 74
Default

I'm not sure--it sort of depends on what you want to know. Assemblies built with short kmer values will probably have more hits overall, but the hits will be shorter. I am often trying to find the assembly kmer value that gives the highest number of hits that fully span the length of the query sequence.

If you just want to see how many 'No hits found' there are, you can use:
grep -c 'No hits found' <filename>
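
With 40 reports, it may be easier to do the same count for every file at once and sort the results. Here is a minimal Python sketch of that idea, assuming the reports are in the default text format (the one that prints "No hits found"). The function names are just illustrative, and the script takes whatever report paths you pass on the command line:

```python
import sys
from pathlib import Path


def count_no_hits(report_text):
    """Count lines containing 'No hits found' (same idea as grep -c)."""
    return sum(1 for line in report_text.splitlines() if "No hits found" in line)


def rank_reports(reports):
    """Return (name, no_hit_count) pairs, fewest misses first.

    `reports` maps a report name to the full text of that BLAST report.
    """
    counts = {name: count_no_hits(text) for name, text in reports.items()}
    return sorted(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Usage (paths are whatever your 40 report files are called):
    #   python rank_no_hits.py report1.txt report2.txt ...
    texts = {p: Path(p).read_text(errors="replace") for p in sys.argv[1:]}
    for name, misses in rank_reports(texts):
        print(f"{misses}\t{name}")
```

The report at the top of the list is the one with the fewest queries that found nothing. Keep in mind this only tells you how many queries had no hit at all, not how good the hits are.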
Old 10-13-2013, 12:43 PM   #5
milo0615
Member
 
Location: Walnut, California

Join Date: Dec 2012
Posts: 39
Default

Quote:
Originally Posted by atcghelix View Post
I'm not sure--it sort of depends on what you want to know. Assemblies built with short kmer values will probably have more hits overall, but the hits will be shorter. I am often trying to find the assembly kmer value that gives the highest number of hits that fully span the length of the query sequence.

If you just want to see how many 'No hits found' there are, you can use:
grep -c 'No hits found' <filename>
So you pick the best kmer assembly based on the "highest number of hits"? I just want to know which kmer assembly is best when it is blasted against the COGs. Or should I pick the best hits based on the e-value? How do you pick your best kmer?
Old 10-13-2013, 01:04 PM   #6
atcghelix
Member
 
Location: CA

Join Date: Jul 2013
Posts: 74
Default

I'm not sure what the particulars of your experiment are. I'm often dealing with targeted sequencing, where we are trying to sequence a subset of a few thousand regions of a genome. For something like this, evaluating the best assembly is a little tricky. For each assembly kmer value, I blast my target regions against the assembly. I want to find the assembly that:

1) matches as many target regions as possible (i.e. lowest number of "No hits found" in the blast report)
2) maximizes the number of target regions that have contigs that match along their entire length (if a target region is 300 base pairs long, I want assemblies that make contigs where the alignment between the target and the contig is 300 base pairs long, or as close as possible).

This is more involved than counting with grep. It involves going through each blast query, noting how long the query sequence is, getting the alignment length of the longest hit, and comparing the two. I use Bio::SearchIO from the BioPerl package for this sort of thing.
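
For anyone who would rather not use BioPerl, the same comparison can be sketched in Python. This is only a rough illustration and not the script I actually use. It assumes you re-run BLAST+ with tabular output and these four columns, in this order: -outfmt "6 qseqid sseqid qlen length". The function names and the 0.9 cutoff are made up for the example.

```python
import csv
import io


def longest_alignment_fraction(tabular_text):
    """Map each query ID to (longest alignment length) / (query length).

    Expects tab-separated rows with columns: qseqid, sseqid, qlen, length.
    Alignment length includes gaps, so a fraction can slightly exceed 1.0.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, qlen, length = row[0], int(row[2]), int(row[3])
        fraction = length / qlen
        if fraction > best.get(qseqid, 0.0):
            best[qseqid] = fraction
    return best


def count_full_length(tabular_text, min_fraction=0.9):
    """Count queries whose longest hit covers at least `min_fraction` of the query."""
    fractions = longest_alignment_fraction(tabular_text)
    return sum(1 for f in fractions.values() if f >= min_fraction)
```

Run count_full_length on the tabular output for each assembly and compare the counts. Note that queries with no hits do not appear in tabular output at all, so you still need the total number of queries to work out how many were missed.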

I'm not sure what the standard is, or if there is an agreed-upon protocol for evaluating assemblies where the goals are more complex than maximizing N50. It's an active area of research for sure (e.g. http://www.biomedcentral.com/1471-2164/14/465). It may also be appropriate to merge assemblies from different kmer values and then re-evaluate--something else to look into.
Old 10-13-2013, 10:22 PM   #7
mike.t
Member
 
Location: Spain

Join Date: Mar 2010
Posts: 36
Default

This is a great opportunity for you to learn to program in Python or Perl!