SEQanswers

Old 11-14-2013, 12:14 PM   #1
ynnus
Junior Member
 
Location: Wisconsin

Join Date: Nov 2013
Posts: 5
Default python script to blast NCBI's SRA database

I have a working python script that uses NCBIWWW to blast fasta files against the nt database.

I have been trying to find a way to do the same thing against NCBI's sequence read archive (SRA).

Does anyone have any idea on how to do this? Can I do this with qblast in NCBIWWW?
ynnus is offline   Reply With Quote
Old 11-14-2013, 11:01 PM   #2
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

I don't think the whole SRA is available for online BLAST. Also, I don't really understand why you need a Python script for this, since there's the "-remote" flag in standalone BLAST.
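
If you do want to drive it from Python anyway, shelling out to the standalone binary is probably the simplest route. A rough, untested sketch (the file names are placeholders and BLAST+ has to be installed):

Code:
# Rough sketch: run standalone blastn against NCBI's servers via -remote.
# Requires the BLAST+ binaries on your PATH; adjust -db/-query/-out as needed.
import subprocess

subprocess.check_call([
    'blastn',
    '-remote',                    # send the search to NCBI instead of a local database
    '-db', 'nt',
    '-query', 'my_seqs.fasta',    # placeholder input FASTA
    '-out', 'results.txt',        # placeholder output file
])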
__________________
savetherhino.org

Last edited by rhinoceros; 11-14-2013 at 11:04 PM.
rhinoceros is offline   Reply With Quote
Old 11-15-2013, 06:53 AM   #3
ynnus
Junior Member
 
Location: Wisconsin

Join Date: Nov 2013
Posts: 5
Default

They offer a BLAST on the website (categorized under "specialized BLAST"). The reason I want a script is that when blasting against the SRA you have to select the individual datasets to blast against. I was hoping to write a script that automatically blasts several nucleotide sequences against each dataset sequentially; I can do this by hand on their website, but a script would save me a lot of time.
ynnus is offline   Reply With Quote
Old 11-15-2013, 07:59 AM   #4
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

I see. The NCBI URL API documentation could help you with your script.

Edit: I just pulled out the relevant parts of a www SRA-BLAST search URL (below). It shouldn't be too difficult to alter your script to make it work: just make a list of all SRRs and loop over that in the DATABASE part.

If you manage to make it work, be sure to report back here.

Code:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&USER_FORMAT_DEFAULTS=on&PAGE=MegaBlast&PROGRAM=blastn&
QUERY=XM_003757795&GAPCOSTS=0%200&MATCH_SCORES=1,-2&BLAST_SPEC=SRA&DATABASE=SRR027154%20SRR027155%20SRR027156&
BLAST_PROGRAMS=megaBlast&MAX_NUM_SEQ=1000&EXPECT=1e-5&WORD_SIZE=28&TEMPLATE_TYPE=0&TEMPLATE_LENGTH=0&FILTER=L&FILTER=m&
WWW_BLAST_TYPE=newblast&SHOW_OVERVIEW=true&SHOW_LINKOUT=false&ALIGNMENT_VIEW=Pairwise&MASK_CHAR=2&MASK_COLOR=1&
GET_SEQUENCE=false&NEW_VIEW=false&NCBI_GI=false&NUM_OVERVIEW=100&DESCRIPTIONS=1000&ALIGNMENTS=1000&FORMAT_OBJECT=Alignment&
FORMAT_TYPE=HTML&SHOW_CDS_FEATURE=false&OLD_BLAST=false&CMD=REQUEST
(I added line breaks to make the URL more readable.)

Edit again:

I've never used NCBI's URL API, but I've written a script for E-utilities. If these APIs follow the same logic, then solving your problem should be fairly trivial. If the parameters don't have to be in some specific order, you would in essence assign

base="Blast.cgi?PAGE_TYPE=BlastSearch&USER_FORMAT_DEFAULTS=on&PAGE=MegaBlast&PROGRAM=blastn&.."
query="$line"
db="BLAST_SPEC=SRA&DATABASE=$srr"

and then just assemble the URLs by looping over your queries and SRR list. If qblast is anything like the -remote flag in standalone BLAST, be sure to prepare for connection errors too.
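
In Python that might look roughly like this (untested sketch; the accessions are just the ones from the example URL above, and you'd still need to submit each URL and poll for the result):

Code:
# Untested sketch: build one SRA-BLAST search URL per (query, SRR) pair.
base = ('http://blast.ncbi.nlm.nih.gov/Blast.cgi?'
        'PAGE_TYPE=BlastSearch&PAGE=MegaBlast&PROGRAM=blastn&'
        'BLAST_PROGRAMS=megaBlast&BLAST_SPEC=SRA&CMD=REQUEST')

queries = ['XM_003757795']                          # query accessions or sequences
srr_list = ['SRR027154', 'SRR027155', 'SRR027156']  # SRA runs to search

urls = []
for query in queries:
    for srr in srr_list:
        urls.append(base + '&QUERY=' + query + '&DATABASE=' + srr)

for url in urls:
    print(url)  # submit each one (urllib2/requests) and poll for the result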
__________________
savetherhino.org

Last edited by rhinoceros; 11-15-2013 at 11:24 AM.
rhinoceros is offline   Reply With Quote
Old 11-16-2013, 01:43 PM   #5
ynnus
Junior Member
 
Location: Wisconsin

Join Date: Nov 2013
Posts: 5
Default

So I was able to get it working using the URL API, and everything works very well. However, I am limited to short sequences that don't exceed the URL size limit. I know there is a QUERY_FILE= option in the URL (which, according to the documentation, works for FASTA-format files), but I can't figure out how to pass a FASTA file through the URL API. Once I get this working I'll be happy to post my script so that others who might be interested can use it.
ynnus is offline   Reply With Quote
Old 11-16-2013, 02:23 PM   #6
ynnus
Junior Member
 
Location: Wisconsin

Join Date: Nov 2013
Posts: 5
Default

Just got file uploading working; disregard the last post.
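
For the record, the trick was to POST the FASTA file itself as a multipart upload in the QUERY field rather than putting the sequence in the URL. A stripped-down sketch (placeholder file name and SRR accession; the full script is further down the thread):

Code:
# Stripped-down sketch of the file-upload approach (see the full script below).
import requests

url = 'http://www.ncbi.nlm.nih.gov/blast/Blast.cgi'
args = {'CMD': 'Put', 'PROGRAM': 'blastn', 'BLAST_SPEC': 'SRA',
        'DATABASE': 'SRR000001',    # placeholder SRR accession
        'FORMAT_TYPE': 'XML'}
resp = requests.post(url, params=args,
                     files={'QUERY': open('my_seqs.fasta', 'rb')})  # placeholder FASTA
print(resp.status_code)  # the RID and RTOE come back in the HTML response body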
ynnus is offline   Reply With Quote
Old 03-29-2014, 06:18 PM   #7
sciseim
Junior Member
 
Location: Seoul, South Korea

Join Date: Jan 2012
Posts: 3
Default

Hi ynnus,

are you able to share the Python script with us?
sciseim is offline   Reply With Quote
Old 03-30-2014, 10:13 AM   #8
ynnus
Junior Member
 
Location: Wisconsin

Join Date: Nov 2013
Posts: 5
Default

Yes I can. It's not well documented, as that project kind of dissolved shortly after this was written. Hopefully it will help someone out.

Code:
#!/usr/bin/env python
# Submits a FASTA file to NCBI's URL API as an SRA-BLAST search and saves the
# results as XML. Written for Python 2.

import os
import urllib2
import urllib
import requests
import time
import sys

from HTMLParser import HTMLParser

fasta_file = raw_input('Enter Fasta File Location... ').strip()
database_num = str(raw_input('Enter SRR Databases to search separated by a single space...'))

if os.path.isfile(fasta_file):
    file_path = os.path.dirname(fasta_file) + '/'
    file_name = os.path.basename(fasta_file)
else:
    print "Invalid File Path, File Doesn't exist"
    sys.exit()

# Submit the search (CMD=Put); the FASTA file is uploaded as the QUERY field.
url = 'http://www.ncbi.nlm.nih.gov/blast/Blast.cgi'
args = {'CMD': 'Put', 'DATABASE': database_num, 'PROGRAM': 'blastn',
        'BLAST_PROGRAMS': 'discoMegablast', 'MATCH_SCORES': '1,-2',
        'BLAST_SPEC': 'SRA', 'FORMAT_TYPE': 'XML', 'FILTER': 'L',
        'HITLIST_SIZE': '20000', 'GAPCOSTS': '5 2', 'MAX_NUM_SEQ': '20000'}

req = requests.post(url, params=args, files={'QUERY': open(file_path + file_name, 'rb')})

webpage = req.text

# NCBI returns the request ID (RID) and estimated run time (RTOE) inside an
# HTML comment (QBlastInfoBegin ... QBlastInfoEnd); pull them out with HTMLParser.
RID = ''
RTOE = 0
status = False

class NCBIblastStatusParser(HTMLParser):
    def handle_comment(weird, info):  # 'weird' plays the role of self
        global RID
        global RTOE
        global status
        infostr = str(info)
        if infostr.startswith('QBlastInfoBegin'):
            ridstr = infostr.find('RID')
            rtoestr = infostr.find('RTOE')
            rtoeend = infostr.find('QBlastInfoEnd')
            if ridstr != -1:
                if infostr[(rtoestr+7):(rtoeend-1)] != '':
                    RID = infostr[(ridstr+6):(rtoestr-5)]
                    RTOE = int(infostr[(rtoestr+7):(rtoeend-1)])
                    print 'RID = ' + RID
                    print 'Est Time = ' + str(RTOE)
                else:
                    print "There was an error."
                    sys.exit()

parser = NCBIblastStatusParser()
for line in webpage:
    parser.feed(line)

# Wait the estimated run time, then poll (CMD=Get) until the status is READY,
# backing off by 10 seconds after every 10 checks.
time.sleep(RTOE)
check_count = 0
sleep_time = 10
while status == False:
    getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
              'CMD=Get&'
              'RID=' + RID)
    datapage = requests.get(getURL)

    for line in datapage:
        if line.find('READY') != -1:
            status = True
    print "Still Blasting"
    check_count += 1
    if check_count == 10:
        print "Adding 10 seconds to check interval"
        check_count = 0
        sleep_time += 10
    time.sleep(sleep_time)

# Download the finished report as XML and stream it to <RID>.xml.
getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
          'CMD=Get&'
          'FORMAT_TYPE=XML&'
          'MAX_NUM_SEQ=100000&'
          'RESULTS_FILE=yes&'
          'RID=' + RID)

r = requests.get(getURL, stream=True)
with open(RID + '.xml', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()

print "Completed BLAST"
ynnus is offline   Reply With Quote
Old 03-30-2014, 05:41 PM   #9
sciseim
Junior Member
 
Location: Seoul, South Korea

Join Date: Jan 2012
Posts: 3
Default

much obliged
sciseim is offline   Reply With Quote
Old 04-14-2015, 01:28 PM   #10
AdrianP
Senior Member
 
Location: Ottawa

Join Date: Apr 2011
Posts: 130
Default

Quote:
Originally Posted by ynnus View Post
Yes I can. It's not well documented, as that project kind of dissolved shortly after this was written. Hopefully it will help someone out.

This script is amazing! Is there any way to write a pipeline where it serially checks the SRA datasets for a particular locus, that is, checks them individually? Also, the output is XML; is there any way to quickly see whether there are hits or no hits? That's all that matters to me.
AdrianP is offline   Reply With Quote
Old 07-17-2019, 08:42 PM   #11
ludong_0
Junior Member
 
Location: Beijing

Join Date: Jul 2019
Posts: 2
Default

Quote:
Originally Posted by ynnus View Post
Yes I can. It's not well documented, as that project kind of dissolved shortly after this was written. Hopefully it will help someone out.

When I run this code, Python 3 gives me an error:
Traceback (most recent call last):
File "C:\bioimfomation-program\blast\blast_sra.py", line 82, in <module>
if line.find('READY') != -1:
TypeError: argument should be integer or bytes-like object, not 'str'
ludong_0 is offline   Reply With Quote
Old 07-18-2019, 12:45 AM   #12
ludong_0
Junior Member
 
Location: Beijing

Join Date: Jul 2019
Posts: 2
Default

Code:
import os
#import urllib2
import urllib3
import requests
import time
import sys

from html.parser import HTMLParser

fasta_file = "C:\\bioimfomation-program\\blast\\CBF-Sspon.05G0007920-1A.fasta".strip()
database_num = "SRR7771851"

if os.path.isfile(fasta_file):
    file_path = os.path.dirname(fasta_file) + '/'
    file_name = os.path.basename(fasta_file)
else:
    print("Invalid File Path, File Doesn't exist")
    sys.exit()


url = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi')
args = {'CMD' : 'Put','DATABASE':database_num,'PROGRAM':'blastn',
                        'BLAST_PROGRAMS':'discoMegablast','MATCH_SCORES':'1,-2',
                        'BLAST_SPEC':'SRA','FORMAT_TYPE':'XML','FILTER':'L',
                        'HITLIST_SIZE':'20000','GAPCOSTS':'5 2','MAX_NUM_SEQ':'20000'}

req = requests.post(url,params=args,files={'QUERY': open(file_path + file_name, 'r')})


webpage = req.text

RID = ''
RTOE = 0
status = False
class NCBIblastStatusParser(HTMLParser):
    def handle_comment(weird,info):
        global RID
        global RTOE
        global status
        infostr = str(info)
        if infostr.startswith('QBlastInfoBegin'):
            ridstr = infostr.find('RID')
            rtoestr = infostr.find('RTOE')
            rtoeend = infostr.find('QBlastInfoEnd')
            if ridstr != -1:
                if infostr[(rtoestr+7):(rtoeend-1)] != '':
                    RID = infostr[(ridstr+6):(rtoestr-5)]
                    RTOE = int(infostr[(rtoestr+7):(rtoeend-1)])
                    print('RID = ' + RID)
                    print('Est Time = ' + str(RTOE))
                else:
                    print("There was an error.")
                    sys.exit()

parser = NCBIblastStatusParser()
for line in webpage:
    parser.feed(line)

time.sleep(RTOE)
check_count = 0
sleep_time = 10
while status == False:
    getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
            'CMD=Get&'
            'RID=' + RID )
    datapage = requests.get(getURL)

    for line in datapage:
        line = line.decode()
        if line.find('READY') != -1:
            status = True
    print("Still Blasting")
    check_count += 1
    if check_count == 10:
        print("Adding 10 seconds to check interval")
        check_count = 0
        sleep_time += 10
    time.sleep(sleep_time)

getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
            'CMD=Get&'
            'FORMAT_TYPE=XML&'
            'MAX_NUM_SEQ=100000&'
            'RESULTS_FILE=yes&'
            'RID=' + RID )

r = requests.get(getURL, stream=True)
with open(RID+'.xml', 'w') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()


print("Compleated BLAST")
I tried to make some changes to the code for Python 3. It seemed to work, but finally I hit an error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='http', port=80): Max retries exceeded with url: //L0LO.COM (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000038B45C0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))
ludong_0 is offline   Reply With Quote