Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • python script to bast NCBI's SRA database

    I have a working python script that uses NCBIWWW to blast fasta files against the nt database.

    I have been trying to find a way to do the same thing against NCBIs sequence read archive (SRA).

    Does anyone have any idea on how to do this? Can I do this with qblast in NCBIWWW?

  • #2
    I don't think the whole sra is available for online blast. Also, I don't really understand for what you need a python script for since there's the "-remote" flag in standalone blast..
    Last edited by rhinoceros; 11-15-2013, 12:04 AM.
    savetherhino.org

    Comment


    • #3
      They offer a blast on the website (categorized under specialized blast); the reason I made a script was when blasting against the SRA you select the individual data sets to blast against. I was hoping to write the script to automatically blast several nucleotide sequences against each dataset sequentially, while I can do this by hand with their website. The script would save me a lot of time.

      Comment


      • #4
        I see. The NCBI url api documentation could help you with your script.

        edit. I just highlighted some relevant parts in www sra-blast. Shouldn't be too difficult to alter your script to make it work? Just make a list of all SRRs and loop over that in the DATABASE part.

        If you manage to make it work, be sure to report back here

        Code:
        http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&USER_FORMAT_DEFAULTS=on&PAGE=MegaBlast&PROGRAM=blastn&
        QUERY=XM_003757795&GAPCOSTS=0%200&MATCH_SCORES=1,-2&BLAST_SPEC=SRA&DATABASE=SRR027154%20SRR027155%20SRR027156&
        BLAST_PROGRAMS=megaBlast&MAX_NUM_SEQ=1000&EXPECT=1e-5&WORD_SIZE=28&TEMPLATE_TYPE=0&TEMPLATE_LENGTH=0&FILTER=L&FILTER=m&
        WWW_BLAST_TYPE=newblast&SHOW_OVERVIEW=true&SHOW_LINKOUT=false&ALIGNMENT_VIEW=Pairwise&MASK_CHAR=2&MASK_COLOR=1&
        GET_SEQUENCE=false&NEW_VIEW=false&NCBI_GI=false&NUM_OVERVIEW=100&DESCRIPTIONS=1000&ALIGNMENTS=1000&FORMAT_OBJECT=Alignment&
        FORMAT_TYPE=HTML&SHOW_CDS_FEATURE=false&OLD_BLAST=false&CMD=REQUEST
        Added random linebreaks to make it more readable.

        Edit again.

        I've never used ncbi's url api but I've written a script for eutils. If these apis follow the same logic, then solving your problem would be quite trivial. If the parameters don't have to be in some specific order, you would in essence assign

        base="Blast.cgi?PAGE_TYPE=BlastSearch&USER_FORMAT_DEFAULTS=on&PAGE=MegaBlast&PROGRAM=blastn&.."
        query="$line"
        db="BLAST_SPEC=SRA&DATABASE=$srr"

        and then just assemble urls by looping over your queries and srr list. If qblast is anything like -remote flag in standalone blast, be sure to prepare for connection errors too..
        Last edited by rhinoceros; 11-15-2013, 12:24 PM.
        savetherhino.org

        Comment


        • #5
          So I was able to get it working using the url api and everything works very well. However I am limited to only using short sequences that don't exceed the URL size limit. I know there is an option in the url for QUERY_FILE = (which by the documentation works for fasta format files) I can't seem to figure out how to pass a fasta file through the url API. Once I get this working I'll be happy to post my script so that others that might be interested in this can use it.

          Comment


          • #6
            Just got file uploading working, disregard the last post

            Comment


            • #7
              hi ynnus,

              are you able to share the Python script with us?

              Comment


              • #8
                Yes I can, it's not well documented as that project kinda dissolved shortly after this was written. Hopefully it will help someone out.

                Code:
                #!/usr/bin/env python
                
                import os
                import urllib2
                import urllib
                import requests
                import time
                import sys
                
                from HTMLParser import HTMLParser
                
                fasta_file = raw_input('Enter Fasta File Location... ').strip()
                database_num = str(raw_input('Enter SRR Databases to search separated by a single space...'))
                
                if os.path.isfile(fasta_file):
                	file_path = os.path.dirname(fasta_file) + '/'
                	file_name = os.path.basename(fasta_file)
                else:
                    print "Invalid File Path, File Doesn't exist"
                    sys.exit()
                        
                
                url = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi')
                args = {'CMD' : 'Put','DATABASE':database_num,'PROGRAM':'blastn',
                			'BLAST_PROGRAMS':'discoMegablast','MATCH_SCORES':'1,-2',
                			'BLAST_SPEC':'SRA','FORMAT_TYPE':'XML','FILTER':'L',
                			'HITLIST_SIZE':'20000','GAPCOSTS':'5 2','MAX_NUM_SEQ':'20000'}
                
                req = requests.post(url,params=args,files={'QUERY': open(file_path + file_name, 'rb')})
                
                
                webpage = req.text
                
                RID = ''
                RTOE = 0
                status = False
                class NCBIblastStatusParser(HTMLParser):
                    def handle_comment(weird,info):
                    	global RID
                    	global RTOE
                    	global status
                    	infostr = str(info)
                    	if infostr.startswith('QBlastInfoBegin'):
                    		ridstr = infostr.find('RID')
                    		rtoestr = infostr.find('RTOE')
                    		rtoeend = infostr.find('QBlastInfoEnd')
                    		if ridstr != -1:
                    			if infostr[(rtoestr+7):(rtoeend-1)] != '':
                    				RID = infostr[(ridstr+6):(rtoestr-5)]
                    				RTOE = int(infostr[(rtoestr+7):(rtoeend-1)])
                    				print 'RID = ' + RID
                    				print 'Est Time = ' + str(RTOE)
                    				
                    			else:
                    				print "There was an error."
                    				sys.exit()
                    		
                
                parser = NCBIblastStatusParser()
                for line in webpage:
                	parser.feed(line)
                
                time.sleep(RTOE)
                check_count = 0
                sleep_time = 10
                while status == False:
                	getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                		'CMD=Get&'
                		'RID=' + RID )
                	datapage = requests.get(getURL)
                
                	for line in datapage:
                		if line.find('READY') != -1:
                			status = True
                	print "Still Blasting"
                	check_count += 1
                	if check_count == 10:
                		print "Adding 10 seconds to check interval"
                		check_count = 0
                		sleep_time += 10
                	time.sleep(sleep_time)
                
                getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                		'CMD=Get&'
                		'FORMAT_TYPE=XML&'
                		'MAX_NUM_SEQ=100000&'
                		'RESULTS_FILE=yes&'
                		'RID=' + RID )
                
                r = requests.get(getURL, stream=True)
                with open(RID+'.xml', 'wb') as f:
                    for chunk in r.iter_content(chunk_size=1024): 
                        if chunk: # filter out keep-alive new chunks
                            f.write(chunk)
                            f.flush()
                
                
                print "Compleated BLAST"

                Comment


                • #9
                  much obliged

                  Comment


                  • #10
                    Originally posted by ynnus View Post
                    Yes I can, it's not well documented as that project kinda dissolved shortly after this was written. Hopefully it will help someone out.

                    Code:
                    #!/usr/bin/env python
                    
                    import os
                    import urllib2
                    import urllib
                    import requests
                    import time
                    import sys
                    
                    from HTMLParser import HTMLParser
                    
                    fasta_file = raw_input('Enter Fasta File Location... ').strip()
                    database_num = str(raw_input('Enter SRR Databases to search separated by a single space...'))
                    
                    if os.path.isfile(fasta_file):
                    	file_path = os.path.dirname(fasta_file) + '/'
                    	file_name = os.path.basename(fasta_file)
                    else:
                        print "Invalid File Path, File Doesn't exist"
                        sys.exit()
                            
                    
                    url = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi')
                    args = {'CMD' : 'Put','DATABASE':database_num,'PROGRAM':'blastn',
                    			'BLAST_PROGRAMS':'discoMegablast','MATCH_SCORES':'1,-2',
                    			'BLAST_SPEC':'SRA','FORMAT_TYPE':'XML','FILTER':'L',
                    			'HITLIST_SIZE':'20000','GAPCOSTS':'5 2','MAX_NUM_SEQ':'20000'}
                    
                    req = requests.post(url,params=args,files={'QUERY': open(file_path + file_name, 'rb')})
                    
                    
                    webpage = req.text
                    
                    RID = ''
                    RTOE = 0
                    status = False
                    class NCBIblastStatusParser(HTMLParser):
                        def handle_comment(weird,info):
                        	global RID
                        	global RTOE
                        	global status
                        	infostr = str(info)
                        	if infostr.startswith('QBlastInfoBegin'):
                        		ridstr = infostr.find('RID')
                        		rtoestr = infostr.find('RTOE')
                        		rtoeend = infostr.find('QBlastInfoEnd')
                        		if ridstr != -1:
                        			if infostr[(rtoestr+7):(rtoeend-1)] != '':
                        				RID = infostr[(ridstr+6):(rtoestr-5)]
                        				RTOE = int(infostr[(rtoestr+7):(rtoeend-1)])
                        				print 'RID = ' + RID
                        				print 'Est Time = ' + str(RTOE)
                        				
                        			else:
                        				print "There was an error."
                        				sys.exit()
                        		
                    
                    parser = NCBIblastStatusParser()
                    for line in webpage:
                    	parser.feed(line)
                    
                    time.sleep(RTOE)
                    check_count = 0
                    sleep_time = 10
                    while status == False:
                    	getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                    		'CMD=Get&'
                    		'RID=' + RID )
                    	datapage = requests.get(getURL)
                    
                    	for line in datapage:
                    		if line.find('READY') != -1:
                    			status = True
                    	print "Still Blasting"
                    	check_count += 1
                    	if check_count == 10:
                    		print "Adding 10 seconds to check interval"
                    		check_count = 0
                    		sleep_time += 10
                    	time.sleep(sleep_time)
                    
                    getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                    		'CMD=Get&'
                    		'FORMAT_TYPE=XML&'
                    		'MAX_NUM_SEQ=100000&'
                    		'RESULTS_FILE=yes&'
                    		'RID=' + RID )
                    
                    r = requests.get(getURL, stream=True)
                    with open(RID+'.xml', 'wb') as f:
                        for chunk in r.iter_content(chunk_size=1024): 
                            if chunk: # filter out keep-alive new chunks
                                f.write(chunk)
                                f.flush()
                    
                    
                    print "Compleated BLAST"
                    This script is amazing! Is there any way to write a pipeline where it can serially check the SRA reads for a particular locus, that is check them individually? Also, the output is XML, is there anyway to see if there are hits or if there is no hits quickly? That's all that matters to me.

                    Comment


                    • #11
                      Originally posted by ynnus View Post
                      Yes I can, it's not well documented as that project kinda dissolved shortly after this was written. Hopefully it will help someone out.

                      Code:
                      #!/usr/bin/env python
                      
                      import os
                      import urllib2
                      import urllib
                      import requests
                      import time
                      import sys
                      
                      from HTMLParser import HTMLParser
                      
                      fasta_file = raw_input('Enter Fasta File Location... ').strip()
                      database_num = str(raw_input('Enter SRR Databases to search separated by a single space...'))
                      
                      if os.path.isfile(fasta_file):
                      	file_path = os.path.dirname(fasta_file) + '/'
                      	file_name = os.path.basename(fasta_file)
                      else:
                          print "Invalid File Path, File Doesn't exist"
                          sys.exit()
                              
                      
                      url = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi')
                      args = {'CMD' : 'Put','DATABASE':database_num,'PROGRAM':'blastn',
                      			'BLAST_PROGRAMS':'discoMegablast','MATCH_SCORES':'1,-2',
                      			'BLAST_SPEC':'SRA','FORMAT_TYPE':'XML','FILTER':'L',
                      			'HITLIST_SIZE':'20000','GAPCOSTS':'5 2','MAX_NUM_SEQ':'20000'}
                      
                      req = requests.post(url,params=args,files={'QUERY': open(file_path + file_name, 'rb')})
                      
                      
                      webpage = req.text
                      
                      RID = ''
                      RTOE = 0
                      status = False
                      class NCBIblastStatusParser(HTMLParser):
                          def handle_comment(weird,info):
                          	global RID
                          	global RTOE
                          	global status
                          	infostr = str(info)
                          	if infostr.startswith('QBlastInfoBegin'):
                          		ridstr = infostr.find('RID')
                          		rtoestr = infostr.find('RTOE')
                          		rtoeend = infostr.find('QBlastInfoEnd')
                          		if ridstr != -1:
                          			if infostr[(rtoestr+7):(rtoeend-1)] != '':
                          				RID = infostr[(ridstr+6):(rtoestr-5)]
                          				RTOE = int(infostr[(rtoestr+7):(rtoeend-1)])
                          				print 'RID = ' + RID
                          				print 'Est Time = ' + str(RTOE)
                          				
                          			else:
                          				print "There was an error."
                          				sys.exit()
                          		
                      
                      parser = NCBIblastStatusParser()
                      for line in webpage:
                      	parser.feed(line)
                      
                      time.sleep(RTOE)
                      check_count = 0
                      sleep_time = 10
                      while status == False:
                      	getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                      		'CMD=Get&'
                      		'RID=' + RID )
                      	datapage = requests.get(getURL)
                      
                      	for line in datapage:
                      		if line.find('READY') != -1:
                      			status = True
                      	print "Still Blasting"
                      	check_count += 1
                      	if check_count == 10:
                      		print "Adding 10 seconds to check interval"
                      		check_count = 0
                      		sleep_time += 10
                      	time.sleep(sleep_time)
                      
                      getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                      		'CMD=Get&'
                      		'FORMAT_TYPE=XML&'
                      		'MAX_NUM_SEQ=100000&'
                      		'RESULTS_FILE=yes&'
                      		'RID=' + RID )
                      
                      r = requests.get(getURL, stream=True)
                      with open(RID+'.xml', 'wb') as f:
                          for chunk in r.iter_content(chunk_size=1024): 
                              if chunk: # filter out keep-alive new chunks
                                  f.write(chunk)
                                  f.flush()
                      
                      
                      print "Compleated BLAST"
                      When I run this codes, python3 tell me an error:
                      Traceback (most recent call last):
                      File "C:\bioimfomation-program\blast\blast_sra.py", line 82, in <module>
                      if line.find('READY') != -1:
                      TypeError: argument should be integer or bytes-like object, not 'str'

                      Comment


                      • #12
                        Code:
                        import os
                        #import urllib2
                        import urllib3
                        import requests
                        import time
                        import sys
                        
                        from html.parser import HTMLParser
                        
                        fasta_file = "C:\\bioimfomation-program\\blast\\CBF-Sspon.05G0007920-1A.fasta".strip()
                        database_num = "SRR7771851"
                        
                        if os.path.isfile(fasta_file):
                            file_path = os.path.dirname(fasta_file) + '/'
                            file_name = os.path.basename(fasta_file)
                        else:
                            print("Invalid File Path, File Doesn't exist")
                            sys.exit()
                        
                        
                        url = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi')
                        args = {'CMD' : 'Put','DATABASE':database_num,'PROGRAM':'blastn',
                                                'BLAST_PROGRAMS':'discoMegablast','MATCH_SCORES':'1,-2',
                                                'BLAST_SPEC':'SRA','FORMAT_TYPE':'XML','FILTER':'L',
                                                'HITLIST_SIZE':'20000','GAPCOSTS':'5 2','MAX_NUM_SEQ':'20000'}
                        
                        req = requests.post(url,params=args,files={'QUERY': open(file_path + file_name, 'r')})
                        
                        
                        webpage = req.text
                        
                        RID = ''
                        RTOE = 0
                        status = False
                        class NCBIblastStatusParser(HTMLParser):
                            def handle_comment(weird,info):
                                global RID
                                global RTOE
                                global status
                                infostr = str(info)
                                if infostr.startswith('QBlastInfoBegin'):
                                    ridstr = infostr.find('RID')
                                    rtoestr = infostr.find('RTOE')
                                    rtoeend = infostr.find('QBlastInfoEnd')
                                    if ridstr != -1:
                                        if infostr[(rtoestr+7):(rtoeend-1)] != '':
                                            RID = infostr[(ridstr+6):(rtoestr-5)]
                                            RTOE = int(infostr[(rtoestr+7):(rtoeend-1)])
                                            print('RID = ' + RID)
                                            print('Est Time = ' + str(RTOE))
                                        else:
                                            print("There was an error.")
                                            sys.exit()
                        
                        parser = NCBIblastStatusParser()
                        for line in webpage:
                            parser.feed(line)
                        
                        time.sleep(RTOE)
                        check_count = 0
                        sleep_time = 10
                        while status == False:
                            getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                                    'CMD=Get&'
                                    'RID=' + RID )
                            datapage = requests.get(getURL)
                        
                            for line in datapage:
                                line = line.decode()
                                if line.find('READY') != -1:
                                    status = True
                            print("Still Blasting")
                            check_count += 1
                            if check_count == 10:
                                print("Adding 10 seconds to check interval")
                                check_count = 0
                                sleep_time += 10
                            time.sleep(sleep_time)
                        
                        getURL = ('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'
                                    'CMD=Get&'
                                    'FORMAT_TYPE=XML&'
                                    'MAX_NUM_SEQ=100000&'
                                    'RESULTS_FILE=yes&'
                                    'RID=' + RID )
                        
                        r = requests.get(getURL, stream=True)
                        with open(RID+'.xml', 'w') as f:
                            for chunk in r.iter_content(chunk_size=1024):
                                if chunk: # filter out keep-alive new chunks
                                    f.write(chunk)
                                    f.flush()
                        
                        
                        print("Compleated BLAST")
                        I tried to make some changes in the code for Python3. It seemed work, but finally met an error:
                        requests.exceptions.ConnectionError: HTTPConnectionPool(host='http', port=80): Max retries exceeded with url: //L0LO.COM (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000038B45C0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        9 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        50 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X