
  • Tips on using nested for statements in Python to maximize program efficiency

    I am developing a Python script that will parse data from two input files into new output files using two nested for loops. One of the input files is a list of gene locations on a chromosome, while the other is a list of SNP locations on that same chromosome. The data in both files is ordered by position on the chromosome. The output files contain a list of SNPs which are located within each gene on the chromosome being analyzed.


    The first input file is read line by line into Python using a for loop. Within this for loop, the second input file is read line by line. Once certain criteria are met between the first and second sets of input files, the second for loop is closed with a break statement. The next iteration of the first for loop then begins.

    The problem with this script is that for each iteration of the first for loop (i.e. each line of the first input file), the second for loop starts reading the second input file from the very first line. This wastes a lot of time, as the second input file contains millions of lines of data. Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?

    My script is below.

    import sys
    import fileinput
    import shlex

    nSNPsPerGene = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

    for i in fileinput.input("Gene Coordinates_full list.csv"):
        # split the comma-separated gene record into fields
        gene=shlex.shlex(i,posix=True)
        gene.whitespace += ','
        gene.whitespace_split = True
        gene=list(gene)
        geneStart=int(gene[2])
        geneStop=int(gene[3])
        output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(geneStart))), 'a')
        nSNPs=0  # reset the per-gene SNP count (it was never initialised)

        # the SNP file is reopened, and read from its first line, for every gene
        for line in fileinput.FileInput("SNPs-1.txt"):
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart and SNPlocation <= geneStop:
                output.write(("%s\n")%(str(SNP)))
                nSNPs=nSNPs+1
            else:
                # first SNP past the gene: write one summary line and stop
                nSNPsPerGene.write(("%s\t%s\t%s\n")%(str(geneStart),str(nSNPs),str(geneStop-geneStart)))
                break
        output.close()

  • #2
    If you don't have memory considerations, why don't you read in both files first, map them in a way useful for you (in a dict, or OrderedDict) and then iterate once over the first map?
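A minimal sketch of the in-memory idea, assuming the SNP positions fit in RAM and are already sorted. It swaps the suggested dict for a sorted list plus the standard `bisect` module, so each gene is answered with two binary searches instead of a scan. The function name `snps_in_gene` and the toy data are made up for illustration:

```python
import bisect

def snps_in_gene(snp_positions, gene_start, gene_stop):
    """Return the SNP positions p with gene_start <= p <= gene_stop.

    snp_positions must be a list sorted in ascending order.
    """
    lo = bisect.bisect_left(snp_positions, gene_start)   # first index with p >= gene_start
    hi = bisect.bisect_right(snp_positions, gene_stop)   # first index with p > gene_stop
    return snp_positions[lo:hi]

# Hypothetical toy data standing in for the two input files.
snp_positions = [5, 12, 18, 30, 31, 47]
genes = [(10, 20), (25, 31), (40, 45)]

for start, stop in genes:
    hits = snps_in_gene(snp_positions, start, stop)
    print(start, stop, len(hits), hits)
```

Because every lookup is independent, this also works when genes overlap or are not visited in order.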


    • #3
      Originally posted by gwilymh
      Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?
      What about using the tell() and seek() file methods? They seem to do what you need. From the Python docs, http://docs.python.org/2/tutorial/inputoutput.html:

      f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the beginning of the file. To change the file object’s position, use f.seek(offset, from_what)
      (However, depending exactly on what you need to do a better data structure like interval trees might scale better...)

      Best
      Dario
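A rough sketch of how tell() and seek() could be used here, assuming both inputs are sorted by position and the SNP position is the first tab-separated column. The function name `snps_per_gene` and the sample data are hypothetical. It reads with readline() rather than a for loop because, in Python 3, tell() raises an error on a text file while it is being iterated:

```python
import io

def snps_per_gene(gene_ranges, snp_handle):
    """Yield (start, stop, positions) per gene, reusing one open SNP handle.

    gene_ranges: (start, stop) pairs sorted by start.
    snp_handle: seekable text handle, one SNP per line, sorted by position.
    """
    resume_at = snp_handle.tell()
    for start, stop in gene_ranges:
        snp_handle.seek(resume_at)      # jump back to the saved position
        hits = []
        while True:
            line = snp_handle.readline()
            if not line:                # end of file
                break
            position = int(line.split("\t")[0])
            if position < start:
                # This SNP lies before the current gene, and so before every
                # later gene as well: the next gene can start after it.
                resume_at = snp_handle.tell()
                continue
            if position > stop:
                break
            hits.append(position)
        yield start, stop, hits

# Hypothetical in-memory stand-in for the SNP file.
snp_file = io.StringIO("5\tA\n12\tC\n18\tG\n30\tT\n31\tA\n47\tC\n")
for start, stop, hits in snps_per_gene([(10, 20), (15, 31), (40, 45)], snp_file):
    print(start, stop, hits)
```

The saved offset only moves past SNPs that lie before the current gene, so overlapping genes still see the SNPs they share.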


      • #4
        Here's a tip, post in the correct forum!

        Moving to Bioinfx.


        • #5
          First,
          there are two ways to read lines from a file while remembering the 'position':
          Code:
              file=open('yourfile','r')
              file.readline()  # read one line from the file; calling it again returns the next line
              next(file)  # advance the file iterator by one line, similar to readline() (file.next() is Python 2 only)
          Second,
          you could use PyPy to accelerate your script (if your script contains a lot of 'for' and 'while' loops, PyPy can make it up to about 10 times faster). You could also use file.readlines(10000) to read a batch of lines totalling roughly 10000 bytes at a time (the argument is a size hint, not a line count) to save I/O time.
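A small illustration of the first point, using an in-memory io.StringIO as a stand-in for a real file (an assumption made only so the snippet is self-contained). Each read picks up where the previous one stopped:

```python
import io

# Hypothetical four-line "file"
handle = io.StringIO("line 1\nline 2\nline 3\nline 4\n")

first = handle.readline()          # reads "line 1\n"
second = next(handle)              # continues with "line 2\n"
rest = [line for line in handle]   # a for loop resumes at "line 3\n"

print(repr(first), repr(second), rest)
```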


          • #6
            It sounds like you can just open file 1 and file 2 once BEFORE starting the nested loops, but perhaps I've not understood your problem fully.

            Based on the filenames you might have one line per gene in both files, so a loop iterating over both files together could work. For example something like this:

            Code:
            handle1 = open(...)
            handle2 = open(...)
            # Python 3's built-in zip is lazy; on Python 2 use itertools.izip instead
            for line1, line2 in zip(handle1, handle2):
                # assert line1 and line2 are for the same gene
                pass
            handle1.close()
            handle2.close()


            • #7
              Originally posted by maubp
              It sounds like you can just open file 1 and file 2 once BEFORE starting the nested loops, but perhaps I've not understood your problem fully.

              Based on the filenames you might have one line per gene in both files, so a loop iterating over both files together could work. For example something like this:

              Code:
              handle1 = open(...)
              handle2 = open(...)
              # Python 3's built-in zip is lazy; on Python 2 use itertools.izip instead
              for line1, line2 in zip(handle1, handle2):
                  # assert line1 and line2 are for the same gene
                  pass
              handle1.close()
              handle2.close()

              Even if you don't have one line per gene, you can still use the same trick of opening the handles once:

              Code:
              handle1 = open(...)
              handle2 = open(...)

              for gene in handle1:
                  # do stuff
                  for snp in handle2:
                      # do stuff
                      if condition:
                          break
              You'd have to be careful not to lose the first snp for each gene, of course.
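One way to avoid losing that SNP is to keep the line that triggered the break in a variable and test it first for the next gene. Below is a minimal sketch, assuming both inputs are sorted by position and the genes do not overlap. The function name `count_snps_single_pass` and the sample data are invented for illustration:

```python
def count_snps_single_pass(gene_ranges, snp_positions):
    """Count SNPs per gene in a single pass over both sorted inputs.

    gene_ranges: (start, stop) pairs, sorted and non-overlapping.
    snp_positions: any iterable of positions in ascending order.
    """
    snps = iter(snp_positions)
    pending = next(snps, None)   # SNP already read but not yet assigned to a gene
    results = []
    for start, stop in gene_ranges:
        # Discard SNPs that fall before this gene.
        while pending is not None and pending < start:
            pending = next(snps, None)
        count = 0
        # Count SNPs inside the gene, bounds included.
        while pending is not None and pending <= stop:
            count += 1
            pending = next(snps, None)
        # 'pending' now holds the first SNP past this gene; it is kept
        # for the next gene instead of being thrown away.
        results.append((start, stop, count))
    return results

# Hypothetical toy data
print(count_snps_single_pass([(10, 20), (25, 31), (40, 45)],
                             [5, 12, 18, 30, 31, 47]))
```

Each SNP is read exactly once, so the work is proportional to the number of genes plus the number of SNPs.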

              As a hint, there are code tags that you can use that will maintain the indentation of your post, which will make understanding your python code much easier.


              • #8
                Use bedtools, or pybedtools from within Python. If your data is in BED format, this will make your script much faster and much simpler.
