  • Tips on using nested for statements in Python to maximize program efficiency

    I am developing a Python script that will parse data from two input files into new output files using two nested for loops. One of the input files is a list of gene locations on a chromosome, while the other is a list of SNP locations on that same chromosome. The data in both files is ordered by position on the chromosome. The output files contain a list of SNPs which are located within each gene on the chromosome being analyzed.


    The first input file is read line by line into Python using a for loop. Within this for loop, the second input file is read line by line. Once certain criteria are met between the first and second sets of input files, the second for loop is closed with a break statement. The next iteration of the first for loop then begins.

    The problem with this script is that for each iteration of the first for loop (i.e. each line of the first input file), the second for loop starts reading the second input file from the very first line. This wastes a lot of time, as the second input file contains millions of lines of data. Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?

    My script is below.

    import sys
    import fileinput
    import shlex

    nSNPsPerGene = open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

    for i in fileinput.input("Gene Coordinates_full list.csv"):
        gene=shlex.shlex(i,posix=True)
        gene.whitespace += ','
        gene.whitespace_split = True
        gene=list(gene)
        geneStart=int(gene[2])
        geneStop=int(gene[3])
        output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(geneStart))), 'a')
        nSNPs=0  # reset the SNP count for each gene

        # the SNP file is re-read from its first line for every gene
        for line in fileinput.FileInput("SNPs-1.txt"):
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart and SNPlocation <= geneStop:
                output.write(("%s\n")%(str(SNP)))
                nSNPs=nSNPs+1
            else:
                nSNPsPerGene.write(("%s\t%s\t%s\n")%(str(geneStart),str(nSNPs),str(geneStop-geneStart)))
                break
        output.close()

  • #2
    If you don't have memory considerations, why don't you read in both files first, map them in a way useful for you (in a dict, or OrderedDict) and then iterate once over the first map?
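
A minimal sketch of the in-memory idea, assuming the SNP positions fit in RAM. The function name `snps_per_gene`, the `(start, stop)` tuple shape, and the coordinates are made up for illustration. It uses the standard `bisect` module on the sorted SNP list to find each gene's slice, which is only one possible way to do the mapping:

```python
from bisect import bisect_left, bisect_right
from collections import OrderedDict


def snps_per_gene(genes, snp_positions):
    """Map each (start, stop) gene to the SNP positions inside it.

    genes: iterable of (start, stop) tuples (hypothetical input shape).
    snp_positions: list of integer positions, sorted ascending.
    """
    result = OrderedDict()
    for start, stop in genes:
        lo = bisect_left(snp_positions, start)   # index of first SNP >= start
        hi = bisect_right(snp_positions, stop)   # one past the last SNP <= stop
        result[(start, stop)] = snp_positions[lo:hi]
    return result


# Made-up coordinates, only to show the call.
genes = [(100, 200), (150, 300), (500, 600)]
snps = [50, 120, 180, 250, 300, 450, 700]
hits = snps_per_gene(genes, snps)
```

Each lookup is a binary search, so no gene ever rescans the SNP list from the start, and overlapping genes are handled without extra bookkeeping.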


    • #3
      Originally posted by gwilymh
      Does anyone know a technique to begin the next iteration of the first loop without beginning from the very first line of the second input file, i.e. a way to ‘save’ the iterator position on the second for loop?
      What about using the tell() and seek() file methods? They seem to do what you need. From the Python docs (http://docs.python.org/2/tutorial/inputoutput.html):

      f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the beginning of the file. To change the file object’s position, use f.seek(offset, from_what)
      (However, depending on exactly what you need to do, a better data structure such as an interval tree might scale better...)
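
A rough sketch of the tell()/seek() bookmark, under some assumptions: genes are sorted by start, the SNP file has one position in its first tab-separated column, and `count_snps_per_gene` is a made-up name. It reads with `readline()` rather than `for line in handle`, because `tell()` is not reliable while a file is being iterated (Python 3 text files raise an error). An `io.StringIO` stands in for the real SNP file:

```python
import io


def count_snps_per_gene(genes, snp_handle):
    """Count SNPs inside each gene, resuming the SNP scan via tell()/seek().

    genes: list of (start, stop) tuples sorted by start (hypothetical shape).
    snp_handle: seekable handle, one sorted SNP position per line.
    """
    counts = []
    resume_at = snp_handle.tell()        # bookmark: first line still of interest
    for start, stop in genes:
        snp_handle.seek(resume_at)       # jump to the bookmark, not to line 1
        n = 0
        while True:
            line = snp_handle.readline()
            if not line:                 # end of the SNP file
                break
            pos = int(line.split("\t")[0])
            if pos < start:
                # This SNP lies before the current gene, so it also lies
                # before every later gene: move the bookmark past it.
                resume_at = snp_handle.tell()
                continue
            if pos > stop:
                break
            n += 1
        counts.append(n)
    return counts


# Made-up data, only to show the call.
snp_file = io.StringIO("50\n120\n180\n250\n300\n450\n700\n")
counts = count_snps_per_gene([(100, 200), (150, 300), (500, 600)], snp_file)
```

The bookmark only moves past SNPs that precede the current gene's start, so SNPs shared by overlapping genes are still seen by the next gene.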

      Best
      Dario


      • #4
        Here's a tip, post in the correct forum!

        Moving to Bioinfx.


        • #5
          First, there are two ways to read lines from a file while remembering the position:
          Code:
              file=open('yourfile','r')
              file.readline()  # reads one line from the file; calling it a second time returns the next line
              next(file)  # uses the file's iterator; returns the next line, similar to readline()
          Second, you could use PyPy to accelerate your script (if your script contains a lot of 'for' and 'while' loops, PyPy could make it up to about 10 times faster). You could also use file.readlines(10000) to read lines in batches of roughly 10000 bytes at a time and save I/O time.
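
A small illustration of the first point: the file object keeps its position, so a later loop over the same handle continues where the previous read stopped. The `io.StringIO` object and its contents are stand-ins for `open('yourfile')`, used only so the snippet runs on its own (written for Python 3):

```python
import io

handle = io.StringIO("a\nb\nc\nd\n")   # stand-in for open('yourfile', 'r')

first = handle.readline()              # reads the first line
second = next(handle)                  # the iterator continues after that line

# A fresh "for" loop over the same handle does not restart from the top.
rest = [line for line in handle]
```

This is the property that lets an inner loop pick up where it left off, provided the handle is opened once outside the outer loop.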


          • #6
            It sounds like you can just open file 1 and file 2 once BEFORE starting the nested loops, but perhaps I've not understood your problem fully.

            Based on the filenames you might have one line per gene in both files, so a loop iterating over both files together could work. For example something like this:

            Code:
            import itertools
            handle1 = open(...)
            handle2 = open(...)
            # izip is Python 2; on Python 3 use the built-in zip() instead
            for line1, line2 in itertools.izip(handle1, handle2):
                pass  # assert line1 and line2 are for the same gene
            handle1.close()
            handle2.close()


            • #7
              Originally posted by maubp
              It sounds like you can just open file 1 and file 2 once BEFORE starting the nested loops, but perhaps I've not understood your problem fully.

              Based on the filenames you might have one line per gene in both files, so a loop iterating over both files together could work. For example something like this:

              Code:
              import itertools
              handle1 = open(...)
              handle2 = open(...)
              # izip is Python 2; on Python 3 use the built-in zip() instead
              for line1, line2 in itertools.izip(handle1, handle2):
                  pass  # assert line1 and line2 are for the same gene
              handle1.close()
              handle2.close()

              Even if you don't have one line per gene, you can still use the same trick of opening the handles once:

              Code:
              handle1 = open(...)
              handle2 = open(...)
              
              for gene in handle1:
                  # do stuff
                  for snp in handle2:
                      # do stuff
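                      if condition:
                          break
              You'd have to be careful not to lose the first SNP for each gene, of course: the SNP that triggers the break has already been read from handle2, so it will not come back when the loop for the next gene starts.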
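
One way to avoid losing that SNP is to hold it in a variable until the next gene has had a chance to claim it. This is only a sketch under stated assumptions: both files are sorted, genes do not overlap, gene lines look like "start<TAB>stop", and SNP lines hold one position each. The name `count_snps` and the sample data are made up, and `io.StringIO` stands in for the two open handles:

```python
import io


def count_snps(gene_handle, snp_handle):
    """Single pass over both sorted handles; assumes genes do not overlap."""
    counts = []
    pending = None                   # SNP already read but not yet assigned
    for gene_line in gene_handle:
        start, stop = (int(x) for x in gene_line.split("\t"))
        n = 0
        while True:
            if pending is None:
                line = snp_handle.readline()
                if not line:         # SNP file exhausted
                    break
                pending = int(line)
            if pending > stop:       # belongs to a later gene: keep it for then
                break
            if pending >= start:     # inside the current gene
                n += 1
            pending = None           # used up (inside or before this gene)
        counts.append((start, stop, n))
    return counts


# Made-up data: SNP 300 ends the scan for gene 1 and is the first SNP of gene 2.
gene_file = io.StringIO("100\t200\n300\t400\n")
snp_file = io.StringIO("50\n120\n300\n350\n500\n")
result = count_snps(gene_file, snp_file)
```

Each file is read exactly once. If genes can overlap, a single held-over SNP is not enough and a bookmark or in-memory approach would be needed instead.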

              As a hint, there are code tags that you can use to preserve the indentation in your post, which will make your Python code much easier to understand.


              • #8
                Use bedtools, or pybedtools from Python. If your data is in BED format, this will make your script much faster and much simpler.
