Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to iterate over reads in file with HTSeq?

    I am currently attempting to write a program in Python (v2.7.6) and am using HTSeq to read in fastq files and then run my code on it. However my issue is that my code runs on the first read from each input file (pair end data so 2 input files) and then stops.

    Does anyone know how to run code on read1 and iterate to read 2 etc?

    Code is :

    HTML Code:
    # Import functions
    import HTSeq
    import itertools
    import numpy
    from matplotlib import pyplot
    from itertools import groupby
    import operator
    
    #Open files
    output_file= open('output.fq', "w")
    Unmatched1= open ('Unmatched1.fq', "w")
    Unmatched2= open ('Unmatched2.fq', "w")
    
    # Counts lines in input file as 'count' for future use, if needed
    print "%d lines in your choosen file" % len(open("Real_test_1").readlines())
    count = '%d'  % len(open("Real_test_1").readlines())
    print "Reads below"
    print count
    
    # Reads in files for use by HTSeq
    fastq_file1 = HTSeq.FastqReader( "Real_test_1", "phred")
    fastq_file2 = HTSeq.FastqReader( "Real_test_2", "phred")
    
    
    #Get rev_comp and rename seq1
    for read in fastq_file2:
    	rc2o = read
    	for read in fastq_file2:                                          
    		rc2=read.get_reverse_complement()                             
    		print rc2                                                     
    
    		for read in fastq_file1:
    			rc1o = read
    		for read in fastq_file1:
    		    rc1 = read
    		    print rc1
    
    		#Reverse seq2 again so can be matched
    		rc2w = rc2[::-1]
    
    		rc1u = rc1
    
    
    
    		while len(rc1u) > 20:
    		    slide_merge(rc1u, rc2w)
    		    rc1u = rc1u[1:]
    
    		merging
    
    
    		max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest
    
    		len(highest)
    
    		remove = len(highest)
    
    		if remove > 8:
    			rc1r = rc1[:-remove]
    			rc3 = rc1r+rc2w
    			rc3.write_to_fastq_file(output_file)
    		else:
    			rc1o.write_to_fastq_file(Unmatched1)
    			rc2o.write_to_fastq_file(Unmatched2)

    I tried putting aload of the code in a for loop however all that happened was it ran 1000 times due to that being how many lines were in each input file but ran on the first line each time and iterate through the reads.

    Code:
    HTML Code:
    for read in fastq_file2:
    	rc2o = read
    	for read in fastq_file2:                                          
    		rc2=read.get_reverse_complement()                             
    		print rc2                                                     
    
    		for read in fastq_file1:
    			rc1o = read
    		for read in fastq_file1:
    		    rc1 = read
    		    print rc1
    
    		#Reverse seq2 again so can be matched
    		rc2w = rc2[::-1]
    
    		rc1u = rc1
    
    
    
    		while len(rc1u) > 20:
    		    slide_merge(rc1u, rc2w)
    		    rc1u = rc1u[1:]
    
    		merging
    
    
    		max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest
    
    		len(highest)
    
    		remove = len(highest)
    
    		if remove > 8:
    			rc1r = rc1[:-remove]
    			rc3 = rc1r+rc2w
    			rc3.write_to_fastq_file(output_file)
    		else:
    			rc1o.write_to_fastq_file(Unmatched1)
    			rc2o.write_to_fastq_file(Unmatched2)
    Any help would be greatly appreciated.

    Thanks,
    Tom

  • #2
    ouff,
    as a start, replace your way too many for loops with:

    for read1, read2 in itertools.izip(fastq_file1, fastq_file2):

    then do all your processing inside this one loop.
    Good luck,
    Wolfgang

    Comment


    • #3
      Can you give a rough idea of what your code is supposed to do? I'm used to using HTSeq to iterate over mapped BAM files or GTF files, not FASTQ files, so can't quite work out how HTSeq would be useful for you here. My guess based on a rough look at the code is a paired-end read merger, but the lines with only "empty" and "merging" (not commented) are very confusing -- I'm guessing they're for debugging purposes.

      I've got a few comments about the code syntax, which may or may not affect the outcome:
      • reading entire files into memory just to count lines [len(open("Real_test_1").readlines())] is a little inefficient, and will make your code much slower than it needs to be.
      • when using loops, never use the same variable for an inner loop as used for an outer loop [you have "read" used in a loop and its child, and then again in a couple more loops].
      Last edited by gringer; 06-07-2014, 04:32 AM.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin


        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
        Yesterday, 07:01 AM
      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      43 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      43 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      38 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      55 views
      0 likes
      Last Post seqadmin  
      Working...
      X