Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to iterate over reads in file with HTSeq?

    I am currently attempting to write a program in Python (v2.7.6) and am using HTSeq to read in fastq files and then run my code on it. However my issue is that my code runs on the first read from each input file (pair end data so 2 input files) and then stops.

    Does anyone know how to run code on read1 and iterate to read 2 etc?

    Code is :

    HTML Code:
    # Import functions
    import HTSeq
    import itertools
    import numpy
    from matplotlib import pyplot
    from itertools import groupby
    import operator
    
    #Open files
    output_file= open('output.fq', "w")
    Unmatched1= open ('Unmatched1.fq', "w")
    Unmatched2= open ('Unmatched2.fq', "w")
    
    # Counts lines in input file as 'count' for future use, if needed
    print "%d lines in your choosen file" % len(open("Real_test_1").readlines())
    count = '%d'  % len(open("Real_test_1").readlines())
    print "Reads below"
    print count
    
    # Reads in files for use by HTSeq
    fastq_file1 = HTSeq.FastqReader( "Real_test_1", "phred")
    fastq_file2 = HTSeq.FastqReader( "Real_test_2", "phred")
    
    
    #Get rev_comp and rename seq1
    for read in fastq_file2:
    	rc2o = read
    	for read in fastq_file2:                                          
    		rc2=read.get_reverse_complement()                             
    		print rc2                                                     
    
    		for read in fastq_file1:
    			rc1o = read
    		for read in fastq_file1:
    		    rc1 = read
    		    print rc1
    
    		#Reverse seq2 again so can be matched
    		rc2w = rc2[::-1]
    
    		rc1u = rc1
    
    
    
    		while len(rc1u) > 20:
    		    slide_merge(rc1u, rc2w)
    		    rc1u = rc1u[1:]
    
    		merging
    
    
    		max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest
    
    		len(highest)
    
    		remove = len(highest)
    
    		if remove > 8:
    			rc1r = rc1[:-remove]
    			rc3 = rc1r+rc2w
    			rc3.write_to_fastq_file(output_file)
    		else:
    			rc1o.write_to_fastq_file(Unmatched1)
    			rc2o.write_to_fastq_file(Unmatched2)

    I tried putting aload of the code in a for loop however all that happened was it ran 1000 times due to that being how many lines were in each input file but ran on the first line each time and iterate through the reads.

    Code:
    HTML Code:
    for read in fastq_file2:
    	rc2o = read
    	for read in fastq_file2:                                          
    		rc2=read.get_reverse_complement()                             
    		print rc2                                                     
    
    		for read in fastq_file1:
    			rc1o = read
    		for read in fastq_file1:
    		    rc1 = read
    		    print rc1
    
    		#Reverse seq2 again so can be matched
    		rc2w = rc2[::-1]
    
    		rc1u = rc1
    
    
    
    		while len(rc1u) > 20:
    		    slide_merge(rc1u, rc2w)
    		    rc1u = rc1u[1:]
    
    		merging
    
    
    		max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
    
    		highest
    
    		len(highest)
    
    		remove = len(highest)
    
    		if remove > 8:
    			rc1r = rc1[:-remove]
    			rc3 = rc1r+rc2w
    			rc3.write_to_fastq_file(output_file)
    		else:
    			rc1o.write_to_fastq_file(Unmatched1)
    			rc2o.write_to_fastq_file(Unmatched2)
    Any help would be greatly appreciated.

    Thanks,
    Tom

  • #2
    ouff,
    as a start, replace your way too many for loops with:

    for read1, read2 in itertools.izip(fastq_file1, fastq_file2):

    then do all your processing inside this one loop.
    Good luck,
    Wolfgang

    Comment


    • #3
      Can you give a rough idea of what your code is supposed to do? I'm used to using HTSeq to iterate over mapped BAM files or GTF files, not FASTQ files, so can't quite work out how HTSeq would be useful for you here. My guess based on a rough look at the code is a paired-end read merger, but the lines with only "empty" and "merging" (not commented) are very confusing -- I'm guessing they're for debugging purposes.

      I've got a few comments about the code syntax, which may or may not affect the outcome:
      • reading entire files into memory just to count lines [len(open("Real_test_1").readlines())] is a little inefficient, and will make your code much slower than it needs to be.
      • when using loops, never use the same variable for an inner loop as used for an outer loop [you have "read" used in a loop and its child, and then again in a couple more loops].
      Last edited by gringer; 06-07-2014, 04:32 AM.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      18 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      22 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      16 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      47 views
      0 likes
      Last Post seqadmin  
      Working...
      X