Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
example for using Picard removing duplicate reads? fabrice Bioinformatics 9 10-18-2013 02:32 AM
Removing similar sequence reads loba17 Bioinformatics 4 10-17-2011 07:31 AM
Removing duplicate reads for tophat? hong_sunwoo RNA Sequencing 2 10-09-2010 12:46 AM
Removing duplicate reads from multigig .csfasta Bueller_007 Bioinformatics 7 06-26-2010 03:07 PM
Removing an unknown primer sequence from reads. mchotalia Bioinformatics 2 09-12-2009 10:57 PM

Thread Tools
Old 04-04-2012, 02:32 AM   #1
Location: Germany

Join Date: Jun 2011
Posts: 54
Arrow Removing :Y: flagged reads

what is the best (i.e. quickest) tool to remove :Y: flagged reads from illumina data?

So far I used a perl script which is not very memory effective and I'd like to speed up this step.
lukas1848 is offline   Reply With Quote
Old 04-04-2012, 05:13 AM   #2
Location: Ireland

Join Date: Sep 2011
Posts: 86

You could do it in Python (code at the bottom). It takes one argument, the path to the fastq file to process.

Since python only stores one line in memory at once with this implementation, it should be pretty efficient. You can save time by piping the output of this file directly to the downstream program, using a line like this:
readmapper <(/path/to/ /path/to/reads.fastq)
This will call program 'readmapper' and pass it the output from the python script below as it runs, so you don't need to waste time and disc space making a huge output file. Modify as required for whatever program you want to run yourself.

Also, this program won't work with gzipped fastq files, but you can use the same piping concept to call gunzip and pass unzipped output straight to the script:
./ <(gunzip -c /path/to/reads.fastq.gz)
You can even pipe the gunzip output to and then pipe the output from that to your read mapper, all at once with no temporary files or delays! Linux is fun!
readmapper <(/path/to/ <(gunzip -c /path/to/reads.fastq.gz))
Python code:
import sys

	f = open(sys.argv[1])
	exit('Supply an input file!')

linecounter = 0
goodread = False
for line in f:
	linecounter = 1 if linecounter == 4 else linecounter+1
	if linecounter == 1:
		if ':N:' in line:
			goodread = True
			print line,
			goodread = False
	elif goodread:
		print line,
Rocketknight is offline   Reply With Quote
Old 04-04-2012, 10:30 AM   #3
Senior Member
Location: San Diego

Join Date: May 2008
Posts: 912

I always used zgrep. It did put '--' between the entries, but you can get rid of that, and it didn't seem to interfere with most programs anyway.
swbarnes2 is offline   Reply With Quote
Old 04-04-2012, 10:41 AM   #4
Senior Member
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503

Here's a Perl script that's quick and efficient (courtesy of James Platt):

#!/usr/bin perl

use warnings;

# usage: perl filename > outname
my $n = 0; #line counter

open (FILENAME, "$ARGV[0]") or die "Can't find $ARGV[0]\n";

while (<FILENAME>) {

	if ($_ =~ /^@\S+\s\d\:N/) {
		$n = 4;

	if ($n > 0) {
		print "$_\n";

HESmith is offline   Reply With Quote

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 12:03 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO