SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
example for using Picard removing duplicate reads? fabrice Bioinformatics 9 10-18-2013 02:32 AM
Removing similar sequence reads loba17 Bioinformatics 4 10-17-2011 07:31 AM
Removing duplicate reads for tophat? hong_sunwoo RNA Sequencing 2 10-09-2010 12:46 AM
Removing duplicate reads from multigig .csfasta Bueller_007 Bioinformatics 7 06-26-2010 03:07 PM
Removing an unknown primer sequence from reads. mchotalia Bioinformatics 2 09-12-2009 10:57 PM

Reply
 
Thread Tools
Old 04-04-2012, 02:32 AM   #1
lukas1848
Member
 
Location: Germany

Join Date: Jun 2011
Posts: 54
Arrow Removing :Y: flagged reads

Hi,
what is the best (i.e. quickest) tool to remove :Y: flagged reads from illumina data?

So far I used a perl script which is not very memory effective and I'd like to speed up this step.
lukas1848 is offline   Reply With Quote
Old 04-04-2012, 05:13 AM   #2
Rocketknight
Member
 
Location: Ireland

Join Date: Sep 2011
Posts: 86
Default

You could do it in Python (code at the bottom). It takes one argument, the path to the fastq file to process.

Since python only stores one line in memory at once with this implementation, it should be pretty efficient. You can save time by piping the output of this file directly to the downstream program, using a line like this:
Code:
readmapper <(/path/to/removebadreads.py /path/to/reads.fastq)
This will call program 'readmapper' and pass it the output from the python script below as it runs, so you don't need to waste time and disc space making a huge output file. Modify as required for whatever program you want to run yourself.

Also, this program won't work with gzipped fastq files, but you can use the same piping concept to call gunzip and pass unzipped output straight to the script:
Code:
./removebadreads.py <(gunzip -c /path/to/reads.fastq.gz)
You can even pipe the gunzip output to removebadreads.py and then pipe the output from that to your read mapper, all at once with no temporary files or delays! Linux is fun!
Code:
readmapper <(/path/to/removebadreads.py <(gunzip -c /path/to/reads.fastq.gz))
Python code:
Code:
#!/usr/bin/python
import sys

try:
	f = open(sys.argv[1])
except:
	exit('Supply an input file!')

linecounter = 0
goodread = False
for line in f:
	linecounter = 1 if linecounter == 4 else linecounter+1
	if linecounter == 1:
		if ':N:' in line:
			goodread = True
			print line,
		else:
			goodread = False
	elif goodread:
		print line,
Rocketknight is offline   Reply With Quote
Old 04-04-2012, 10:30 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

I always used zgrep. It did put '--' between the entries, but you can get rid of that, and it didn't seem to interfere with most programs anyway.
swbarnes2 is offline   Reply With Quote
Old 04-04-2012, 10:41 AM   #4
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Here's a Perl script that's quick and efficient (courtesy of James Platt):

Code:
#!/usr/bin perl

use warnings;

# usage: perl Illumina_fastq_filter-JP.pl filename > outname
my $n = 0; #line counter

open (FILENAME, "$ARGV[0]") or die "Can't find $ARGV[0]\n";

while (<FILENAME>) {
	chomp;

	if ($_ =~ /^@\S+\s\d\:N/) {
		$n = 4;
	}

	if ($n > 0) {
		print "$_\n";
		$n--;
	}
}

close FILENAME;
HESmith is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:03 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO