Seqanswers Leaderboard Ad

**kmcarr** · 03-29-2011, 09:28 AM

S.

I haven't used it myself but Bio::Index::Fastq looks like it could be useful. Using indexed fastq files you don't need to keep the entirety of your data in memory.

My first try would be to index both the files and then create a list of ids with both fwd/rev reads. Then stream through the list and output new fwd/rev fastq files with just the paired reads, in the same order.

Update 2012-01-05:

To anyone reading this thread anew, don't bother trying to use Bio::Index::Fastq (at least as it exists on this date). It simply does not scale to the data volumes coming from Illumina instruments. If you attempt to use it your computer will literally catch on fire. OK, not really, but it will likely fill all its memory and swap and come to a grinding halt.

See my new message below.

**suryasaha** · 03-29-2011, 09:57 AM

Good suggestion! I have used other indexing modules from BioPerl. I am trying out a Perl hack right now but it that doesn't work, I will give Bio::Index::Fastq a shot. Thanks!

Best,
Surya

**greigite** · 03-29-2011, 11:01 AM

I wrote a perl script to do this using grep- it works by retrieving read names using the fact that every read name has a "@" before it. It works but is slow, even though it doesn't need to read all the files into memory. I'm sure someone could improve on it...

Code:

#!/usr/bin/env perl

# Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}

my (@readsone,@readstwo);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@" $ARGV[0]`;
my @readstwotmp = `grep "@" $ARGV[1]`;

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; #trim off read identifier "/1"
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; #trim off read identifier "/2"

foreach my $readone(@readsone){
  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
     my $lineone = `grep -n $readone $ARGV[0]`;
     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printone = `head -$linenum $ARGV[0] | tail -4`;

     my $linetwo = `grep -n $readone $ARGV[1]`;
     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

     print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}

**swbarnes2** · 03-29-2011, 11:26 AM

Originally posted by greigite View Post

I wrote a perl script to do this using grep- it works by retrieving read names using the fact that every read name has a "@" before it. It works but is slow, even though it doesn't need to read all the files into memory. I'm sure someone could improve on it...

Be very careful with that. In some quality scales, @ is a valid quality score, so you could match to that, instead of name, if its first in the string.

@NAME is probably a safer thing to grep for, the odds of your quality score having multiple letters in a row in common with your name are much lower.

**greigite** · 03-29-2011, 11:46 AM

Good point, swbarnes. I haven't had any problems occur yet, but better not to take chances. Here is a corrected version:

Code:

#!/usr/bin/env perl

# Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}

my (@readsone,@readstwo,@combined);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "@ILLUMINA" $ARGV[1]`;

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; 

foreach my $readone(@readsone){
  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
     my $lineone = `grep -n $readone $ARGV[0]`;
     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printone = `head -$linenum $ARGV[0] | tail -4`;

     my $linetwo = `grep -n $readone $ARGV[1]`;
     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

     print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}

**suryasaha** · 03-30-2011, 05:44 AM

Thank you for the replies, greigite and swbarnes. Just finished parsing the files.

**elisadouzi** · 03-31-2011, 03:16 PM

I have two pair-end fastq files. I timmed the 3' of the reads with low quality score, so the length of each reads of the pair is not same. But the order and the number of the reads are same. Will it effects bwa function ? I have some problem of sampe. I do not know wether it is this problem?

Originally posted by greigite View Post

Good point, swbarnes. I haven't had any problems occur yet, but better not to take chances. Here is a corrected version:

Code:

#!/usr/bin/env perl

# Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}

my (@readsone,@readstwo,@combined);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "@ILLUMINA" $ARGV[1]`;

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; 

foreach my $readone(@readsone){
  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
     my $lineone = `grep -n $readone $ARGV[0]`;
     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printone = `head -$linenum $ARGV[0] | tail -4`;

     my $linetwo = `grep -n $readone $ARGV[1]`;
     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

     print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}

**mgg** · 01-05-2012, 06:22 AM

suggestions

This was nice to find: I was stuck with this exact same issue. By way of putting something back in, here are some changes I made which speeded things up & simplified the code.

These changes work with data generated from an early Cassava pipeline in which each sequence entry has the 4-line format;
@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum
SEQUENCE
+same as @line
Qcall codes

They have not been tried with data from other pipelines.

I passed the machine ID code (@machine identifier:, from the start of each .fastq @line) as parameter 3, and assigned it to $machine_id. I extended the grep & used the --only-matching parameter to grab the entire sample identifier (but excluding the read identifier digit) for each read. This gave me arrays of read identifiers in the format:
@machine identifier:lane:tile:Xcoord:Ycoord:#index/

Code:

my @readsone = `grep -E $machine_id.*/ --only-matching $ARGV[0]`;
my @readstwo = `grep -E $machine_id.*/ --only-matching $ARGV[1]`;

This directly replaced this code ...

Code:

# build arrays of read names
my @readsonetmp = `grep "@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "@ILLUMINA" $ARGV[1]`;

... and by directly generating @readsone and @readstwo, also removed the need for this code.

Code:

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp;

Within the loop processing the arrays, I used the grep parameter -A3 to return an extra three lines to get the entire sequence entry. This also eliminated a few lines of code (#)

Code:

foreach my $readone(@readsone){
#  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
#     my $lineone = `grep -n $readone $ARGV[0]`;
#     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
#     $linenum = $linenum +3;
#     my @printone = `head -$linenum $ARGV[0] | tail -4`;
     my @printone =`grep -A3 $readone $ARGV[0]`;

#     my $linetwo = `grep -n $readone $ARGV[1]`;
#     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
#     $linenum = $linenum +3;
#     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;
      my @printtwo = `grep -A3 $readone $ARGV[1]`;

      print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}

M

original code:

Code:

#!/usr/bin/env perl

# Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}

my (@readsone,@readstwo,@combined);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "@ILLUMINA" $ARGV[1]`;

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; 

foreach my $readone(@readsone){
  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
     my $lineone = `grep -n $readone $ARGV[0]`;
     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printone = `head -$linenum $ARGV[0] | tail -4`;

     my $linetwo = `grep -n $readone $ARGV[1]`;
     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

     print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}

**kmcarr** · 01-05-2012, 07:09 AM

Hi all,

Since my advice upthread about using Bio::Index::Fastq was bad I thought I should offer the solution I worked out. This method still relies on indexing the Fastq files, using the cdbfastq:cdbyank utility for the job. This utility is very fast and efficient, even with large files. It can be downloaded here. In fact the attached script is pretty much just a wrapper for cdbfastq and a bunch of standard *nix utilities (cat, sort, comm). These will be far faster than any pure perl implementation.

The script takes as input two fastq files which are presumably paired fastq files after some QC pipeline which has resulted in different numbers/pair members in each. No particular read order is assumed for the input files (the IDs are sorted by the script).

Usage:

Code:

# pairFastq.pl <read1.fastq> <read2.fastq>

The output is three fastq files: Matched read pairs are output to new read1 and read2 files, unpaired reads are output to a singletons file.

I haven't bothered yet to set up command line parsing (beyond the input files) so if you would like the script to output progress messages edit line #14 to:

Code:

$my verbose = 1;

I hope you may find this useful.

Update:
I have just discovered that there is internal limit in cdbfasta of 4GB on the index file size making this method a non-starter for HiSeq sized output.

Attached Files

pairFastq.pl (4.6 KB, 149 views)

**dejavu2010** · 02-28-2012, 04:08 PM

thx

it works, but with a little bit pain that it only works for a <4gb.

**sghignone** · 05-09-2012, 08:26 AM

Hi everybody!
I'm facing the same problem on my processed datasets, and here is my bash scripting approach.

The following code is just a backbone, but single steps have been tested like working on a 10k reads subset.

Fastq read names should be in the format "@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum"

In the first step, grep uses "@HWI" to identify the read names: pattern should be customizable, so to avoid (as it has been suggested) inclusion of quality lines.

In the second step, sed removes f/r identifiers (\1,\2) [readnum] from reads names, -d option of uniq prints duplicate lines only, so only paired reads are stored in output list. Then the list is loaded in array and used to search reads in forward/reverse processed datasets and build the merged file.

Code:

#!/bin/bash
echo "Step #1 running...Estracting read names"
echo "Estracting forward read names"
grep "@HWI" basename.f.fastq > basename.list
echo "Estracting reverse read names"
grep "@HWI" basename.r.fastq >> basename.list

echo "Step #3 running...Estracting paired reads mames"
sed s/"\/[0-9]"//g basename.list | sort | uniq -d > basename.list.duplicated

echo "Step #3 running...Merging forward and reverse paired reads"
array=(`cat basename.list.duplicated`)
len=${#array[*]}

i=0
while [ $i -lt $len ]; do
	grep -w -A 3 "${array[$i]}" basename.f.fastq >> basename.merged.fastq
	grep -w -A 3 "${array[$i]}" basename.r.fastq >> basename.merged.fastq
	let i++
done

I'm posting the backbone because I'm looking for suggestions to enhance the script performance (sort of forking? threading?); I still have to try it on a real 40 milion reads data, and I don't know how it works on my 12Gb RAM machine.

TO DO, obviously, the recovery of orphan reads!

Any help is really appreciated!

Cheers

stefano

**dpryan** · 05-09-2012, 12:44 PM

The biggest choke point with that script is grepping through the entire fastq file for every read. I could foresee that getting painfully slow very quickly. Since you're sorting the read names to eliminate duplicates, you'll be processing reads in an ordered manner. You could cut down on your grep time by splitting the fastq files according to lane and tile, then only grepping through the appropriate one. I expect this would significantly increase performance. If you wrote this in something other than a scripting language, you could also read the individual fastq files into memory to decrease the disk i/o penalty (and open the possibility of decreasing the search space after each search cycle, depending on how things are stored in memory).

Originally posted by sghignone View Post

Hi everybody!
I'm facing the same problem on my processed datasets, and here is my bash scripting approach.

The following code is just a backbone, but single steps have been tested like working on a 10k reads subset.

Fastq read names should be in the format "@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum"

In the first step, grep uses "@HWI" to identify the read names: pattern should be customizable, so to avoid (as it has been suggested) inclusion of quality lines.

In the second step, sed removes f/r identifiers (\1,\2) [readnum] from reads names, -d option of uniq prints duplicate lines only, so only paired reads are stored in output list. Then the list is loaded in array and used to search reads in forward/reverse processed datasets and build the merged file.

Code:

#!/bin/bash
echo "Step #1 running...Estracting read names"
echo "Estracting forward read names"
grep "@HWI" basename.f.fastq > basename.list
echo "Estracting reverse read names"
grep "@HWI" basename.r.fastq >> basename.list

echo "Step #3 running...Estracting paired reads mames"
sed s/"\/[0-9]"//g basename.list | sort | uniq -d > basename.list.duplicated

echo "Step #3 running...Merging forward and reverse paired reads"
array=(`cat basename.list.duplicated`)
len=${#array[*]}

i=0
while [ $i -lt $len ]; do
	grep -w -A 3 "${array[$i]}" basename.f.fastq >> basename.merged.fastq
	grep -w -A 3 "${array[$i]}" basename.r.fastq >> basename.merged.fastq
	let i++
done

I'm posting the backbone because I'm looking for suggestions to enhance the script performance (sort of forking? threading?); I still have to try it on a real 40 milion reads data, and I don't know how it works on my 12Gb RAM machine.

TO DO, obviously, the recovery of orphan reads!

Any help is really appreciated!

Cheers

stefano

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Combining the paired reads from Illumina run

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News