SEQanswers

Old 03-29-2011, 08:03 AM   #1
suryasaha
Member
 
Location: New York

Join Date: Mar 2011
Posts: 27
Default Combining the paired reads from Illumina run

Hi,

I have two fastq files containing the forward (/1) and reverse (/2) reads of a paired-end run. The reads are not in the same order in the two files, some mates are absent, and each file is about 8 GB with roughly 30 million reads.

I am trying to pull out all the pairs for which both the forward and reverse read exist, but my BioPerl script runs out of memory. Are there any efficient C or C++ based tools for this, or any algorithm ideas that let me parse through the files without reading both into memory?

Thanks!

-S.
Old 03-29-2011, 10:28 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

S.

I haven't used it myself, but Bio::Index::Fastq looks like it could be useful. With indexed fastq files you don't need to keep all of your data in memory.

My first try would be to index both files and build a list of IDs that have both forward and reverse reads, then stream through that list and write new fwd/rev fastq files containing just the paired reads, in the same order.
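The same plan can also be sketched index-free with standard Unix tools. This is a toy illustration, not the Bio::Index::Fastq method: the file names r1.fastq/r2.fastq are hypothetical, and it assumes read IDs carry /1 and /2 suffixes.

```shell
# toy inputs: read B exists only in r1, read C only in r2
printf '@A/1\nACGT\n+\nIIII\n@B/1\nTTTT\n+\nIIII\n' > r1.fastq
printf '@A/2\nCCCC\n+\nIIII\n@C/2\nGGGG\n+\nIIII\n' > r2.fastq

# IDs (header line minus the /1 or /2 suffix), taken by position, sorted for comm
awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' r1.fastq | sort > r1.ids
awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' r2.fastq | sort > r2.ids
comm -12 r1.ids r2.ids > shared.ids

# stream each file once, keeping only records whose ID is in the shared set
for f in r1 r2; do
  awk 'NR == FNR { keep[$0] = 1; next }
       FNR % 4 == 1 { id = $0; sub(/\/[12]$/, "", id); p = (id in keep) }
       p' shared.ids "$f.fastq" > "$f.shared.fastq"
done
```

Note that this keeps each file's own record order; if the two inputs disagree on order, linearize the records (e.g. with `paste - - - -`) and sort by ID before writing the final pair files.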

Update 2012-01-05:

To anyone reading this thread anew, don't bother trying to use Bio::Index::Fastq (at least as it exists on this date). It simply does not scale to the data volumes coming from Illumina instruments. If you attempt to use it your computer will literally catch on fire. OK, not really, but it will likely fill all its memory and swap and come to a grinding halt.

See my new message below.

Last edited by kmcarr; 01-05-2012 at 06:47 AM. Reason: Bio::Index::Fastq will not work
Old 03-29-2011, 10:57 AM   #3
suryasaha
Member
 
Location: New York

Join Date: Mar 2011
Posts: 27
Default

Good suggestion! I have used other indexing modules from BioPerl. I am trying out a Perl hack right now, but if that doesn't work I will give Bio::Index::Fastq a shot. Thanks!

Best,
Surya
Old 03-29-2011, 12:01 PM   #4
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default

I wrote a Perl script to do this using grep; it retrieves read names using the fact that every read name starts with "@". It works but is slow, even though it doesn't need to read the files into memory. I'm sure someone could improve on it...


Code:
#!/usr/bin/env perl

# Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}

my (@readsone,@readstwo);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@" $ARGV[0]`;
my @readstwotmp = `grep "@" $ARGV[1]`;

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; #trim off read identifier "/1"
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; #trim off read identifier "/2"

foreach my $readone(@readsone){
  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
     my $lineone = `grep -n $readone $ARGV[0]`;
     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printone = `head -$linenum $ARGV[0] | tail -4`;

     my $linetwo = `grep -n $readone $ARGV[1]`;
     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

     print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}
Old 03-29-2011, 12:26 PM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by greigite View Post
I wrote a perl script to do this using grep- it works by retrieving read names using the fact that every read name has a "@" before it. It works but is slow, even though it doesn't need to read all the files into memory. I'm sure someone could improve on it...
Be very careful with that. In some quality scales @ is a valid quality score, so if it is the first character in the string you could match a quality line instead of a name.

@NAME is probably a safer thing to grep for; the odds of your quality string having multiple characters in a row in common with your read name are much lower.
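Another way to sidestep the ambiguity entirely: FASTQ records are exactly four lines, so headers can be selected by line position instead of by pattern. A toy demonstration (hypothetical file demo.fastq whose quality string happens to start with @):

```shell
# quality line "@III" starts with '@', so a naive grep over-counts
printf '@READ_1/1\nACGT\n+\n@III\n' > demo.fastq

grep -c '@' demo.fastq        # -> 2 (matches the quality line too)
awk 'NR % 4 == 1' demo.fastq  # -> @READ_1/1 (exactly the header line)
```

Position-based selection works regardless of the quality encoding or the instrument prefix.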
Old 03-29-2011, 12:46 PM   #6
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141
Default

Good point, swbarnes. I haven't had any problems yet, but better not to take chances. Here is a corrected version:

Code:
#!/usr/bin/env perl

# Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}

my (@readsone,@readstwo,@combined);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "@ILLUMINA" $ARGV[1]`;

my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; 

foreach my $readone(@readsone){
  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
     my $lineone = `grep -n $readone $ARGV[0]`;
     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printone = `head -$linenum $ARGV[0] | tail -4`;

     my $linetwo = `grep -n $readone $ARGV[1]`;
     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
     $linenum = $linenum +3;
     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

     print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}
Old 03-30-2011, 06:44 AM   #7
suryasaha
Member
 
Location: New York

Join Date: Mar 2011
Posts: 27
Default

Thank you for the replies, greigite and swbarnes. Just finished parsing the files.
Old 03-31-2011, 04:16 PM   #8
elisadouzi
Member
 
Location: US

Join Date: Mar 2011
Posts: 20
Default

I have two paired-end fastq files. I trimmed the low-quality 3' ends of the reads, so the two reads of a pair are no longer the same length, but the order and number of reads are the same. Will this affect bwa? I am having some problems with sampe and don't know whether this is the cause.


Quote:
Originally Posted by greigite View Post
Good point, swbarnes. I haven't had any problems occur yet, but better not to take chances. Here is a corrected version: [script quoted in full above]
Old 01-05-2012, 06:22 AM   #9
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default suggestions

This was nice to find: I was stuck with this exact same issue. By way of putting something back in, here are some changes I made which sped things up and simplified the code.

These changes work with data generated from an early CASAVA pipeline in which each sequence entry has the four-line format:
@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum
SEQUENCE
+same as @line
Qcall codes

They have not been tried with data from other pipelines.

I passed the machine ID code (@machine identifier:, from the start of each .fastq @line) as parameter 3 and assigned it to $machine_id. I extended the grep and used the --only-matching option to grab the entire read identifier (excluding the read-number digit) for each read. This gave me arrays of read identifiers in the format:
@machine identifier:lane:tile:Xcoord:Ycoord:#index/
Code:
my @readsone = `grep -E $machine_id.*/ --only-matching $ARGV[0]`;
my @readstwo = `grep -E $machine_id.*/ --only-matching $ARGV[1]`;
This directly replaced this code ...
Code:
# build arrays of read names
my @readsonetmp = `grep "@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "@ILLUMINA" $ARGV[1]`;
... and by directly generating @readsone and @readstwo, also removed the need for this code.
Code:
my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp;
Within the loop processing the arrays, I used grep's -A3 option to return three extra lines and so capture the entire sequence entry. This also eliminated a few lines of code (commented out with # below).

Code:
foreach my $readone(@readsone){
#  my (@printonetrim,@printtwotrim);
  if (grep ($_ eq $readone, @readstwo)) {
#     my $lineone = `grep -n $readone $ARGV[0]`;
#     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
#     $linenum = $linenum +3;
#     my @printone = `head -$linenum $ARGV[0] | tail -4`;
     my @printone =`grep -A3 $readone $ARGV[0]`;

#     my $linetwo = `grep -n $readone $ARGV[1]`;
#     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
#     $linenum = $linenum +3;
#     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;
      my @printtwo = `grep -A3 $readone $ARGV[1]`;

      print $readoneout @printone;
      print $readtwoout @printtwo;
  }
}

M


original code: greigite's corrected script, quoted in full above.
Old 01-05-2012, 07:09 AM   #10
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Hi all,

Since my advice upthread about using Bio::Index::Fastq was bad, I thought I should offer the solution I worked out. This method still relies on indexing the fastq files, using the cdbfasta/cdbyank utilities for the job. These are very fast and efficient, even with large files, and can be downloaded here. In fact the attached script is pretty much just a wrapper around cdbfasta/cdbyank and a bunch of standard *nix utilities (cat, sort, comm), which will be far faster than any pure Perl implementation.

The script takes as input two fastq files, presumably paired files that have been through some QC pipeline which has left different numbers of reads (and missing pair members) in each. No particular read order is assumed for the input files (the IDs are sorted by the script).

Usage:

Code:
# pairFastq.pl <read1.fastq> <read2.fastq>
The output is three fastq files: matched read pairs are written to new read1 and read2 files; unpaired reads are written to a singletons file.
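The same sort/comm idea can be sketched index-free with standard tools alone. A toy version (hypothetical fwd.fastq/rev.fastq names, /1 and /2 ID suffixes assumed) that emits the same three outputs, synchronized pair files plus singletons:

```shell
# toy inputs: the two files disagree on order; B and C are unpaired
printf '@B/1\nTTTT\n+\nIIII\n@A/1\nACGT\n+\nIIII\n' > fwd.fastq
printf '@A/2\nCCCC\n+\nIIII\n@C/2\nGGGG\n+\nIIII\n' > rev.fastq

# linearize each file: "pairID <TAB> four record fields", sorted by pair ID
linearize() {
  paste - - - - < "$1" |
    awk -F'\t' -v OFS='\t' '{ id = $1; sub(/\/[12]$/, "", id); print id, $0 }' |
    sort -k1,1
}
linearize fwd.fastq > fwd.tab
linearize rev.fastq > rev.tab
cut -f1 fwd.tab > fwd.ids
cut -f1 rev.tab > rev.ids

# shared IDs -> synchronized pair files; unique IDs -> singletons
tab=$(printf '\t')
comm -12 fwd.ids rev.ids > shared.ids
join -t "$tab" shared.ids fwd.tab | cut -f2- | tr '\t' '\n' > fwd.paired.fastq
join -t "$tab" shared.ids rev.tab | cut -f2- | tr '\t' '\n' > rev.paired.fastq
comm -23 fwd.ids rev.ids | join -t "$tab" - fwd.tab | cut -f2- | tr '\t' '\n' >  singles.fastq
comm -13 fwd.ids rev.ids | join -t "$tab" - rev.tab | cut -f2- | tr '\t' '\n' >> singles.fastq
```

Being index-free, this sidesteps index-size limits; the cost is a few full sorts, which GNU sort performs out of core, so memory stays modest even for large files.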

I haven't bothered yet to set up command line parsing (beyond the input files), so if you would like the script to output progress messages, edit line #14 to:

Code:
my $verbose = 1;
I hope you may find this useful.

Update:
I have just discovered that cdbfasta has an internal 4 GB limit on the index file size, making this method a non-starter for HiSeq-sized output.
Attached Files
File Type: pl pairFastq.pl (4.6 KB, 126 views)

Last edited by kmcarr; 02-24-2012 at 08:43 AM. Reason: Discovered limitation in cdbfasta
Old 02-28-2012, 04:08 PM   #11
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default thx

It works, but with the minor pain that it only handles files under 4 GB.
Old 05-09-2012, 09:26 AM   #12
sghignone
Junior Member
 
Location: Italy

Join Date: Nov 2008
Posts: 6
Default

Hi everybody!
I'm facing the same problem with my processed datasets, and here is my bash scripting approach.

The following code is just a backbone, but the individual steps have been tested on a 10k-read subset.

Fastq read names should be in the format "@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum"

In the first step, grep uses "@HWI" to identify the read names; the pattern should be customizable, to avoid (as suggested above) matching quality lines.

In the second step, sed removes the forward/reverse identifiers (/1, /2) from the read names, and uniq's -d option prints only duplicated lines, so only paired read names end up in the output list. The list is then loaded into an array and used to pull each read from the forward and reverse datasets and build the merged file.

Code:
#!/bin/bash
echo "Step #1 running... Extracting read names"
echo "Extracting forward read names"
grep "@HWI" basename.f.fastq > basename.list
echo "Extracting reverse read names"
grep "@HWI" basename.r.fastq >> basename.list

echo "Step #2 running... Extracting paired read names"
sed s/"\/[0-9]"//g basename.list | sort | uniq -d > basename.list.duplicated

echo "Step #3 running... Merging forward and reverse paired reads"
array=(`cat basename.list.duplicated`)
len=${#array[*]}

i=0
while [ $i -lt $len ]; do
	grep -w -A 3 "${array[$i]}" basename.f.fastq >> basename.merged.fastq
	grep -w -A 3 "${array[$i]}" basename.r.fastq >> basename.merged.fastq
	let i++
done
I'm posting just the backbone because I'm looking for suggestions to improve its performance (some sort of forking? threading?); I still have to try it on a real 40 million-read dataset, and I don't know how it will behave on my 12 GB RAM machine.

TO DO, obviously: recovery of the orphan reads!

Any help is really appreciated!

Cheers

stefano
Old 05-09-2012, 01:44 PM   #13
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The biggest choke point with that script is grepping through the entire fastq file for every read. I could foresee that getting painfully slow very quickly. Since you're sorting the read names to eliminate duplicates, you'll be processing reads in an ordered manner. You could cut down on your grep time by splitting the fastq files according to lane and tile, then only grepping through the appropriate one. I expect this would significantly increase performance. If you wrote this in something other than a scripting language, you could also read the individual fastq files into memory to decrease the disk i/o penalty (and open the possibility of decreasing the search space after each search cycle, depending on how things are stored in memory).
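That in-memory idea can be sketched with awk's built-in hash, replacing the per-read grep loop with one pass over each file. The file names mirror the script above and the /readnum ID suffix is assumed; the forward file's IDs, not its sequences, are held in the hash while the reverse file is streamed once against it.

```shell
# hypothetical toy inputs in the script's naming scheme
printf '@HWI:1:1:1:1#0/1\nACGT\n+\nIIII\n' > basename.f.fastq
printf '@HWI:1:1:1:1#0/2\nTGCA\n+\nIIII\n@HWI:1:1:1:2#0/2\nGGGG\n+\nIIII\n' > basename.r.fastq

# pass 1: hash forward IDs; pass 2: emit reverse records whose mate exists
awk 'NR == FNR { if (FNR % 4 == 1) { sub(/\/[0-9]$/, ""); ids[$0] = 1 }; next }
     FNR % 4 == 1 { id = $0; sub(/\/[0-9]$/, "", id); p = (id in ids) }
     p' basename.f.fastq basename.r.fastq > r.paired.fastq
```

Each file is read exactly once instead of once per read; for tens of millions of reads the ID hash is on the order of a few GB, which is the trade-off against the pure streaming approaches above.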

Quote:
Originally Posted by sghignone View Post
I'm facing the same problem on my processed datasets, and here is my bash scripting approach. [full post with script quoted above]

Tags
bioperl, solexa paired-end
