#1 |
Member
Location: New York
Join Date: Mar 2011
Posts: 27
Hi,

I have two fastq files with the forward (/1) and reverse (/2) paired reads. The reads are not in the same order in the two files, some pairs are absent/missing, and the files are 8 GB each with about 30 million reads apiece. I am trying to pull out all the reads for which both the forward and reverse mate exist, and I am running out of memory using a BioPerl script. Are there any efficient C- or C++-based tools for doing this? Or any algorithm ideas where I don't need to read both files into memory and can just parse through them? Thanks!

-S.
#2 |
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,178
S.,

I haven't used it myself but Bio::Index::Fastq looks like it could be useful. With indexed fastq files you don't need to keep the entirety of your data in memory. My first try would be to index both files and then create a list of IDs with both fwd/rev reads, then stream through the list and output new fwd/rev fastq files, in the same order, containing just the paired reads.

Update 2012-01-05: To anyone reading this thread anew, don't bother trying to use Bio::Index::Fastq (at least as it exists on this date). It simply does not scale to the data volumes coming from Illumina instruments. If you attempt to use it your computer will literally catch on fire. OK, not really, but it will likely fill all its memory and swap and come to a grinding halt. See my new message below.

Last edited by kmcarr; 01-05-2012 at 06:47 AM. Reason: Bio::Index::Fastq will not work
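The index-then-intersect plan described here can be sketched outside BioPerl as well. The following Python sketch is not from the thread; the function names and the /1-/2 name suffix are assumptions. It keeps only one byte offset per read ID in memory, never a whole record, then seeks back to copy the shared pairs out in a common order:

```python
# Sketch of the index-and-intersect approach: store a byte offset per
# read ID, intersect the two ID sets, then seek back and copy records.
def index_offsets(path):
    """Map read ID (minus the /1 or /2 suffix) -> byte offset of its record."""
    offsets = {}
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            header = fh.readline()
            if not header:
                break
            # "@name/1" -> b"name": drop "@" and the trailing /1 or /2
            offsets[header.split()[0][1:].rsplit(b"/", 1)[0]] = pos
            for _ in range(3):   # skip sequence, "+", and quality lines
                fh.readline()
    return offsets

def write_pairs(fwd, rev, out_fwd, out_rev):
    """Write only the reads present in both files, in the same order."""
    idx1, idx2 = index_offsets(fwd), index_offsets(rev)
    shared = sorted(idx1.keys() & idx2.keys())
    with open(fwd, "rb") as f1, open(rev, "rb") as f2, \
         open(out_fwd, "wb") as o1, open(out_rev, "wb") as o2:
        for rid in shared:
            for src, dst, idx in ((f1, o1, idx1), (f2, o2, idx2)):
                src.seek(idx[rid])
                for _ in range(4):   # copy the whole 4-line record
                    dst.write(src.readline())
```

For 30 million reads this still holds ~30M short keys per file in memory, but never a full record, which is often the difference between fitting in RAM and not.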
#3 |
Member
Location: New York
Join Date: Mar 2011
Posts: 27
Good suggestion! I have used other indexing modules from BioPerl. I am trying out a Perl hack right now, but if that doesn't work I will give Bio::Index::Fastq a shot. Thanks!

Best,
Surya
#4 |
Senior Member
Location: Cambridge, MA
Join Date: Mar 2009
Posts: 141
I wrote a perl script to do this using grep; it works by retrieving read names, using the fact that every read name has a "@" before it. It works but is slow, even though it doesn't need to read the whole files into memory. I'm sure someone could improve on it...
Code:
#!/usr/bin/env perl
# Program to compare two quality filtered fastq files from a PE run and
# output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read, containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {
    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";
}

my (@readsone, @readstwo);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names
my @readsonetmp = `grep "@" $ARGV[0]`;
my @readstwotmp = `grep "@" $ARGV[1]`;

# trim off the read identifiers "/1" and "/2"
my @readsone = map { chomp $_; $_ =~ s/'//g; my ($trim) = $_ =~ /(.*)\/[0-9]/; $trim; } @readsonetmp;
my @readstwo = map { chomp $_; $_ =~ s/'//g; my ($trim) = $_ =~ /(.*)\/[0-9]/; $trim; } @readstwotmp;

foreach my $readone (@readsone) {
    if (grep ($_ eq $readone, @readstwo)) {
        my $lineone = `grep -n $readone $ARGV[0]`;
        my ($linenum) = $lineone =~ /([0-9]{1,}):.*/;
        $linenum = $linenum + 3;
        my @printone = `head -$linenum $ARGV[0] | tail -4`;

        my $linetwo = `grep -n $readone $ARGV[1]`;
        my ($linenum) = $linetwo =~ /([0-9]{1,}):.*/;
        $linenum = $linenum + 3;
        my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

        print $readoneout @printone;
        print $readtwoout @printtwo;
    }
}
#5 | |
Senior Member
Location: San Diego
Join Date: May 2008
Posts: 912
@NAME is probably a safer thing to grep for; the odds of your quality score having multiple letters in a row in common with your name are much lower.
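The reasoning here is easy to demonstrate: "@" (ASCII 64) is itself a legal quality character (Q31 in Sanger/Illumina 1.8+ Phred+33 encoding), so matching a bare "@" can catch quality lines as well as name lines. A small illustration with a made-up record:

```python
# "@" (ASCII 64) is a valid Phred+33 quality character, so matching a
# bare "@" catches quality lines too; matching the name prefix does not.
record = "@HWI-EAS001:1:1:1:1#0/1\nACGT\n+\n@@B@\n"   # made-up read
lines = record.splitlines()

naive = [l for l in lines if l.startswith("@")]       # like grep "@"
safer = [l for l in lines if l.startswith("@HWI")]    # like grep "@HWI"
```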
#6 |
Senior Member
Location: Cambridge, MA
Join Date: Mar 2009
Posts: 141
Good point, swbarnes. I haven't had any problems occur yet, but better not to take chances. Here is a corrected version:
Code:
#!/usr/bin/env perl
# Program to compare two quality filtered fastq files from a PE run and
# output new files containing only shared sequences
# Input is quality filtered files for reads 1 and 2
# Output is two new fastq files, one for each read, containing only shared reads
# Reads will be output in the same order in each file
# greigite 3.2011

if (@ARGV != 2) {
    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";
}

my (@readsone, @readstwo, @combined);
my $filenameone = $ARGV[0] . ".shared";
my $filenametwo = $ARGV[1] . ".shared";
#my $logfile = "unshared_reads.log";

open my $readoneout, ">>$filenameone";
open my $readtwoout, ">>$filenametwo";
#open my $logfh, ">>$logfile";

# build arrays of read names (the "@" is escaped so Perl does not
# interpolate it as an array inside the backticks)
my @readsonetmp = `grep "\@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "\@ILLUMINA" $ARGV[1]`;

# trim off the read identifiers "/1" and "/2"
my @readsone = map { chomp $_; $_ =~ s/'//g; my ($trim) = $_ =~ /(.*)\/[0-9]/; $trim; } @readsonetmp;
my @readstwo = map { chomp $_; $_ =~ s/'//g; my ($trim) = $_ =~ /(.*)\/[0-9]/; $trim; } @readstwotmp;

foreach my $readone (@readsone) {
    if (grep ($_ eq $readone, @readstwo)) {
        my $lineone = `grep -n $readone $ARGV[0]`;
        my ($linenum) = $lineone =~ /([0-9]{1,}):.*/;
        $linenum = $linenum + 3;
        my @printone = `head -$linenum $ARGV[0] | tail -4`;

        my $linetwo = `grep -n $readone $ARGV[1]`;
        my ($linenum) = $linetwo =~ /([0-9]{1,}):.*/;
        $linenum = $linenum + 3;
        my @printtwo = `head -$linenum $ARGV[1] | tail -4`;

        print $readoneout @printone;
        print $readtwoout @printtwo;
    }
}
#7 |
Member
Location: New York
Join Date: Mar 2011
Posts: 27
Thank you for the replies, greigite and swbarnes. Just finished parsing the files.
#8 | |
Member
Location: US
Join Date: Mar 2011
Posts: 20
I have two paired-end fastq files. I trimmed the 3' ends of the reads with low quality scores, so the two reads of a pair are no longer the same length, but the order and the number of the reads are the same. Will this affect bwa? I am having a problem with sampe and do not know whether this is the cause.
#9 |
Member
Location: London, UK
Join Date: Nov 2011
Posts: 12
This was nice to find: I was stuck with this exact same issue. By way of putting something back in, here are some changes I made which sped things up and simplified the code.

These changes work with data generated from an early CASAVA pipeline, in which each sequence entry has the four-line format:

Code:
@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum
SEQUENCE
+same as @line
Qcall codes

They have not been tried with data from other pipelines. I passed the machine ID code (@machine identifier:, from the start of each .fastq @line) as parameter 3 and assigned it to $machine_id. I extended the grep and used the --only-matching option to grab the entire read identifier (excluding the read-number digit) for each read. This gave me arrays of read identifiers in the format @machine identifier:lane:tile:Xcoord:Ycoord:#index/:

Code:
my @readsone = `grep -E $machine_id.*/ --only-matching $ARGV[0]`;
my @readstwo = `grep -E $machine_id.*/ --only-matching $ARGV[1]`;

These two lines replace the original name-building and trimming blocks:

Code:
# build arrays of read names
my @readsonetmp = `grep "\@ILLUMINA" $ARGV[0]`;
my @readstwotmp = `grep "\@ILLUMINA" $ARGV[1]`;

my @readsone = map { chomp $_; $_ =~ s/'//g; my ($trim) = $_ =~ /(.*)\/[0-9]/; $trim; } @readsonetmp;
my @readstwo = map { chomp $_; $_ =~ s/'//g; my ($trim) = $_ =~ /(.*)\/[0-9]/; $trim; } @readstwotmp;

In the main loop, grep -A3 pulls each four-line record directly, replacing the grep -n / head / tail combination (the superseded lines are left commented out):

Code:
foreach my $readone (@readsone) {
    if (grep ($_ eq $readone, @readstwo)) {
        # my $lineone = `grep -n $readone $ARGV[0]`;
        # my ($linenum) = $lineone =~ /([0-9]{1,}):.*/;
        # $linenum = $linenum + 3;
        # my @printone = `head -$linenum $ARGV[0] | tail -4`;
        my @printone = `grep -A3 $readone $ARGV[0]`;

        # my $linetwo = `grep -n $readone $ARGV[1]`;
        # my ($linenum) = $linetwo =~ /([0-9]{1,}):.*/;
        # $linenum = $linenum + 3;
        # my @printtwo = `head -$linenum $ARGV[1] | tail -4`;
        my @printtwo = `grep -A3 $readone $ARGV[1]`;

        print $readoneout @printone;
        print $readtwoout @printtwo;
    }
}

greigite's original script is quoted in full in post #6 above.
#10 |
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,178
Hi all,
Since my advice upthread about using Bio::Index::Fastq was bad, I thought I should offer the solution I worked out. This method still relies on indexing the fastq files, using the cdbfasta/cdbyank utilities for the job. These tools are very fast and efficient, even with large files. They can be downloaded here. In fact the attached script is pretty much just a wrapper around cdbfasta/cdbyank and a bunch of standard *nix utilities (cat, sort, comm); these will be far faster than any pure Perl implementation.

The script takes as input two fastq files, presumably paired files which, after some QC pipeline, have ended up with different numbers of pair members in each. No particular read order is assumed for the input files (the IDs are sorted by the script). Usage:

Code:
# pairFastq.pl <read1.fastq> <read2.fastq>

I haven't bothered yet to set up command line parsing (beyond the input files), so if you would like the script to output progress messages edit line #14 to:

Code:
my $verbose = 1;

Update: I have just discovered that cdbfasta has an internal 4 GB limit on the index file size, making this method a non-starter for HiSeq-sized output.

Last edited by kmcarr; 02-24-2012 at 08:43 AM. Reason: Discovered limitation in cdbfasta
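For readers who cannot use the attachment, the sort/comm stage of this kind of pipeline (finding the IDs present in both sorted lists in one streaming pass) can be sketched as a hypothetical Python equivalent of comm -12; this is not the attached script's actual code:

```python
# Streaming intersection of two sorted ID lists, like `comm -12 a b`:
# advance whichever side is smaller, emit on a match. O(1) extra memory.
def common_ids(sorted_a, sorted_b):
    """Yield IDs present in both sorted iterables, in order."""
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            yield a
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            a = next(it_a, None)
        else:
            b = next(it_b, None)
```

Because both inputs are pre-sorted, this never holds more than one ID from each list in memory, which is why sort + comm scales where the all-in-RAM approaches upthread fall over.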
#11 |
Member
Location: USA
Join Date: Jan 2012
Posts: 21
It works, but with a little bit of pain: it only works below 4 GB.
#12 |
Junior Member
Location: Italy
Join Date: Nov 2008
Posts: 6
Hi everybody!
I'm facing the same problem with my processed datasets, and here is my bash scripting approach. The following code is just a backbone, but the individual steps have been tested as working on a 10k-read subset. Fastq read names should be in the format "@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum". In the first step, grep uses "@HWI" to identify the read names; the pattern should be customizable, so as to avoid (as has been suggested) matching quality lines. In the second step, sed removes the forward/reverse identifiers (/1, /2) [readnum] from the read names, and the -d option of uniq prints duplicate lines only, so only paired read names are stored in the output list. The list is then loaded into an array and used to search for the reads in the forward/reverse processed datasets and build the merged file.
#!/bin/bash

echo "Step #1 running... Extracting read names"
echo "Extracting forward read names"
grep "@HWI" basename.f.fastq > basename.list
echo "Extracting reverse read names"
grep "@HWI" basename.r.fastq >> basename.list

echo "Step #2 running... Extracting paired read names"
sed s/"\/[0-9]"//g basename.list | sort | uniq -d > basename.list.duplicated

echo "Step #3 running... Merging forward and reverse paired reads"
array=(`cat basename.list.duplicated`)
len=${#array[*]}
i=0
while [ $i -lt $len ]; do
    grep -w -A 3 "${array[$i]}" basename.f.fastq >> basename.merged.fastq
    grep -w -A 3 "${array[$i]}" basename.r.fastq >> basename.merged.fastq
    let i++
done

TO DO, obviously: the recovery of orphan reads! Any help is really appreciated!

Cheers,
stefano
#13 | |
Devon Ryan
Location: Freiburg, Germany
Join Date: Jul 2011
Posts: 3,480
The biggest choke point with that script is grepping through the entire fastq file for every read. I could foresee that getting painfully slow very quickly. Since you're sorting the read names to eliminate duplicates, you'll be processing reads in an ordered manner. You could cut down on your grep time by splitting the fastq files according to lane and tile, then only grepping through the appropriate one. I expect this would significantly increase performance. If you wrote this in something other than a scripting language, you could also read the individual fastq files into memory to decrease the disk i/o penalty (and open the possibility of decreasing the search space after each search cycle, depending on how things are stored in memory).
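The lane/tile bucketing suggested here could look something like the sketch below (hypothetical code, not from the thread, assuming the CASAVA-style names used upthread, where the tile is the third colon-separated field of the name line):

```python
# Bucket fastq records by tile so that later lookups only scan the
# bucket a read actually belongs to, instead of the whole file.
from collections import defaultdict

def split_by_tile(path):
    """Return {tile: [4-line records]} keyed on field 3 of the name line."""
    buckets = defaultdict(list)
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            # keep the record intact: name, sequence, "+", quality
            record = header + fh.readline() + fh.readline() + fh.readline()
            tile = header.split(":")[2]   # machine:lane:tile:X:Y#idx/read
            buckets[tile].append(record)
    return buckets
```

Each bucket could then be written to its own file, so the per-read grep only touches reads from the same tile.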
Tags |
bioperl, solexa paired-end |