  • Combining the paired reads from an Illumina run

    Hi,

    I have two fastq files with the forward (/1) and reverse (/2) paired reads. The reads are not in the same order in the two files, some mates are absent/missing, and the files are about 8 GB each, with roughly 30 million reads apiece.

    I am trying to pull out all the read pairs for which both the forward and reverse reads exist, but I am running out of memory using a BioPerl script. Are there any efficient C- or C++-based tools out there for doing this? Any algorithm ideas where I don't need to read both files into memory and can just parse through them?

    Thanks!

    -S.

  • #2
    S.

    I haven't used it myself, but Bio::Index::Fastq looks like it could be useful. With indexed fastq files you don't need to keep the entirety of your data in memory.

    My first try would be to index both files and then create a list of IDs for which both fwd and rev reads exist. Then stream through that list and output new fwd/rev fastq files with just the paired reads, in the same order.
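
    Something along these lines is what I have in mind. It's a completely untested sketch, and it assumes your read names carry /1 and /2 suffixes; check the Bio::Index::Fastq documentation before trusting any of it:

    Code:
    #!/usr/bin/env perl
    # Untested sketch of the indexing approach described above.
    # Assumes read IDs of the form "name/1" and "name/2".
    use strict;
    use warnings;
    use Bio::Index::Fastq;
    use Bio::SeqIO;
    
    my ($fwd, $rev) = @ARGV;
    
    # Index both fastq files on disk so the records themselves never
    # have to sit in memory.
    my $inx1 = Bio::Index::Fastq->new(-filename => "$fwd.idx", -write_flag => 1);
    $inx1->make_index($fwd);
    my $inx2 = Bio::Index::Fastq->new(-filename => "$rev.idx", -write_flag => 1);
    $inx2->make_index($rev);
    
    # Collect the base IDs (minus /1) present in the forward file.
    # NOTE: this still keeps one hash key per read, just not the reads.
    my %fwd_ids;
    open my $fh, '<', $fwd or die "Cannot open $fwd: $!";
    while (<$fh>) {
        $fwd_ids{$1} = 1 if /^\@(\S+)\/1/;
        <$fh>; <$fh>; <$fh>;    # skip sequence, '+' and quality lines
    }
    close $fh;
    
    # Stream the reverse file; for every ID present in both files, fetch
    # the two mates from the indexes and write them out in step.
    my $out1 = Bio::SeqIO->new(-file => ">$fwd.shared", -format => 'fastq');
    my $out2 = Bio::SeqIO->new(-file => ">$rev.shared", -format => 'fastq');
    open $fh, '<', $rev or die "Cannot open $rev: $!";
    while (<$fh>) {
        if (/^\@(\S+)\/2/ && $fwd_ids{$1}) {
            $out1->write_seq($inx1->fetch("$1/1"));
            $out2->write_seq($inx2->fetch("$1/2"));
        }
        <$fh>; <$fh>; <$fh>;
    }
    close $fh;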

    Update 2012-01-05:

    To anyone reading this thread anew, don't bother trying to use Bio::Index::Fastq (at least as it exists on this date). It simply does not scale to the data volumes coming from Illumina instruments. If you attempt to use it, your computer will literally catch on fire. OK, not really, but it will likely fill all its memory and swap and come to a grinding halt.

    See my new message below.
    Last edited by kmcarr; 01-05-2012, 06:47 AM. Reason: Bio::Index::Fastq will not work



    • #3
      Good suggestion! I have used other indexing modules from BioPerl. I am trying out a Perl hack right now, but if that doesn't work, I will give Bio::Index::Fastq a shot. Thanks!

      Best,
      Surya



      • #4
        I wrote a perl script to do this using grep; it works by retrieving read names, relying on the fact that every read name has a "@" before it. It works but is slow, even though it doesn't need to read the files into memory. I'm sure someone could improve on it...


        Code:
        #!/usr/bin/env perl
        use strict;
        use warnings;
        
        # Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
        # Input is quality filtered files for reads 1 and 2
        # Output is two new fastq files, one for each read containing only shared reads
        # Reads will be output in the same order in each file
        # greigite 3.2011
        
        if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}
        
        my $filenameone = $ARGV[0] . ".shared";
        my $filenametwo = $ARGV[1] . ".shared";
        #my $logfile = "unshared_reads.log";
        
        open my $readoneout, '>>', $filenameone or die "Cannot open $filenameone: $!";
        open my $readtwoout, '>>', $filenametwo or die "Cannot open $filenametwo: $!";
        #open my $logfh, '>>', $logfile;
        
        # build arrays of read names (every header line contains "@")
        my @readsonetmp = `grep "@" $ARGV[0]`;
        my @readstwotmp = `grep "@" $ARGV[1]`;
        
        my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; #trim off read identifier "/1"
        my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; #trim off read identifier "/2"
        
        foreach my $readone (@readsone) {
          if (grep { $_ eq $readone } @readstwo) {
             # find the header's line number, then pull out the full 4-line record
             my $lineone = `grep -n $readone $ARGV[0]`;
             my ($linenumone) = $lineone =~ /([0-9]+):/;
             $linenumone += 3;
             my @printone = `head -$linenumone $ARGV[0] | tail -4`;
        
             my $linetwo = `grep -n $readone $ARGV[1]`;
             my ($linenumtwo) = $linetwo =~ /([0-9]+):/;
             $linenumtwo += 3;
             my @printtwo = `head -$linenumtwo $ARGV[1] | tail -4`;
        
             print $readoneout @printone;
             print $readtwoout @printtwo;
          }
        }



        • #5
          Originally posted by greigite View Post
          I wrote a perl script to do this using grep; it works by retrieving read names, relying on the fact that every read name has a "@" before it. It works but is slow, even though it doesn't need to read the files into memory. I'm sure someone could improve on it...
          Be very careful with that. In some quality scales, @ is a valid quality score, so you could match a quality line instead of a name, if it's the first character in the string.

          @NAME is probably a safer thing to grep for; the odds of your quality score having multiple characters in a row in common with your read name are much lower.
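
          For example, in the old Illumina Phred+64 scale '@' encodes Q0, a perfectly legal quality value, so a (made-up) record like this hands grep a false hit on its quality line:

          Code:
          @HWI-EAS100R:5:73:941:1973#0/1
          GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAA
          +
          @IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII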



          • #6
            Good point, swbarnes. I haven't had any problems yet, but better not to take chances. Here is a corrected version:

            Code:
            #!/usr/bin/env perl
            use strict;
            use warnings;
            
            # Program to compare two quality filtered fastq files from a PE run and output new files containing only shared sequences
            # Input is quality filtered files for reads 1 and 2
            # Output is two new fastq files, one for each read containing only shared reads
            # Reads will be output in the same order in each file
            # greigite 3.2011
            
            if (@ARGV != 2) {    die "Usage: filter_read_pairs.pl <fastq for read 1> <fastq for read 2>\n\n";}
            
            my $filenameone = $ARGV[0] . ".shared";
            my $filenametwo = $ARGV[1] . ".shared";
            #my $logfile = "unshared_reads.log";
            
            open my $readoneout, '>>', $filenameone or die "Cannot open $filenameone: $!";
            open my $readtwoout, '>>', $filenametwo or die "Cannot open $filenametwo: $!";
            #open my $logfh, '>>', $logfile;
            
            # build arrays of read names; the @ is escaped so Perl does not
            # try to interpolate @ILLUMINA as an array inside the backticks
            my @readsonetmp = `grep "\@ILLUMINA" $ARGV[0]`;
            my @readstwotmp = `grep "\@ILLUMINA" $ARGV[1]`;
            
            my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
            my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp; 
            
            foreach my $readone (@readsone) {
              if (grep { $_ eq $readone } @readstwo) {
                 # find the header's line number, then pull out the full 4-line record
                 my $lineone = `grep -n $readone $ARGV[0]`;
                 my ($linenumone) = $lineone =~ /([0-9]+):/;
                 $linenumone += 3;
                 my @printone = `head -$linenumone $ARGV[0] | tail -4`;
            
                 my $linetwo = `grep -n $readone $ARGV[1]`;
                 my ($linenumtwo) = $linetwo =~ /([0-9]+):/;
                 $linenumtwo += 3;
                 my @printtwo = `head -$linenumtwo $ARGV[1] | tail -4`;
            
                 print $readoneout @printone;
                 print $readtwoout @printtwo;
              }
            }



            • #7
              Thank you for the replies, greigite and swbarnes. Just finished parsing the files.



              • #8
                I have two paired-end fastq files. I trimmed the low-quality 3' ends of the reads, so the two reads of a pair are no longer the same length, but the order and the number of the reads are the same. Will this affect bwa? I am having some problems with sampe and do not know whether this is the cause.


                Originally posted by greigite View Post
                Good point, swbarnes. I haven't had any problems yet, but better not to take chances. Here is a corrected version: ... (script as in #6 above)


                • #9
                  suggestions

                   This was nice to find: I was stuck with this exact same issue. By way of putting something back, here are some changes I made which sped things up and simplified the code.

                   These changes work with data generated by an early Cassava pipeline, in which each sequence entry has the 4-line format:
                  @machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum
                  SEQUENCE
                  +same as @line
                  Qcall codes

                  They have not been tried with data from other pipelines.

                   I passed the machine ID code (@machine identifier:, from the start of each .fastq @line) as parameter 3 and assigned it to $machine_id. I extended the grep and used the --only-matching parameter to grab the entire sample identifier (but excluding the read identifier digit) for each read. This gave me arrays of read identifiers in the format:
                  @machine identifier:lane:tile:Xcoord:Ycoord:#index/
                  Code:
                   my @readsone = `grep -E "$machine_id.*/" --only-matching $ARGV[0]`;  # quote the pattern so the shell cannot glob-expand it
                   my @readstwo = `grep -E "$machine_id.*/" --only-matching $ARGV[1]`;
                  This directly replaced this code ...
                  Code:
                  # build arrays of read names
                   my @readsonetmp = `grep "\@ILLUMINA" $ARGV[0]`;
                   my @readstwotmp = `grep "\@ILLUMINA" $ARGV[1]`;
                  ... and by directly generating @readsone and @readstwo, also removed the need for this code.
                  Code:
                  my @readsone = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readsonetmp; 
                  my @readstwo = map{ chomp $_; $_ =~s/'//g; my ($trim) = $_ =~/(.*)\/[0-9]/; $trim;}@readstwotmp;
                   Within the loop processing the arrays, I used the grep parameter -A3 to return the three lines following each match, which retrieves the entire sequence entry. This also eliminated a few lines of code (commented out with # below).

                  Code:
                  foreach my $readone(@readsone){
                  #  my (@printonetrim,@printtwotrim);
                    if (grep ($_ eq $readone, @readstwo)) {
                  #     my $lineone = `grep -n $readone $ARGV[0]`;
                  #     my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
                  #     $linenum = $linenum +3;
                  #     my @printone = `head -$linenum $ARGV[0] | tail -4`;
                        my @printone = `grep -A3 $readone $ARGV[0]`;
                   
                   #     my $linetwo = `grep -n $readone $ARGV[1]`;
                   #     my ($linenum) = $linetwo =~/([0-9]{1,}):.*/;
                   #     $linenum = $linenum +3;
                   #     my @printtwo = `head -$linenum $ARGV[1] | tail -4`;
                        my @printtwo = `grep -A3 $readone $ARGV[1]`;
                  
                        print $readoneout @printone;
                        print $readtwoout @printtwo;
                    }
                  }

                  M


                   original code: see the full script in #6 above.



                  • #10
                    Hi all,

                    Since my advice upthread about using Bio::Index::Fastq was bad, I thought I should offer the solution I worked out. This method still relies on indexing the fastq files, using the cdbfasta/cdbyank utilities for the job. These utilities are very fast and efficient, even with large files, and can be downloaded here. In fact the attached script is pretty much just a wrapper for cdbfasta and a bunch of standard *nix utilities (cat, sort, comm), which will be far faster than any pure Perl implementation.

                    The script takes as input two fastq files, presumably paired fastq files that have been through some QC pipeline which has left different numbers of reads, and different pair members, in each. No particular read order is assumed for the input files (the IDs are sorted by the script).

                    Usage:

                    Code:
                    # pairFastq.pl <read1.fastq> <read2.fastq>
                    The output is three fastq files: matched read pairs are output to new read1 and read2 files, and unpaired reads are output to a singletons file.

                    I haven't bothered yet to set up command line parsing (beyond the input files), so if you would like the script to output progress messages, edit line #14 to:

                    Code:
                    my $verbose = 1;
                    I hope you may find this useful.
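
                    For anyone who can't grab the attachment, the heart of it is roughly along these lines. This is only a sketch of the idea, not the attached script itself; the -Q fastq flag, the cdbyank options, and the /1,/2 suffix handling are assumptions to check against your cdbfasta build and your read names, and the singletons output is omitted for brevity:

                    Code:
                    #!/usr/bin/env perl
                    # Rough sketch of the cdbfasta/cdbyank + sort/comm pipeline
                    # described above (not the attached script). Assumes read names
                    # end in /1 and /2, and that this cdbfasta build accepts -Q for
                    # fastq input -- check your build's usage message.
                    use strict;
                    use warnings;
                    
                    my ($r1, $r2) = @ARGV;
                    die "Usage: pairFastq_sketch.pl <read1.fastq> <read2.fastq>\n" unless $r2;
                    
                    # 1. Build the on-disk indexes (creates read1.fastq.cidx etc.).
                    system("cdbfasta -Q $r1") == 0 or die "cdbfasta failed on $r1\n";
                    system("cdbfasta -Q $r2") == 0 or die "cdbfasta failed on $r2\n";
                    
                    # 2. List the keys in each index, strip the /1 or /2 suffix, sort.
                    system("cdbyank -l $r1.cidx | sed 's;/[12]\$;;' | sort > $r1.ids");
                    system("cdbyank -l $r2.cidx | sed 's;/[12]\$;;' | sort > $r2.ids");
                    
                    # 3. comm -12 keeps only the IDs present in both sorted lists.
                    system("comm -12 $r1.ids $r2.ids > shared.ids");
                    
                    # 4. Re-append the suffixes and let cdbyank pull the full records;
                    #    both outputs then come out in the same (sorted) ID order.
                    system("sed 's;\$;/1;' shared.ids | cdbyank $r1.cidx > $r1.paired");
                    system("sed 's;\$;/2;' shared.ids | cdbyank $r2.cidx > $r2.paired");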

                    Update:
                    I have just discovered that there is an internal limit in cdbfasta of 4 GB on the index file size, making this method a non-starter for HiSeq-sized output.
                    Attached Files
                    Last edited by kmcarr; 02-24-2012, 08:43 AM. Reason: Discovered limitation in cdbfasta



                    • #11
                      Thanks
                      
                      It works, but with the slight pain that it only handles files under 4 GB.



                      • #12
                        Hi everybody!
                        I'm facing the same problem with my processed datasets, and here is my bash scripting approach.

                        The following code is just a backbone, but the individual steps have been tested and work on a 10k-read subset.

                        Fastq read names should be in the format "@machine identifier:lane:tile:Xcoord:Ycoord:#index/readnum"

                        In the first step, grep uses "@HWI" to identify the read names; the pattern should be customizable, so as to avoid (as has been suggested above) matching quality lines.

                        In the second step, sed removes the f/r identifiers (/1, /2) [readnum] from the read names, and the -d option of uniq prints duplicate lines only, so only paired read names end up in the output list. The list is then loaded into an array and used to search for the reads in the forward/reverse processed datasets and build the merged file.

                        Code:
                        #!/bin/bash
                        echo "Step #1 running...Estracting read names"
                        echo "Estracting forward read names"
                        grep "@HWI" basename.f.fastq > basename.list
                        echo "Estracting reverse read names"
                        grep "@HWI" basename.r.fastq >> basename.list
                        
                        echo "Step #3 running...Estracting paired reads mames"
                        sed s/"\/[0-9]"//g basename.list | sort | uniq -d > basename.list.duplicated
                        
                        echo "Step #3 running...Merging forward and reverse paired reads"
                        array=(`cat basename.list.duplicated`)
                        len=${#array[*]}
                        
                        i=0
                        while [ $i -lt $len ]; do
                        	grep -w -A 3 "${array[$i]}" basename.f.fastq >> basename.merged.fastq
                        	grep -w -A 3 "${array[$i]}" basename.r.fastq >> basename.merged.fastq
                        	let i++
                        done
                        I'm posting the backbone because I'm looking for suggestions to enhance the script's performance (some sort of forking? threading?); I still have to try it on a real 40-million-read dataset, and I don't know how it will behave on my 12 GB RAM machine.

                        TO DO, obviously, the recovery of orphan reads!
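
                        Something like this untested Perl equivalent of sort | uniq -u on the stripped name list should find them (it keeps a hash of all names in memory, so it's only a starting point for a full-size run):

                        Code:
                        #!/usr/bin/env perl
                        # Untested sketch for the orphan-read TO DO: print read names
                        # that appear only once in basename.list, i.e. reads whose
                        # mate was removed by the quality filtering.
                        use strict;
                        use warnings;
                        
                        my %count;
                        open my $in, '<', 'basename.list' or die "Cannot open basename.list: $!";
                        while (<$in>) {
                            chomp;
                            s{/[0-9]$}{};    # strip the /1 or /2 suffix, like the sed step
                            $count{$_}++;
                        }
                        close $in;
                        
                        open my $out, '>', 'basename.list.orphans' or die "Cannot open output: $!";
                        print $out "$_\n" for grep { $count{$_} == 1 } sort keys %count;
                        close $out;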

                        Any help is really appreciated!

                        Cheers

                        stefano



                        • #13
                          The biggest choke point with that script is grepping through the entire fastq file for every read; I can foresee that getting painfully slow very quickly. Since you're sorting the read names to eliminate duplicates, you'll be processing reads in an ordered manner. You could cut down on your grep time by splitting the fastq files according to lane and tile (a sketch follows below), then only grepping through the appropriate one; I expect this would significantly improve performance. If you wrote this in something other than a scripting language, you could also read the individual fastq files into memory to reduce the disk I/O penalty (and open up the possibility of shrinking the search space after each search cycle, depending on how things are stored in memory).
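
                          As a rough, untested sketch of that splitting step (assuming old-style @machine:lane:tile:X:Y#index/readnum read names and a well-formed fastq on stdin):

                          Code:
                          #!/usr/bin/env perl
                          # Untested sketch of the lane/tile split suggested above.
                          # Assumes old-style read names and 4-line fastq records on stdin.
                          use strict;
                          use warnings;
                          
                          my %handles;    # one output filehandle per lane:tile combination
                          while (my $name = <STDIN>) {
                              my $seq  = <STDIN>;
                              my $plus = <STDIN>;
                              my $qual = <STDIN>;
                              my ($lane, $tile) = $name =~ /^\@[^:]+:(\d+):(\d+):/
                                  or die "unexpected read name: $name";
                              my $key = "$lane.$tile";
                              unless ($handles{$key}) {
                                  open $handles{$key}, '>', "split.$key.fastq"
                                      or die "cannot open split.$key.fastq: $!";
                              }
                              print { $handles{$key} } $name, $seq, $plus, $qual;
                          }
                          close $_ for values %handles;

                          Each sorted batch of read names then only needs to be searched against its own split.<lane>.<tile>.fastq chunk.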

                          Originally posted by sghignone View Post
                          Hi everybody!
                          I'm facing the same problem with my processed datasets, and here is my bash scripting approach. ... (full post quoted in #12 above)
