  • Illumina Paired End Merge script

    Hi all,

    I am looking for a script/program that will take paired-end reads from an Illumina run and put them into a single FASTQ file. I searched this site and found a Perl script, but it does not work. Any help would be appreciated. Thanks.

  • #2
    Not tested, and assumes no blank lines in your files, but this should work:

    Code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    
    # Merge together two FastQ files
    # Usage is merge_fastq.pl [read1 file] [read2 file] [outfile]
    
    
    my ($in1,$in2,$out) = @ARGV;
    
    die "Usage is merge_fastq.pl [read1 file] [read2 file] [outfile]\n" unless ($out);
    
    open (IN1,'<',$in1) or die "Can't open $in1: $!";
    open (IN2,'<',$in2) or die "Can't open $in2: $!";
    open (OUT,'>',$out) or die "Can't write to $out: $!";
    
    my $count;
    while (1) {
      ++$count;
      my $line1 = <IN1>;
      my $line2 = <IN2>;
    
      last unless (defined $line1 and defined $line2);
    
      if ($count % 2) {
        print OUT $line1;
      }
      else {
        chomp $line1;
        print OUT $line1,$line2;
      }
    
    }
    
    close OUT or die "Can't write to $out: $!";



    • #3
      Shorty provides a very fast Perl script to merge FASTQ sequences in the following way:

      @read_id1/1
      ...
      +
      ...
      @read_id2/2
      ...
      +
      ...
      and so on..

      Code:
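      #!/usr/bin/perl
      use strict;
      use warnings;
      
      # Interleave two FASTQ files: each 4-line record from file A is
      # followed by the 4-line record at the same position in file B.
      my ($filenameA, $filenameB, $filenameOut) = @ARGV;
      die "Usage: merge.pl file1.fastq file2.fastq out.fastq\n" unless defined $filenameOut;
      
      open my $FILEA, '<', $filenameA or die "Can't open $filenameA: $!";
      open my $FILEB, '<', $filenameB or die "Can't open $filenameB: $!";
      open my $OUTFILE, '>', $filenameOut or die "Can't write to $filenameOut: $!";
      
      while (defined(my $line = <$FILEA>)) {
      	# Header line of the record from file A, then its other three lines
      	print $OUTFILE $line;
      	for (1 .. 3) {
      		$line = <$FILEA>;
      		print $OUTFILE $line if defined $line;
      	}
      
      	# The four lines of the matching record from file B
      	for (1 .. 4) {
      		$line = <$FILEB>;
      		print $OUTFILE $line if defined $line;
      	}
      }
      
      close $OUTFILE or die "Can't write to $filenameOut: $!";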
      Note: It assumes that both files contain the same number of reads and that the reads are in the same order.
      Usage: merge.pl file1.fastq file2.fastq out.fastq
      Last edited by Jenzo; 04-07-2011, 12:32 AM.
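
The "same number of reads, same order" assumption can be checked while merging rather than trusted. The following is only a sketch, not part of Jenzo's script: it assumes headers end in /1 and /2 as in the examples in this thread, and the names read_record, pair_id and interleave_checked are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one 4-line FASTQ record; returns an array ref, or undef if a
# full record is not available (end of file or truncated record).
sub read_record {
    my ($fh) = @_;
    my @lines;
    for (1 .. 4) {
        my $line = <$fh>;
        return undef unless defined $line;
        push @lines, $line;
    }
    return \@lines;
}

# Strip the leading '@' and a trailing /1 or /2 so mates can be compared.
# Assumption: old-style Illumina headers, as shown in this thread.
sub pair_id {
    my ($header) = @_;
    chomp $header;
    $header =~ s/^@//;
    $header =~ s{/[12]$}{};
    return $header;
}

# Interleave two FASTQ files, dying if the mates fall out of step.
# Returns the number of pairs written.
sub interleave_checked {
    my ($file1, $file2, $outfile) = @_;
    open my $in1, '<', $file1   or die "Can't open $file1: $!";
    open my $in2, '<', $file2   or die "Can't open $file2: $!";
    open my $out, '>', $outfile or die "Can't write to $outfile: $!";

    my $pairs = 0;
    while (1) {
        my $rec1 = read_record($in1);
        my $rec2 = read_record($in2);
        last unless ($rec1 or $rec2);
        die "Files have different numbers of reads\n" unless ($rec1 and $rec2);
        die "ID mismatch at pair " . ($pairs + 1) . "\n"
            unless pair_id($rec1->[0]) eq pair_id($rec2->[0]);
        print $out @$rec1, @$rec2;
        ++$pairs;
    }
    close $out or die "Can't write to $outfile: $!";
    return $pairs;
}

# Run only when three file names are given on the command line.
if (@ARGV == 3) {
    my $n = interleave_checked(@ARGV);
    print STDERR "Wrote $n pairs\n";
}
```

Called with the same three arguments as merge.pl, it should stop with an error at the first pair whose IDs disagree instead of silently writing a mispaired file.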



      • #4
        The two posted scripts do slightly different things. The one I posted concatenates the sequences and qualities together so if you started with a 2 x 40bp run then you'd end up with a file of 80bp reads.

        The second script simply places the reads from the two files one after another in the combined file, so you'd end up with a file of 40bp reads that is twice as long. It's roughly equivalent to doing:

        Code:
        cat [file1] [file2] > [outfile]
        except that it puts the equivalent reads next to each other in the final file.

        Which one you use depends on how you want to combine the files.
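
To make the difference concrete, here is a small sketch of the two merge modes applied to a single read pair held in memory. The reads are made-up 4 bp toys and the sub names concatenate_pair and interleave_pair are hypothetical; neither posted script is structured this way.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each record is an array ref of the four FASTQ lines, without newlines:
# [header, sequence, '+', qualities]

# Mode 1: concatenate the pair into one longer read under the read-1 header.
sub concatenate_pair {
    my ($r1, $r2) = @_;
    return [ $r1->[0], $r1->[1] . $r2->[1], $r1->[2], $r1->[3] . $r2->[3] ];
}

# Mode 2: interleave, keeping both records intact with read 1 first.
sub interleave_pair {
    my ($r1, $r2) = @_;
    return ($r1, $r2);
}

# Toy 4 bp reads, invented for illustration only.
my $read1 = [ '@pair1/1', 'ACGT', '+', 'IIII' ];
my $read2 = [ '@pair1/2', 'TTGA', '+', 'HHHH' ];

print "Concatenated:\n";
print "$_\n" for @{ concatenate_pair($read1, $read2) };

print "Interleaved:\n";
print "$_\n" for map { @$_ } interleave_pair($read1, $read2);
```

The first mode turns a 2 x 4 bp pair into one 8 bp record and drops the /2 header; the second keeps two 4 bp records, one after the other.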



        • #5
          hehe, that's right ;-) thanks for pointing it out!



          • #6
            Thanks for the script Andrew. I tried the script out and it seems that the script joins the files but not in the Paired end fashion.

            Original file (1)

            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTAT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######
            @HWI-EAS216_0001:1:1:1079:9356#0/1
            CGCTCAAGAGATGGGCTTTGGGTGCGGAATGGGGATTTGGGTTGTGACCCAATACAGCGGTAGTAGCGTGCAGCA
            +
            BBB=>B=BCCCCCCCCCACCCCBCBCC@BBCCCABC@CCCB@CCA@C?B9C7?@:<@##################

            Original file (2)


            @HWI-EAS216_0001:1:1:1079:15982#0/2
            GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################
            @HWI-EAS216_0001:1:1:1079:9356#0/2
            GCAGGATTGCCATTCCCATCAGCTTTCTGCTGCACGCTACTACCGCTGTATTGGGTCACAACCCAAATCCCCATT
            +
            CCCCCBBCCCCCACCCCCCCCCC?CCCCBCCCCCCCCCCCBCCCC@ABCCCCCC<C;C>CCCBCCCBC>CCBC>>

            The script from Andrew does this (keeping only the 0/1 headers, with the 0/2 sequence and qualities appended to each read):


            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTATGTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################
            @HWI-EAS216_0001:1:1:1079:9356#0/1
            CGCTCAAGAGATGGGCTTTGGGTGCGGAATGGGGATTTGGGTTGTGACCCAATACAGCGGTAGTAGCGTGCAGCAGCAGGATTGCCATTCCCATCAGCTTTCTGCTGCACGCTACTACCGCTGTATTGGGTCACAACCCAAATCCCCATT
            +
            BBB=>B=BCCCCCCCCCACCCCBCBCC@BBCCCABC@CCCB@CCA@C?B9C7?@:<@##################CCCCCBBCCCCCACCCCCCCCCC?CCCCBCCCCCCCCCCCBCCCC@ABCCCCCC<C;C>CCCBCCCBC>CCBC>>


            What I want is:

            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTAT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######
            @HWI-EAS216_0001:1:1:1079:15982#0/2
            GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################

            Hope this helps. I know it's possible.

            Thanks for all the help.



            • #7
              Originally posted by newbietonextgen:
              What I want is:

              @HWI-EAS216_0001:1:1:1079:15982#0/1
              TATGCTCTGCCTTGGCTGTGTCATCGTGTTGAT
              +
              CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@
              @HWI-EAS216_0001:1:1:1079:15982#0/2
              GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTT
              +
              BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@
              That's what Jenzo's script would produce, isn't it?



              • #8
                I think so, but I did not try it. I used fastq_merge.pl, your script.



                • #9
                  I explained in the second note I added that the two scripts posted did different things, and it depended on how you wanted to merge your files. Just out of interest which pipeline are you using which requires the paired files to be placed one after another?



                  • #10
                    Ha, sorry, my mistake. I figured it out. Thanks. SHRiMP requires that paired reads are placed one after the other.



                    • #11
                      Thanks guys. Scarpa too requires a merged fastq with "interleaved" reads (so that reads from the same pair follow each other) and Jenzo's script does that.



                      • #12
                        And what about combining PE reads from multiple runs? I have two runs from the same library and I would like to combine the PE reads into the same file (one file for R1 and one file for R2), keeping the read separation as per Jenzo's script. Would my code look something like this?

                        #!/usr/bin/perl

                        $filename_R1_Run1 = $ARGV[0];
                        $filename_R1_Run2 = $ARGV[1];
                        $filename_R1_Runs1And2 = $ARGV[2];

                        open $FILE_R1_Run1, "< $filename_R1_Run1";
                        open $FILE_R1_Run2, "< $filename_R1_Run2";

                        open $FILE_R1_Runs1And2, "> $filename_R1_Runs1And2";

                        while(<$FILE_R1_Run1>) {
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run1>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run1>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run1>;
                        print $FILE_R1_Runs1And2 $_;

                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        $_ = <$FILE_R1_Run2>;
                        print $FILE_R1_Runs1And2 $_;
                        }
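
A possible alternative, offered as a sketch rather than a tested pipeline: alternating records from run 1 and run 2 only works when both runs have exactly the same number of reads. Since pairing depends only on R1 and R2 records sitting at the same position in their files, it may be simpler to append run 2 after run 1, once for the R1 files and once for the R2 files, listing the runs in the same order both times. The sub name append_runs and the script name combine_runs.pl are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Append the given FASTQ files, in order, to one output file.
# Returns the number of lines written.
sub append_runs {
    my ($outfile, @infiles) = @_;
    open my $out, '>', $outfile or die "Can't write to $outfile: $!";
    my $lines = 0;
    for my $infile (@infiles) {
        open my $in, '<', $infile or die "Can't open $infile: $!";
        while (my $line = <$in>) {
            # Guard against a missing newline at the end of a file, which
            # would otherwise glue it to the next file's first header.
            $line .= "\n" unless $line =~ /\n\z/;
            print $out $line;
            ++$lines;
        }
        close $in;
    }
    close $out or die "Can't write to $outfile: $!";
    die "$outfile: line count $lines is not a multiple of 4\n" if $lines % 4;
    return $lines;
}

# Hypothetical usage: combine_runs.pl out.fastq run1.fastq run2.fastq
# Run once for the R1 files and once for the R2 files, with the runs
# listed in the same order both times.
if (@ARGV >= 3) {
    my ($outfile, @infiles) = @ARGV;
    my $lines = append_runs($outfile, @infiles);
    print STDERR "Wrote ", $lines / 4, " reads to $outfile\n";
}
```

The two combined files can then be interleaved with Jenzo's script if a single paired file is needed.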
