  • Illumina Paired End Merge script

    Hi all,

    I am looking for a script/program which will take paired end reads from an Illumina run and put into a single fastq file. I did search this site and found a perl script but it does not work. Any help would be appreciated. Thanks.

  • #2
    Not tested, and assumes no blank lines in your files, but this should work:

    Code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    
    # Merge together two FastQ files
    # Usage is merge_fastq.pl [read1 file] [read2 file] [outfile]
    
    
    my ($in1,$in2,$out) = @ARGV;
    
    die "Usage is merge_fastq.pl [read1 file] [read2 file] [outfile]\n" unless ($out);
    
    open (IN1,'<',$in1) or die "Can't open $in1: $!";
    open (IN2,'<',$in2) or die "Can't open $in2: $!";
    open (OUT,'>',$out) or die "Can't write to $out: $!";
    
    my $count = 0;
    while (1) {
      ++$count;
      my $line1 = <IN1>;
      my $line2 = <IN2>;
    
      last unless (defined $line1 and defined $line2);
    
      if ($count % 2) {
        print OUT $line1;
      }
      else {
        chomp $line1;
        print OUT $line1,$line2;
      }
    
    }
    
    close OUT or die "Can't write to $out: $!";


    • #3
      Shorty provides a very fast Perl script to merge fastq sequences in the following way:

      @read_id1/1
      ...
      +
      ...
      @read_id2/2
      ...
      +
      ...
      and so on..

      Code:
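      #!/usr/bin/perl
      use strict;
      use warnings;
      
      # Interleave two FastQ files: each 4-line record from file A is
      # followed by the matching 4-line record from file B.
      my ($filenameA, $filenameB, $filenameOut) = @ARGV;
      
      die "Usage: merge.pl file1.fastq file2.fastq out.fastq\n" unless defined $filenameOut;
      
      open(my $FILEA, '<', $filenameA) or die "Can't open $filenameA: $!";
      open(my $FILEB, '<', $filenameB) or die "Can't open $filenameB: $!";
      
      open(my $OUTFILE, '>', $filenameOut) or die "Can't write to $filenameOut: $!";
      
      while (defined(my $line = <$FILEA>)) {
      	# One record (4 lines) from file A; its first line has already been read
      	print $OUTFILE $line;
      	for (1 .. 3) {
      		$line = <$FILEA>;
      		die "Incomplete record in $filenameA\n" unless defined $line;
      		print $OUTFILE $line;
      	}
      
      	# The matching record (4 lines) from file B
      	for (1 .. 4) {
      		$line = <$FILEB>;
      		die "$filenameB ended before $filenameA\n" unless defined $line;
      		print $OUTFILE $line;
      	}
      }
      
      close $OUTFILE or die "Can't write to $filenameOut: $!";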
      Note: It assumes that both files contain the same number of reads and that the sequences are in the same order.
      Usage should be: merge.pl file1.fastq file2.fastq out.fastq
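
      A minimal sketch of how the same-order assumption could be checked before merging, assuming the headers end in /1 and /2 as in the Illumina examples in this thread. The names pair_key and first_mismatch are made up for this illustration, and the reads in the demo are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce a FastQ header line to a key shared by both mates by dropping
# the line ending and a trailing /1 or /2.
sub pair_key {
    my ($header) = @_;
    $header =~ s/\r?\n\z//;
    $header =~ s{/[12]\z}{};
    return $header;
}

# Walk two FastQ filehandles record by record. Returns 0 if every header
# pair agrees and both files end together, otherwise the 1-based number
# of the first record where they disagree or where one file runs out.
sub first_mismatch {
    my ($fh1, $fh2) = @_;
    my $record = 0;
    while (1) {
        my $h1 = <$fh1>;
        my $h2 = <$fh2>;
        ++$record;
        return 0 unless defined($h1) || defined($h2);        # both ended together
        return $record unless defined($h1) && defined($h2);  # one file is shorter
        return $record if pair_key($h1) ne pair_key($h2);
        # Skip the sequence, '+' and quality lines of this record
        for (1 .. 3) {
            my $skip1 = <$fh1>;
            my $skip2 = <$fh2>;
        }
    }
}

# Demo on in-memory "files" with invented reads: the second pair does not match.
my $mates1 = join("\n", '@r1/1', 'ACGT', '+', 'IIII', '@r2/1', 'GGCC', '+', 'IIII') . "\n";
my $mates2 = join("\n", '@r1/2', 'TTAA', '+', 'IIII', '@r3/2', 'CCGG', '+', 'IIII') . "\n";
open(my $fh1, '<', \$mates1) or die "Can't open in-memory file: $!";
open(my $fh2, '<', \$mates2) or die "Can't open in-memory file: $!";

my $bad = first_mismatch($fh1, $fh2);
print $bad ? "Files go out of step at record $bad\n" : "Files are in step\n";
```

      Only the header lines are compared, so this says nothing about whether the records themselves are well formed.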
      Last edited by Jenzo; 04-07-2011, 12:32 AM.


      • #4
        The two posted scripts do slightly different things. The one I posted concatenates the sequences and qualities together so if you started with a 2 x 40bp run then you'd end up with a file of 80bp reads.

        The second script simply places the reads from the two files one after another in the combined file, so you'd end up with a file of 40bp reads which was twice as long. It's roughly equivalent to doing:

        Code:
        cat [file1] [file2] > [outfile]
        except that it puts the equivalent reads next to each other in the final file.

        I guess which one you use depends on how you want to combine the files.
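
        To make the difference concrete, here is a small sketch that applies both kinds of merge to a single made-up read pair held in memory. The helper names concat_pair and interleave_pair are invented for the illustration and are not taken from either script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A FastQ record is held as an array ref of four chomped lines:
# [header, sequence, '+' line, qualities].

# Mode 1: join read 2 onto read 1, giving one longer read per pair.
sub concat_pair {
    my ($r1, $r2) = @_;
    return [ $r1->[0], $r1->[1] . $r2->[1], $r1->[2], $r1->[3] . $r2->[3] ];
}

# Mode 2: keep both reads whole and place them one after the other.
sub interleave_pair {
    my ($r1, $r2) = @_;
    return ( [@$r1], [@$r2] );
}

# Made-up 4bp reads, for illustration only.
my $read1 = [ '@pair1/1', 'ACGT', '+', 'IIII' ];
my $read2 = [ '@pair1/2', 'TTGG', '+', '####' ];

print "Concatenated:\n";
print "$_\n" for @{ concat_pair($read1, $read2) };

print "Interleaved:\n";
print "$_\n" for map { @$_ } interleave_pair($read1, $read2);
```

        The first mode gives a single 8bp read under the read 1 header; the second gives the two 4bp reads unchanged, one after the other.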


        • #5
          Hehe, that's right ;-) Thanks for pointing it out!


          • #6
            Thanks for the script, Andrew. I tried it out, and it seems to join the files, but not in paired-end fashion.

            Original file (1)

            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTAT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######
            @HWI-EAS216_0001:1:1:1079:9356#0/1
            CGCTCAAGAGATGGGCTTTGGGTGCGGAATGGGGATTTGGGTTGTGACCCAATACAGCGGTAGTAGCGTGCAGCA
            +
            BBB=>B=BCCCCCCCCCACCCCBCBCC@BBCCCABC@CCCB@CCA@C?B9C7?@:<@##################

            Original file (2)


            @HWI-EAS216_0001:1:1:1079:15982#0/2
            GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################
            @HWI-EAS216_0001:1:1:1079:9356#0/2
            GCAGGATTGCCATTCCCATCAGCTTTCTGCTGCACGCTACTACCGCTGTATTGGGTCACAACCCAAATCCCCATT
            +
            CCCCCBBCCCCCACCCCCCCCCC?CCCCBCCCCCCCCCCCBCCCC@ABCCCCCC<C;C>CCCBCCCBC>CCBC>>

            The script from Andrew does this (putting all the 0/1 reads first):


            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTATGTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################
            @HWI-EAS216_0001:1:1:1079:9356#0/1
            CGCTCAAGAGATGGGCTTTGGGTGCGGAATGGGGATTTGGGTTGTGACCCAATACAGCGGTAGTAGCGTGCAGCAGCAGGATTGCCATTCCCATCAGCTTTCTGCTGCACGCTACTACCGCTGTATTGGGTCACAACCCAAATCCCCATT
            +
            BBB=>B=BCCCCCCCCCACCCCBCBCC@BBCCCABC@CCCB@CCA@C?B9C7?@:<@##################CCCCCBBCCCCCACCCCCCCCCC?CCCCBCCCCCCCCCCCBCCCC@ABCCCCCC<C;C>CCCBCCCBC>CCBC>>


            What I want is:

            @HWI-EAS216_0001:1:1:1079:15982#0/1
            TATGCTCTGCCTTGGCTGTGTCATCGTGTTGATGCCAACTGACACGAAACTTCTAGGCTGATTCATCCTAAGTAT
            +
            CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@@CCC>2?>>>A?C@CC7@@@@A<@@@A@@@?C@=CC#######
            @HWI-EAS216_0001:1:1:1079:15982#0/2
            GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTTATTGAGTCTGTGTTGAAAAGAAACCACTTACGCATTATACTT
            +
            BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@A?A5C<C@?C=CC4;>A########################

            Hope this helps. I know it's possible.

            Thanks for all the help.


            • #7
              Originally posted by newbietonextgen
              What I want is:

              @HWI-EAS216_0001:1:1:1079:15982#0/1
              TATGCTCTGCCTTGGCTGTGTCATCGTGTTGAT
              +
              CCCCCCCCCCCCCCCC@BCCCCCCCCCCCCC@
              @HWI-EAS216_0001:1:1:1079:15982#0/2
              GTTTCTGAAGAGGCAGGCAGCAGAATTTGGTTT
              +
              BCCCCBCCCCCCB7CCCC;9*;8:>?BB<CC<C@
              That's what Jenzo's script would produce, isn't it?


              • #8
                I think so, but I did not try it. I used fastq_merge.pl, your script.


                • #9
                  I explained in the second note I added that the two scripts posted do different things, and that the choice depends on how you want to merge your files. Just out of interest, which pipeline are you using that requires the paired files to be placed one after another?


                  • #10
                    Ha, sorry, my mistake. I figured it out. Thanks. SHRiMP requires that paired reads are put one behind the other.


                    • #11
                      Thanks guys. Scarpa too requires a merged fastq with "interleaved" reads (so that reads from the same pair follow each other) and Jenzo's script does that.
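
                      For anyone who wants to confirm that a merged file really is interleaved before handing it to such a tool, here is a rough sketch. It assumes the headers end in /1 and /2 as in the examples earlier in this thread; the sub names read_header and interleaved_pairs are invented here, and the demo reads are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one 4-line FastQ record and return its chomped header line,
# or undef at end of file. The other three lines are discarded.
sub read_header {
    my ($fh) = @_;
    my $header = <$fh>;
    return unless defined $header;
    for (1 .. 3) {
        my $skip = <$fh>;
    }
    $header =~ s/\r?\n\z//;
    return $header;
}

# Return the number of pairs if every /1 record is directly followed by
# the /2 record with the same name, or -1 as soon as that is not the case.
sub interleaved_pairs {
    my ($fh) = @_;
    my $pairs = 0;
    while (defined(my $h1 = read_header($fh))) {
        my $h2 = read_header($fh);
        return -1 unless defined $h2;    # odd number of records
        my ($name1, $mate1) = $h1 =~ m{\A(.*)/([12])\z};
        my ($name2, $mate2) = $h2 =~ m{\A(.*)/([12])\z};
        return -1 unless defined $name1 && defined $name2;
        return -1 unless $name1 eq $name2 && $mate1 eq '1' && $mate2 eq '2';
        ++$pairs;
    }
    return $pairs;
}

# Demo on an in-memory "file" holding one invented pair.
my $merged = join("\n",
    '@read_a/1', 'ACGT', '+', 'IIII',
    '@read_a/2', 'TTGG', '+', 'IIII',
) . "\n";
open(my $fh, '<', \$merged) or die "Can't open in-memory file: $!";

my $pairs = interleaved_pairs($fh);
print $pairs < 0 ? "Not interleaved\n" : "Interleaved, $pairs pair(s)\n";
```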


                      • #12
                        And what about combining PE reads from multiple runs? I have two runs from the same library and I would like to combine the PE reads into the same file (one file for R1 and one file for R2), keeping the read separation as per Jenzo's script. Would my code look something like this?

                        #!/usr/bin/perl

                        $filename_R1_Run1 = $ARGV[0];
                        $filename_R1_Run2 = $ARGV[1];
                        $filename_R1_Runs1And2 = $ARGV[2];

                        open $FILE_R1_Run1, "< $filename_R1_Run1";
                        open $FILE_R1_Run2, "< $filename_R1_Run2";

                        open $FILE_R1_Runs1And2, "> $filename_R1_Runs1And2";

                        while(<$FILE_R1_Run1>) {
                        	print $FILE_R1_Runs1And2 $_;
                        	$_ = <$FILE_R1_Run1>;
                        	print $FILE_R1_Runs1And2 $_;
                        	$_ = <$FILE_R1_Run1>;
                        	print $FILE_R1_Runs1And2 $_;
                        	$_ = <$FILE_R1_Run1>;
                        	print $FILE_R1_Runs1And2 $_;

                        	$_ = <$FILE_R1_Run2>;
                        	print $FILE_R1_Runs1And2 $_;
                        	$_ = <$FILE_R1_Run2>;
                        	print $FILE_R1_Runs1And2 $_;
                        	$_ = <$FILE_R1_Run2>;
                        	print $FILE_R1_Runs1And2 $_;
                        	$_ = <$FILE_R1_Run2>;
                        	print $FILE_R1_Runs1And2 $_;
                        }
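
                        One possible alternative, sketched under the assumption that the goal is only to end up with one R1 file and one R2 file: the runs can be appended whole, one after the other, instead of being alternated record by record. The pairs stay matched as long as the runs are listed in the same order for R1 and for R2. The script name combine_runs.pl and the sub append_files are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Append every input file to $out_name, in the order given.
sub append_files {
    my ($out_name, @in_names) = @_;
    open(my $out, '>', $out_name) or die "Can't write to $out_name: $!";
    for my $in_name (@in_names) {
        open(my $in, '<', $in_name) or die "Can't open $in_name: $!";
        print {$out} $_ while <$in>;
        close $in;
    }
    close $out or die "Can't write to $out_name: $!";
}

# Hypothetical usage:
#   combine_runs.pl run1_R1 run2_R1 run1_R2 run2_R2 out_R1 out_R2
if (@ARGV == 6) {
    my ($run1_r1, $run2_r1, $run1_r2, $run2_r2, $out_r1, $out_r2) = @ARGV;
    # Same run order for both outputs, so record N of out_R1 still
    # pairs with record N of out_R2.
    append_files($out_r1, $run1_r1, $run2_r1);
    append_files($out_r2, $run1_r2, $run2_r2);
}
else {
    warn "Usage: combine_runs.pl run1_R1 run2_R1 run1_R2 run2_R2 out_R1 out_R2\n";
}
```

                        Alternating records from the two runs, as in the loop above, only gives a complete file when both runs contain the same number of reads, whereas appending has no such requirement.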
