Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Code to split paired end MiSeq data?

    Hello all,

    I have some paired end MiSeq data that I have cleaned/trimmed. However the software I used to clean everything up requires that the paired end data files be concatenated into one fastq with all reads and their respective mates, and now I need to split them back into two fastq's each containing one read from a paired end set (R1 and R2).

    My fastq file with all paired end reads follows a typical fastq format, example from "head" of original fastq file:

    @M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
    TTATACTC
    +
    @A@AA@A@
    @M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
    T
    +
    E

    Where the bolded "1" or "2" indicate which member of the pair the read is, so I am trying to get all 1's into a separate file and all 2's into a separate file. I have a bit of perl code that a lab mate passed along to me that is supposed to split the reads in this way, and it runs and creates two separate files, but it is not splitting the data correctly based on the 1 and 2 in each header and I get a mixture in each file the program creates. I was wondering if anyone had any idea how to fix this? The code is below.

    Code:
    use strict;
    use warnings;
    
    my $readdata = $ARGV[0];
    open(FILE, "<$readdata") || die "cannot open $readdata\n"; 
    open(OUT1, ">$readdata\_1") || die "cannot open $readdata\_1\n"; 
    open(OUT2, ">$readdata\_2") || die "cannot open $readdata\_2\n";
    
    while(<FILE>){
    	chomp;
    	print OUT1 "$_\/1\n";
    	print OUT2 "$_\/2\n";
    	
    	my $newline = <FILE>; chomp($newline); 
    	print OUT1 substr($newline, 0, length($newline)/2)."\n";
    	print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
    	
    	$newline = <FILE>; chomp($newline); 
    	print OUT1 "$newline\/1\/n";
    	print OUT2 "$newline\/2\/n";
    	
    	$newline = <FILE>; chomp($newline);
    	print OUT1 substr($newline, 0, length($newline)/2)."\n";
    	print OUT2 substr($newline, length($newline)/2, length($newline)/2)."\n";
    	
    }
    
    close(FILE)
    Thanks so much for your help,

    ~Ana

  • #2
    Originally posted by akjones View Post
    Where the bolded "1" or "2" indicate which member of the pair the read is, so I am trying to get all 1's into a separate file and all 2's into a separate file.
    Hi Ana,

    This is not perl but it should work, assuming you are on a Linux/Unix machine:

    Code:
    paste - - - - < test.fq \
    | tee >(awk 'BEGIN{FS="\t"; OFS="\n"} {if (match($1, " 1:N")) print $1,$2,$3,$4}' > test.r1.fq ) \
    | awk 'BEGIN{FS="\t"; OFS="\n"} {if (match($1, " 2:N")) print $1,$2,$3,$4}' > test.r2.fq
    test.fq is your merged input file. test.r1.fq and test.r2.fq are the split files. I think the issue is to set the matching pattern to be specific enough to correctly separate read 1 from read 2 given the read name. Here I set the patterns to " 1:N" and " 2:N" for read 1 and 2 respectively.

    Loosely related to this question, take care that most programs expect the paired fastq files to have read 1 and read 2 in the same order (e.g. most aligners) and your pipeline above seems to break this requirement.

    All the best
    Dario

    Comment


    • #3
      Hi Dario,

      Thanks very much, your code works! What do you mean by
      Loosely related to this question, take care that most programs expect the paired fastq files to have read 1 and read 2 in the same order (e.g. most aligners) and your pipeline above seems to break this requirement.
      are you referring to the direction of the reads (5'-3' or 3'-5') or the order of the reads themselves, as in read 1 followed by read 2 of a pair? Sorry if that is a silly question, I just want to make sure I am understanding you correctly.

      Thanks again,
      ~Ana

      Comment


      • #4
        Originally posted by akjones View Post
        Hi Dario,

        Thanks very much, your code works! What do you mean by are you referring to the direction of the reads (5'-3' or 3'-5') or the order of the reads themselves, as in read 1 followed by read 2 of a pair? Sorry if that is a silly question, I just want to make sure I am understanding you correctly.

        Thanks again,
        ~Ana
        Hi- The fastq files mate 1 should have the same number and order of reads as in the fastq for mate 2. So, if a read is missing from one file, its mate should be removed from the other file as well to keep the correct pairing.

        E.g. Say this is your file for mate 1:
        read_1.fq:
        Code:
        @read1 1:N
        ACTG
        +
        IIII
        @read2 1:N
        ACTG
        +
        IIII
        @read3 1:N
        ACTG
        +
        IIII
        This file for mate 2 would be ok:
        Code:
        @read1 2:N
        AAAA
        +
        IIII
        @read2 2:N
        TTTT
        +
        IIII
        @read3 2:N
        CCCC
        +
        IIII
        This file for mate 2 would be "wrong" as one read is missing:
        Code:
        @read1 2:N
        ACTG
        +
        IIII
        @read3 2:N
        ACTG
        +
        IIII
        Last edited by dariober; 12-04-2013, 09:18 AM. Reason: mate corrected

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        25 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        29 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        24 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        52 views
        0 likes
        Last Post seqadmin  
        Working...
        X