Hi all,
I have a huge FASTQ file that contains paired reads (after trimming and quality filtering). The reads are shuffled because they were used for de novo assembly with Velvet/Oases (an excerpt is included below).
Now I would like to reuse that file for mapping with Bowtie. The problem is that I need the two mates of each pair in separate files. I tried several solutions, like this or this script (pasted below).
The problem is that those solutions seem to assume that the two mates of a pair are consecutive. So I would first need to sort the reads so that mates sit next to each other, and then apply one of the above solutions to split the file.
My question now is: how can I sort the FASTQ file?
I would appreciate any hint!
Thanks.
Excerpt of the shuffled reads:
@HTKZQN1:329:C09BUACXX:6:1101:1442:2235 2:N:0:
TTTTGATTTCTACATTTCATCACTTTTCAGATAATACGATTTTTGAAGATTTTTTCAATGTTATTCGGGAATTATATTCCAA
+
+==A;B?DHHH>4+2CG:<IIA:AC>>C4ACHFEH<EF)CFHF*?1?*:DFG90@;=BCHGI4BFD824'-=EAE;.??;CC
@HTKZQN1:329:C09BUACXX:6:1101:1575:2055 1:N:0:
ATAATTTTGGGTGTTAATACAACAAGGAATCATGCTTTCATATTTGAAAAAATATAGATTAATTATAAAAAATACATTTAATTTGATATTAATGTATAAAA
+
@C@DFDFFHHFACFEHIIIHHIIIJIIJGGHIJJJJJJIIJJJJIIEHCBDFGGEGEIICGCCHGIEDHIHEB:??CBEE;AEEC@A>CCDDDCDADD@D>
@HTKZQN1:329:C09BUACXX:6:1101:1735:2058 1:N:0:
GTGAAGTATAGTAGTTCCATAGGGAATATAGTTAACAAAACACATAAAATCTATAAACTTCAATTTTTCTAGAGCAATAATGTCCCCTTGCAAAAATAAGT
+
@CCFFFFFHHHHHJJJJJHIIJJJGIJJJJJGIIJIJJJJIJJIJJJIJJJJJIIHIJIIJJGIJJJIJJIGHHIHIICHIGHIHHHFFFFFFF@AECEC3
@HTKZQN1:329:C09BUACXX:6:1101:1701:2060 2:N:0:
CTTTGCCATTTAATTCATAAACTGCATCATCAGCATCCCTGTAGTCATCAAATTCCACAAATCCAAAACCATTTTTAATAAGAATCTCTCGTATTTTCCCA
+
CCCFFFFFHGHHHJJJIJJIJJJJFHGJJJJJJJJJGIJJIJJJIIJIJJHIJJJJIIHIJJJJJJJJJCEHIJJJJJJJGHHHHHFFFFFCDEEEEEDDD
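Going by the excerpt above, each header carries a read ID followed by a mate number (the 1 or 2 at the start of the second field), so sorting on that combination should bring the two mates of a pair next to each other. A minimal Python sketch of that idea, assuming every header in the file follows this style; the function names are made up for illustration, and sorting everything in memory like this may not be feasible for a really huge file:

```python
def parse_header(header):
    """Split a FASTQ header into (read_id, mate_number).

    Assumes headers of the form '@<read id> <mate>:<filter>:<control>:<index>',
    as in the excerpt; other styles (e.g. a '/1' or '/2' suffix) are not handled.
    """
    read_id, _, description = header.rstrip("\n").partition(" ")
    mate = int(description.split(":", 1)[0])
    return read_id, mate


def read_records(lines):
    """Group an iterable of FASTQ lines into 4-line records (lists of strings)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield record
            record = []


def sort_records(records):
    """Sort records by (read ID, mate number) so that mates become adjacent."""
    return sorted(records, key=lambda rec: parse_header(rec[0]))
```

With something like this, `sort_records(read_records(open("reads.fastq")))` (the file name is a placeholder) would return the records with mate 1 directly followed by mate 2 wherever both are present; reads that lost their mate during filtering would still need to be dealt with before splitting.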
Code:
#!/usr/local/bin/perl -w
# Daniel Brami
# Util to split interlaced FASTQ files into pairs
use strict;

# Standard lib
use IO::File;
use File::Basename;

my $INPUT = shift;
if (!(defined($INPUT)) || ($INPUT =~ '^\-')) {
    die "Usage: $0 <interleaved paired FASTQ file>\n";
}

# $name keeps the trailing dot, e.g. "reads." for "reads.fastq"
my ($name, $path, $suffix) = fileparse($INPUT, qw/fastq FASTQ txt TXT/);

my $FH_IN = new IO::File($INPUT, "r")
    or die "could not open $INPUT: $!\n";
my $FH_OUT1 = new IO::File($name."split1.$suffix", "w")
    or die "could not open ${name}split1.$suffix for writing: $!\n";
my $FH_OUT2 = new IO::File($name."split2.$suffix", "w")
    or die "could not open ${name}split2.$suffix for writing: $!\n";

my ($recs1, $recs2) = (0, 0);
my $flipflop = 1;    # 1 = write to file 1, -1 = write to file 2
my $counter  = 0;    # lines collected for the current record
my ($line, $TXT);

while ($line = $FH_IN->getline()) {
    $TXT .= $line;
    $counter++;
    # A FASTQ record is 4 lines; records are written alternately to the two files
    if ($counter == 4) {
        if ($flipflop == 1) {
            print $FH_OUT1 $TXT;
            ++$recs1;
        } else {
            print $FH_OUT2 $TXT;
            ++$recs2;
        }
        $counter = 0;
        $TXT = '';
        $flipflop *= -1;
    }
}

$FH_IN->close();
$FH_OUT1->close();
$FH_OUT2->close();

print STDERR "Processed $recs1 records for pair file 1 and $recs2 records for pair file 2.\n";
if ($recs1 != $recs2) {
    print STDERR "The number of processed records does not match - check input data!\n";
}
exit;
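The script above alternates strictly between the two output files, which is why it only works when mates are consecutive. As a rough sketch of an alternative (Python, not a tested tool), reads could instead be matched by ID while streaming through the file, so that no separate sorting step is needed. It assumes the header style shown in the excerpt (read ID, a space, then the mate number before the first colon), and all names (`split_pairs` and its file arguments) are hypothetical:

```python
def split_pairs(in_path, out1_path, out2_path, orphan_path):
    """Write mates to out1/out2 in matching order; unpaired reads go to orphan_path.

    Returns (number_of_pairs, number_of_orphans).
    """
    pending = {}  # read ID -> (mate number, record text) still waiting for its mate
    pairs = 0
    with open(in_path) as fin, open(out1_path, "w") as out1, \
            open(out2_path, "w") as out2, open(orphan_path, "w") as orphans:
        while True:
            record = [fin.readline() for _ in range(4)]  # one 4-line FASTQ record
            if not record[0].strip():
                break  # end of file (or a trailing blank line)
            read_id, _, description = record[0].rstrip("\n").partition(" ")
            mate = description.split(":", 1)[0]  # "1" or "2" (assumed header style)
            text = "".join(record)
            if read_id not in pending:
                pending[read_id] = (mate, text)
                continue
            other_mate, other_text = pending.pop(read_id)
            if other_mate == mate:
                raise ValueError("mate %s seen twice for %s" % (mate, read_id))
            if mate == "1":
                out1.write(text)
                out2.write(other_text)
            else:
                out1.write(other_text)
                out2.write(text)
            pairs += 1
        # Whatever is left never found its mate (e.g. lost during filtering)
        for _, text in pending.values():
            orphans.write(text)
    return pairs, len(pending)
```

A call such as `split_pairs("shuffled.fastq", "reads_1.fastq", "reads_2.fastq", "orphans.fastq")` (placeholder file names) would leave the two mate files in the same order, which is what paired-end mappers generally expect. Only reads whose mate has not been seen yet are held in memory, but in the worst case that can still be about half of the file.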