SEQanswers

Old 08-30-2011, 07:58 PM   #1
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default split fastq file

Hi,
I have a single fastq file containing both mates of paired-end reads. I would like to split this file into two files, each containing one mate of every pair. I have looked into Galaxy, but it needs the reads in each pair to be of equal length.

Does anyone have a script for splitting a fastq file?

Thank you.
Old 08-30-2011, 08:29 PM   #2
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?
Old 08-30-2011, 08:39 PM   #3
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default

The paired reads are listed as first mate read followed by second mate read.

@HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
@HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA
Old 08-31-2011, 04:36 AM   #4
ocs
Member
 
Location: Berlin, Germany

Join Date: May 2011
Posts: 27
Default

Assuming you have the standard fastq file format with quality scores:
Code:
@test1.1
acgt
+test1.1
1234
@test1.2
acgt
+test1.2
1234
Then this should work. It's quick and dirty, and there may be more sophisticated solutions, but nevertheless:
Code:
sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq
When you only have lines as you have stated, it's simpler:
Code:
sed -ne '1~2p' x.fastq > x_1.fastq
sed -ne '2~2p' x.fastq > x_2.fastq
Both solutions assume that the reads are consecutive.
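
Since both solutions rely on the mates being consecutive, it may be worth checking that first. Here is a minimal sketch, assuming 4 lines per record and headers like the ones Balat posted, where the name before the first space is identical for both mates. The function name check_interleaved and the demo reads are made up for illustration.
```shell
# Sketch: check that a fastq file is interleaved (mate 1 record, then mate 2 record).
# Assumes 4 lines per record and a read name (first field) shared by both mates.
check_interleaved() {
  awk '
    NR % 8 == 1 { name = $1 }                 # header of mate 1
    NR % 8 == 5 && $1 != name {               # header of mate 2 must carry the same name
      print "mismatch at line " NR; bad = 1; exit
    }
    END {
      if (!bad && NR % 8 != 0) { print "incomplete pair at end of file"; bad = 1 }
      if (!bad) print "interleaved OK"
      exit bad + 0
    }
  ' "$1"
}

# Demo on a throwaway file with made-up reads, so the sketch runs on its own.
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/demo.fastq"

check_interleaved "$tmp/demo.fastq"
```
Older header styles that end in /1 and /2 would need the suffix stripped before the comparison.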
Old 08-31-2011, 09:41 AM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

You could also grep for the lines that have the 1:N:0 pattern, together with the three lines following each match. But you may have to get rid of the '--' separators that grep puts between matches (though bwa and samtools don't seem to mind them).
Old 08-31-2011, 09:57 AM   #6
dcfargo
Member
 
Location: Chapel Hill

Join Date: Aug 2008
Posts: 22
Default

With one read per line, alternating between the two mates:

awk '0 == (NR + 1) % 2' infile > end1 &
awk '0 == (NR + 2) % 2' infile > end2 &

Old 08-31-2011, 06:42 PM   #7
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too

Quote:
awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
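
The same selection can be written in a single pass over the file. This is only a sketch under the same assumptions (4 lines per record, mates consecutive); the demo reads and the temporary directory are made up, and the output names end1 and end2 follow the commands above.
```shell
# Demo input with made-up reads: two interleaved pairs, 16 lines.
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@r1 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@r2 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@r2 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > "$tmp/infile"

# (NR - 1) % 8 is 0..3 for the mate 1 record and 4..7 for the mate 2 record,
# so one pass over the input can feed both output files.
awk -v out1="$tmp/end1" -v out2="$tmp/end2" '
  { if ((NR - 1) % 8 < 4) print > out1; else print > out2 }
' "$tmp/infile"

head -n 1 "$tmp/end1" "$tmp/end2"
```
Reading the input once instead of twice may matter for large fastq files, where the two background awk jobs would each scan the whole file.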
Old 08-31-2011, 06:50 PM   #8
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default

Thank you all. Yes, the file is in fastq format with 4 lines per read. I was able to split my fastq file using both the sed and awk commands.
Old 09-01-2011, 12:48 AM   #9
ocs
Member
 
Location: Berlin, Germany

Join Date: May 2011
Posts: 27
Default

Just out of curiosity, I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

This also works:
Code:
sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq
Even simpler:
Code:
sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq
Also nice to see some awk solutions! Always exciting to see how things work in awk.
Old 08-23-2013, 11:07 AM   #10
robp
Member
 
Location: Stony Brook, NY

Join Date: Aug 2013
Posts: 12
Default

That's a very concise solution! However, I think that the commands should be:

Code:
sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq
Where, for the second command, I've replaced the 4 with a 5. This is because sed counts lines from 1, so line 4 is the line at offset 3: the quality line of the first mate, not the header of the second mate, which is line 5.
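
A quick way to see the difference is to run both variants on a single made-up pair and look at the first line each one prints. This sketch assumes GNU sed, since the first~step address form is a GNU extension; the reads and file names are illustrative only.
```shell
# One made-up interleaved pair: lines 1-4 are mate 1, lines 5-8 are mate 2.
tmp=$(mktemp -d)
printf '%s\n' \
  '@p1 1:N:0' 'AAAA' '+' 'IIII' \
  '@p1 2:N:0' 'CCCC' '+' 'JJJJ' > "$tmp/x.fastq"

# 4~8 starts at the quality line of mate 1 and prints lines 4-7.
sed -ne '4~8{N;N;N;p}' "$tmp/x.fastq" > "$tmp/wrong_2.fastq"
# 5~8 starts at the header of mate 2 and prints lines 5-8.
sed -ne '5~8{N;N;N;p}' "$tmp/x.fastq" > "$tmp/x_2.fastq"

head -n 1 "$tmp/wrong_2.fastq"   # a quality string, not a header
head -n 1 "$tmp/x_2.fastq"       # the mate 2 header
```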
Old 09-22-2016, 08:55 AM   #11
tahia
Junior Member
 
Location: dhaka

Join Date: Aug 2010
Posts: 2
Default

I think grep will be easier if your read1 and read2 are not consecutive:

grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

You can adapt the pattern to however your read names mark the mate (/1, _1 or 1:N:#:#).

Best,
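
One caveat with a bare pattern such as "1:N:" is that the characters 1, : and N are also valid quality characters, so a quality line could match by chance. A possible variant, sketched here with awk, looks only at header lines. It assumes 4 lines per record and headers whose second field starts with 1: or 2:, as in the example posted earlier in the thread; the reads and file names below are made up.
```shell
# Made-up demo: mates are NOT consecutive, and one quality line contains "1:N:0".
tmp=$(mktemp -d)
printf '%s\n' \
  '@r1 1:N:0:TGACCA' 'ACGTACGT' '+' 'IIIIIIII' \
  '@r2 1:N:0:TGACCA' 'GGCCGGCC' '+' 'IIIIIIII' \
  '@r1 2:N:0:TGACCA' 'TGCATGCA' '+' 'II1:N:0I' \
  '@r2 2:N:0:TGACCA' 'CCGGCCGG' '+' 'IIIIIIII' > "$tmp/in.fastq"

# Decide the mate on header lines only (NR % 4 == 1), then route the whole
# 4-line record to the matching output file.
awk -v out1="$tmp/in_1.fastq" -v out2="$tmp/in_2.fastq" '
  NR % 4 == 1 {
    if ($2 ~ /^1:/) mate = 1
    else if ($2 ~ /^2:/) mate = 2
    else mate = 0                 # unrecognised header: record is skipped
  }
  mate == 1 { print > out1 }
  mate == 2 { print > out2 }
' "$tmp/in.fastq"

head -n 1 "$tmp/in_1.fastq" "$tmp/in_2.fastq"
```
For names ending in /1 or /2, the two header tests would need to match the end of the first field instead.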