**BAMseek** · 08-30-2011, 07:29 PM

Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?

**Balat** · 08-30-2011, 07:39 PM

The paired reads are listed as first mate read followed by second mate read.

@HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
@HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA

**ocs** · 08-31-2011, 03:36 AM

Assuming you have the standard fastq file format with quality scores, for example:

Code:

@test1.1
acgt
+test1.1
1234
@test1.2
acgt
+test1.2
1234

Then this should work. It's quick and dirty, and there may be more sophisticated solutions, but nevertheless:

Code:

sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq

If you only have the header lines, as you have stated, it's simpler:

Code:

sed -ne '1~2p' x.fastq > x_1.fastq
sed -ne '2~2p' x.fastq > x_2.fastq

Both solutions assume that the two mates of each pair are listed consecutively.
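Since everything above depends on the mates being consecutive, it may be worth checking that before splitting. Here is a minimal sketch, not something from the posts above: `check_interleaved` is a made-up helper name, and it assumes 4 lines per record and Casava 1.8+ style headers, where the mate number comes after a space, so both mates share the same first field. It will not work as-is for names like `@test1.1` / `@test1.2`.

```shell
# check_interleaved FILE
# Hypothetical helper (an assumption, not from the thread): returns non-zero
# at the first 8-line block whose two header lines have different read names,
# or if the total line count is not a multiple of 8.
check_interleaved() {
    awk '
        BEGIN { bad = 0 }
        NR % 8 == 1 { name1 = $1 }          # header of the first mate
        NR % 8 == 5 && $1 != name1 {        # header of the second mate
            printf "mismatch at line %d: %s vs %s\n", NR, name1, $1
            bad = 1
            exit                            # jumps to END
        }
        END {
            if (!bad && NR % 8 != 0) {
                print "line count is not a multiple of 8"
                bad = 1
            }
            exit bad
        }
    ' "$1"
}
```

Usage would be something like `check_interleaved y.fastq && echo "looks interleaved"`.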

**swbarnes2** · 08-31-2011, 08:41 AM

You could also grep for each line that has the 1:N:0 pattern, together with the three lines following it. But you may have to get rid of the '--' separator lines that grep puts between the groups (though bwa and samtools don't seem to mind them).
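A rough sketch of that grep idea, under some assumptions: `extract_mate` is a made-up helper name, the headers are Casava 1.8+ style (` 1:N:0:...`), and the filter flag may be `Y` as well as `N`, hence the `[YN]`. The pattern is anchored on `@` and includes the space before the mate number, because a space cannot occur in a Phred+33 quality string, so a quality line should never match.

```shell
# extract_mate MATE IN OUT
# Hypothetical helper (an assumption, not from the thread). MATE is 1 or 2.
extract_mate() {
    # -A3 prints each matching header plus the 3 lines after it;
    # the second grep drops the "--" lines grep inserts between groups.
    grep -A3 "^@.* $1:[YN]:" "$2" | grep -v '^--$' > "$3"
}
```

Usage would be `extract_mate 1 in.fastq in_1.fastq` and `extract_mate 2 in.fastq in_2.fastq`. One caveat: a sequence or quality line consisting of exactly `--` would also be dropped, which should only matter for 2 bp reads.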

**dcfargo** · 08-31-2011, 08:57 AM

With one read per line and the mates on alternating lines:

awk '0 == (NR + 1) % 2' infile > end1 &
awk '0 == (NR + 2) % 2' infile > end2 &

**BAMseek** · 08-31-2011, 05:42 PM

Yet another solution. To add to dcfargo's solution: if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too:

awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
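The same selection can probably be done in a single pass over the file. This is only a sketch: `split_interleaved` is a made-up helper name, and it assumes 4 lines per record with the mates strictly alternating, as in the commands above.

```shell
# split_interleaved IN OUT1 OUT2
# Hypothetical helper (an assumption, not from the thread).
split_interleaved() {
    awk -v out1="$2" -v out2="$3" '
        # (NR - 1) % 8 is 0..3 for the first mate and 4..7 for the second.
        (NR - 1) % 8 < 4 { print > out1; next }
                         { print > out2 }
    ' "$1"
}
```

Usage would be `split_interleaved infile end1 end2`. It reads the input once instead of twice, which may matter for large files.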

**Balat** · 08-31-2011, 05:50 PM

Thank you all. Yes, the file is in fastq format with 4 lines per read. I was able to split my fastq file using both the sed and awk commands.

**ocs** · 08-31-2011, 11:48 PM

Just out of curiosity, I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

This also works:

Code:

sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq

Even simpler:

Code:

sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq

Also nice to see some awk solutions! Always exciting to see how things work in awk.

**robp** · 08-23-2013, 10:07 AM

That's a very concise solution! However, I think that the commands should be:

Code:

sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq

For the second command I've replaced the 4 with a 5. This is because sed counts lines from 1, so line 4 is the line at offset 3, i.e. the quality line of the first mate. The header for the second mate of the pair is on line 5.
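One way to see the difference is to feed sed plain line numbers instead of reads. A small sketch, where `show_selected` is a made-up helper name; note that `first~step` addresses are a GNU sed extension, so this assumes GNU sed.

```shell
# show_selected ADDRESS
# Hypothetical helper (an assumption, not from the thread): numbers 16 lines
# (two read pairs of 8 lines each) and prints, on one line, which line
# numbers the given first~step address selects together with {N;N;N;p}.
show_selected() {
    seq 1 16 | sed -n "$1"'{N;N;N;p}' | tr '\n' ' '
}
```

`show_selected '4~8'` selects lines 4 to 7 and 12 to 15, so each chunk starts on a quality line. `show_selected '5~8'` selects lines 5 to 8 and 13 to 16, which are the complete second-mate records.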

**skycreative** · 06-20-2016, 06:11 PM

This is so helpful and effective! Many thanks!

**tahia** · 09-22-2016, 07:55 AM

I think grep is the easy option if read 1 and read 2 are not consecutive:

grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

You can adjust the pattern to match however your read names mark the mate (/1, _1 or 1:N:#:#).
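For the non-consecutive case, another option might be to route whole records by position rather than by pattern alone. This is a sketch under assumptions: `split_by_header` is a made-up helper name, every record is exactly 4 lines, and the headers are Casava 1.8+ style, so the second whitespace-separated field starts with `1:` or `2:`. Because only every fourth line is inspected, a quality line can never be mistaken for a header.

```shell
# split_by_header IN OUT1 OUT2
# Hypothetical helper (an assumption, not from the thread).
split_by_header() {
    awk -v out1="$2" -v out2="$3" '
        NR % 4 == 1 {                        # header line of each 4-line record
            if ($2 ~ /^1:/)      { out = out1 }
            else if ($2 ~ /^2:/) { out = out2 }
            else {
                print "unrecognised header at line " NR ": " $0
                exit 1
            }
        }
        { print > out }                      # the record follows its header
    ' "$1"
}
```

Usage would be `split_by_header in.fastq in_1.fastq in_2.fastq`. Records are written in the order they are encountered, so the two outputs stay in sync only if both mates appear in the same relative order in the input.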

Best,


split fastq file
