SEQanswers

Old 12-06-2012, 03:41 PM   #1
lorendarith
Guest
 

Posts: n/a
Split fastq into smaller files

Dear all,

I'm looking into splitting a FASTQ read file into several smaller files. It's basically just distributing batches of 4 lines (one record per read) into a certain number of files.

I'm trying with
Code:
split -l <number of lines per file> <FASTQ>
which works of course, but is too slooooooooooow on a HiSeq read file.
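
For reference, a concrete call (the line count has to be a multiple of 4 so no read gets cut in half; the file name is just an example, and the numeric suffixes / .fastq extension need a reasonably recent GNU split):

Code:
# 4,000,000 lines = 1,000,000 reads per output file
split -l 4000000 -d --additional-suffix=.fastq reads.fastq chunk_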

Any recommendations for faster splitting? awk, sed?

Thanks!
Old 12-06-2012, 08:35 PM   #2
ehlin
Member
 
Location: NYC

Join Date: Jan 2012
Posts: 12

I haven't tried this, but would it be faster if you specified a file size rather than a line number?
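
Something along these lines, maybe (untested; reads.fastq is just a placeholder) - though I'm not sure how you'd keep a 4-line record from being cut at a chunk boundary:

Code:
# split by size rather than by line count; -C avoids breaking individual lines,
# but a FASTQ record can still end up straddling two output files
split -C 500M reads.fastq chunk_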
Old 12-07-2012, 12:37 AM   #3
pallevillesen
Member
 
Location: Bioinformatics Research Center, Aarhus University, Denmark

Join Date: May 2012
Posts: 19

I really don't see anything faster than split (unless you want to parallelize it and have each job extract a certain part of the file, e.g. with awk).

But for really large files, just counting the lines (needed as input for awk) would also take a lot of time...

I would just split it as you do...

Code:
[palle@s01n11 3_adapter_trimmed]$ time split -l 4000000 fastqfile

real    2m17.853s
user    0m2.640s
sys     0m17.980s

For a 16 gig file - that is ok.

Code:
cat fastqfile | grep -e "^@" | wc -l

69332456 fastq records (a quick-and-dirty count - quality lines can also start with "@")

Primer for awk:

Code:
fq=...
from=0
to=4000000
time cat $fq | awk "NR > $from && NR <= $to" > xaa
cat xaa | grep -e "^@" | wc -l

You could do this in a simple for loop in bash and submit each cat|awk job to separate nodes of a cluster (rough sketch below)... but I doubt it's worth the hassle... submit all your splits to a cluster and go grab a cup of coffee...

Edit: time cat $fq | awk "{ if (NR <= $from) next; if (NR <= $to) print; if (NR > $to) exit; }" > xaa

This will exit as soon as the wanted part has been extracted and is much faster for large files (well - only for the first splits; for the last parts it still has to read through the whole file first).
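
Rough sketch of that bash loop (untested; it just runs the jobs in the background on one machine rather than submitting them to a cluster queue - the chunk size and the output prefix are placeholders):

Code:
fq=fastqfile
lines=4000000                          # 1,000,000 reads per chunk (4 lines per read)
total=$(wc -l < $fq)
i=0
for from in $(seq 0 $lines $((total - 1)))
do
    to=$((from + lines))
    i=$((i + 1))
    cat $fq | awk "{ if (NR <= $from) next; if (NR <= $to) print; if (NR > $to) exit; }" > part_$i.fastq &
done
wait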

Last edited by pallevillesen; 12-07-2012 at 12:48 AM. Reason: Added better awk solution
Old 12-08-2012, 02:29 AM   #4
lorendarith
Guest
 

Posts: n/a

Quote:
Originally Posted by ehlin View Post
I haven't tried this, but would it be faster if you specified a file size rather than a line number?
Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?

Quote:
Originally Posted by pallevillesen View Post
I really don't see anything faster than split (unless you want to parallelize it and have each job extract a certain part of the file, e.g. with awk).
Thanks! Though... 2mins on a 16 Gb file? Tried splitting a 32Gb file and it took HOURS! There must have been something seriously wrong with our file system server...
Old 12-08-2012, 08:21 AM   #5
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 150

Quote:
Originally Posted by lorendarith View Post

Any recommendations for faster splitting? awk, sed?

Thanks!
Well, it's just reading it into memory and then writing it back, so it should be very fast. Yes, you can use awk - how big do you want your small files to be?

you can do something like (bash syntax)

Code:
for i in `seq 1 10`
do
  # input file name is just a placeholder; 400,000 lines = 100,000 reads per chunk
  awk -v v=$i '{if (NR>(v-1)*400000 && NR<=v*400000) print}' reads.fastq > $i.fastq
done
That will break a FASTQ file of 1M reads into ten files of 100K reads each.

And it should be very quick - a few minutes even for very big files.
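
If looping over the file several times ever turns out to be too slow, the same thing can be done in a single pass - untested sketch, the input name and the chunk size are placeholders:

Code:
# one pass over the file: send each read (4 lines) to chunk_1.fastq, chunk_2.fastq, ...
# (with very many chunks, some awk implementations may hit the open-file limit)
awk -v n=100000 '{ f = "chunk_" (int((NR-1)/(n*4)) + 1) ".fastq"; print > f }' reads.fastq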

PS sorry - you already got the question answered, I'm still asleep apparently

Last edited by apredeus; 12-08-2012 at 10:32 AM.
Old 12-08-2012, 10:09 AM   #6
lorendarith
Guest
 

Posts: n/a

Quote:
Originally Posted by apredeus View Post
PS sorry - you already got the question answered, I'm still asleep apparently
ALL suggestions are welcomed and appreciated! Thanks
Old 12-08-2012, 10:32 AM   #7
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 150

You're welcome. I've just changed the code a bit; I had messed up a variable name within the awk command.
Old 12-11-2012, 12:59 AM   #8
pallevillesen
Member
 
Location: Bioinformatics Research Center, Aarhus University, Denmark

Join Date: May 2012
Posts: 19

Quote:
Originally Posted by lorendarith View Post
Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?

Thanks! Though... 2mins on a 16 Gb file? Tried splitting a 32Gb file and it took HOURS! There must have been something seriously wrong with our file system server...
Well... our cluster is brand new, with an 80 Gbit network between the nodes and the file server - that may be why things run so fast here...

Anyway: your problem was solved.
Old 12-13-2012, 03:31 AM   #9
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 623

Quote:
Originally Posted by lorendarith View Post
Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?



Thanks! Though... 2mins on a 16 Gb file? Tried splitting a 32Gb file and it took HOURS! There must have been something seriously wrong with our file system server...
No local storage? NFS?
Old 12-13-2012, 05:12 AM   #10
gsgs
Senior Member
 
Location: germany

Join Date: Oct 2009
Posts: 140

There should be a way to just rewrite the directory entries - file names and file sizes - while keeping the data where it is on disk.
Old 12-13-2012, 05:28 AM   #11
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 623

Quote:
Originally Posted by gsgs View Post
There should be a way to just rewrite the directory entries - file names and file sizes - while keeping the data where it is on disk.
Sure, but reading a 32G file and maybe rewriting it (in chunks) is terribly slow via NFS ...
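
If the node has any local scratch space, it is usually worth copying the file there first and splitting locally (the paths are just examples):

Code:
cp /nfs/project/reads.fastq /scratch/
cd /scratch
split -l 4000000 reads.fastq chunk_
# copy the chunks back to NFS afterwards if needed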