**ehlin** · 12-06-2012, 08:35 PM

I haven't tried this, but would it be faster if you specified a file size rather than a line number?

**pallevillesen** · 12-07-2012, 12:37 AM

I really don't see anything faster than split, unless you want to parallelize it and let each job extract its own part of the file (using e.g. awk).

But for really large files, counting the lines (as input to awk) would also take a lot of time...

I would just split it as you do...
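A minimal sketch of a line-based split on a toy file, assuming the usual 4-lines-per-record FASTQ layout: the only real constraint is that the value given to `-l` is a multiple of 4, so no record is cut in half. The file name `demo.fastq`, the `chunk_` prefix, and the tiny read counts are made up for illustration.

```shell
#!/bin/sh
# Sketch: split a FASTQ file on a line count that is a multiple of 4,
# so every chunk holds whole 4-line records.
# demo.fastq and the "chunk_" prefix are placeholder names for this demo.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Build a toy FASTQ file with 10 records (40 lines).
i=1
while [ "$i" -le 10 ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
    i=$((i + 1))
done > demo.fastq

reads_per_chunk=4
lines_per_chunk=$((reads_per_chunk * 4))

# Writes chunk_aa, chunk_ab, chunk_ac holding 4, 4 and 2 records.
split -l "$lines_per_chunk" demo.fastq chunk_

wc -l chunk_*
```

On a real file only the `split` line is needed; the rest just builds test input.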

[palle@s01n11 3_adapter_trimmed]$ time split -l 4000000 fastqfile

real 2m17.853s
user 0m2.640s
sys 0m17.980s

For a 16 GB file, that is OK.

$ cat fastqfile | grep -e "^@" | wc -l

69332456 fastq records
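One caveat with counting records this way: a quality line can legitimately start with `@`, so grepping for `^@` may overcount. A small sketch comparing it with lines/4, assuming 4-line records with no wrapped sequences; the toy input is made up so that the difference shows.

```shell
#!/bin/sh
# Sketch: count FASTQ records by line count instead of grepping for "@".
# Assumes the usual 4-lines-per-record layout (no wrapped sequences).
set -e
fq=$(mktemp)   # placeholder input; a real run would point fq at the FASTQ file

# Two records; the second one's quality string happens to start with "@".
printf '@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n@III\n' > "$fq"

by_grep=$(grep -c '^@' "$fq")                  # header lines plus any quality line starting with "@"
by_lines=$(awk 'END { print NR / 4 }' "$fq")   # total lines divided by 4

echo "grep '^@': $by_grep  lines/4: $by_lines"
```

For this toy input the grep count is 3 while the file holds 2 records.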

Primer for awk:

fq=...
from=0
to=4000000
time cat $fq | awk "NR > $from && NR <= $to" >xaa
cat xaa | grep -e "^@" |wc -l

You could do this in a simple for loop in bash and submit each cat|awk to separate nodes of a cluster... but I doubt it's worth the hassle. Submit all your splits to a cluster and go grab a cup of coffee...

Edit: time cat $fq | awk "{ if (NR <= $from) next; if (NR <= $to) print; if (NR >= $to) exit; }" >xaa

This exits as soon as the wanted part has been extracted, which is much faster for large files (well, only for the first splits; for the last parts it still has to read through most of the file first).
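The same early-exit idea can be wrapped in a small function and called once per chunk. This is only a sketch: `extract_lines` is a hypothetical helper name, the toy input and `part_N.fastq` output names are made up, and the range is taken as (from, to], i.e. `from` exclusive and `to` inclusive.

```shell
#!/bin/sh
# Sketch: pull out lines (from, to] of a file and stop reading at line "to".
# extract_lines is a hypothetical helper name, not part of any tool.
set -e

extract_lines() {
    # $1 = file, $2 = from (exclusive), $3 = to (inclusive)
    awk -v from="$2" -v to="$3" 'NR > to { exit } NR > from' "$1"
}

workdir=$(mktemp -d)
fq="$workdir/demo.fastq"   # placeholder input with 6 toy records
i=1
while [ "$i" -le 6 ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
    i=$((i + 1))
done > "$fq"

# 2 reads (8 lines) per chunk: chunk 1 = lines 1-8, chunk 2 = lines 9-16, ...
lines_per_chunk=8
n=1
while [ "$n" -le 3 ]; do
    from=$(( (n - 1) * lines_per_chunk ))
    to=$(( n * lines_per_chunk ))
    extract_lines "$fq" "$from" "$to" > "$workdir/part_$n.fastq"
    n=$((n + 1))
done
```

Each call is independent, so the loop body is also the unit you would hand to a cluster job if you wanted to run chunks in parallel.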

**lorendarith** · 12-08-2012, 02:29 AM

Originally posted by ehlin:

> I haven't tried this, but would it be faster if you specified a file size rather than a line number?

Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?
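To make the concern concrete: a byte-based split (`split -b`) cuts wherever the byte count lands, so the last record of a chunk will usually be truncated, and pieces of a byte-split .gz file are not valid gzip files on their own. A sketch of one way around it for compressed input, assuming 4-line records: decompress to a pipe, split by lines, and recompress the chunks. The names `demo.fastq.gz` and `part_` are placeholders.

```shell
#!/bin/sh
# Sketch: split a gzipped FASTQ by lines without writing the uncompressed file.
# demo.fastq.gz and the "part_" prefix are placeholder names for this demo.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Toy input: 6 records, gzipped.
i=1
while [ "$i" -le 6 ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
    i=$((i + 1))
done | gzip -c > demo.fastq.gz

# Decompress to stdout and let split read stdin ("-"); 8 lines = 2 records.
gzip -dc demo.fastq.gz | split -l 8 - part_

# Recompress each chunk (part_aa -> part_aa.gz, ...).
for f in part_*; do
    gzip "$f"
done

ls part_*
```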

Originally posted by pallevillesen:

> I really don't see anything faster than split (unless you want to parallelize it and let each subroutine extract certain parts of the file) (using e.g. awk).

Thanks!

Though... 2 mins on a 16 GB file?

Tried splitting a 32 GB file and it took HOURS! There must have been something seriously wrong with our file system server...

**apredeus** · 12-08-2012, 08:21 AM

Originally posted by lorendarith:

> Any recommendations for faster splitting? awk, sed?
>
> Thanks!

Well, it's just reading it into memory and then writing it back, so it should be very fast.

Yes, you can use awk. How big do you want your small files?

You can do something like this (bash syntax):

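Code:

# reads.fastq is a placeholder for your input file
for i in $(seq 1 10)
do
  # chunk i = lines (i-1)*400000+1 .. i*400000, i.e. 100K reads
  awk -v v="$i" 'NR > (v-1)*400000 && NR <= v*400000' reads.fastq > "$i.fastq"
done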

That will break a 1M-read FASTQ file into ten files of 100K reads each (400,000 lines per file).

And it should be very quick, a few minutes even for very big files.
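A loop like this reads the whole input once per chunk. As a possible variation (a sketch, not what was posted), awk can write all the chunks in a single pass by switching output files every n lines. The toy input, the chunk size of 16 lines, and the `N.fastq` output names are assumptions for the demo; n must stay a multiple of 4.

```shell
#!/bin/sh
# Sketch: write every chunk in one pass over the input
# instead of one pass per chunk.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Toy input: 10 records (40 lines). demo.fastq is a placeholder name.
i=1
while [ "$i" -le 10 ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
    i=$((i + 1))
done > demo.fastq

# n = lines per chunk (a multiple of 4). Output goes to 1.fastq, 2.fastq, ...
awk -v n=16 '
    (NR - 1) % n == 0 {            # first line of a new chunk
        if (out != "") close(out)  # keep only one output file open at a time
        out = sprintf("%d.fastq", (NR - 1) / n + 1)
    }
    { print > out }
' demo.fastq

wc -l [0-9]*.fastq
```

For the 100K-read chunks discussed above, n would be 400000.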

PS sorry - you already got the question answered, I'm still asleep apparently

**lorendarith** · 12-08-2012, 10:09 AM

Originally posted by apredeus:

> PS sorry - you already got the question answered, I'm still asleep apparently

ALL suggestions are welcomed and appreciated! Thanks

**apredeus** · 12-08-2012, 10:32 AM

You're welcome. I've just changed the code a bit, I messed up a variable name within awk.

**pallevillesen** · 12-11-2012, 12:59 AM

Originally posted by lorendarith:

> Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?
>
> Thanks!
>
> Though... 2 mins on a 16 GB file?
>
> Tried splitting a 32 GB file and it took HOURS! There must have been something seriously wrong with our file system server...

Well... our cluster is brand new, with an 80 Gbit network between the nodes and the fileserver, which may make things run extremely fast here...

Anyway: your problem was solved.

**sklages** · 12-13-2012, 03:31 AM

Originally posted by lorendarith:

> Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?
>
> Thanks!
>
> Though... 2 mins on a 16 GB file?
>
> Tried splitting a 32 GB file and it took HOURS! There must have been something seriously wrong with our file system server...

No local storage? NFS?

**gsgs** · 12-13-2012, 05:12 AM

There should be a solution that just changes the directory list, file names and file sizes, while keeping the data where it is.

**sklages** · 12-13-2012, 05:28 AM

Originally posted by gsgs:

> there should be a solution to just change the directory list, file names, file sizes, while keeping the data where it is

Sure, but reading a 32G file and maybe rewriting it (in chunks) is terribly slow via NFS ...


Split fastq into smaller files
