  • lorendarith
    • Jun 2026

    Split fastq into smaller files

    Dear all,

    I'm looking into splitting a FASTQ read file into several smaller sized files. It's basically just distributing batches of 4 lines into a certain number of files.

    I'm trying with
    Code:
    split -l <number of lines per file> <FASTQ>
    which works of course, but is too slooooooooooow on a HiSeq read file.

    Any recommendations for faster splitting? awk, sed?

    Thanks!
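A minimal sketch of the line-based approach described above, assuming GNU coreutils `split` (for `-d` and `--additional-suffix`) and strict 4-line FASTQ records. The file `demo.fastq`, the `chunk_` prefix, and the tiny read count are made up for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Build a tiny 10-read FASTQ file so the example is self-contained.
for i in $(seq 1 10); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > "$workdir/demo.fastq"

reads_per_file=4

# Lines per chunk must be a multiple of 4 so no record is cut in half.
# -d gives numeric suffixes; --additional-suffix keeps the .fastq extension.
split -l $((reads_per_file * 4)) -d --additional-suffix=.fastq \
  "$workdir/demo.fastq" "$workdir/chunk_"

wc -l "$workdir"/chunk_*.fastq
```

The only FASTQ-specific detail is the multiplication by 4; everything else is plain `split`.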
  • ehlin
    Member
    • Jan 2012
    • 12

    #2
    I haven't tried this, but would it be faster if you specified a file size rather than a line number?


    • pallevillesen
      Member
      • May 2012
      • 19

      #3
      I really don't see anything faster than split (unless you want to parallelize it and let each subroutine extract certain parts of the file) (using e.g. awk).

      But for really large files, counting the lines (to get the input for awk) would also take a lot of time...

      I would just split it as you do...

      [palle@s01n11 3_adapter_trimmed]$ time split -l 4000000 fastqfile

      real 2m17.853s
      user 0m2.640s
      sys 0m17.980s

      For a 16 GB file, that is OK.

      cat fastq | grep -e "^@" | wc -l

      69332456 fastq records
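One caveat with counting records via grep: in Phred+33 encoded FASTQ a quality line can itself start with "@", so matching "^@" may overcount. A small sketch of the difference, assuming strict 4-line records; `demo.fastq` is a made-up two-read file.

```shell
workdir=$(mktemp -d)

# Two records; the quality string of the second one starts with "@".
printf '@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n@III\n' > "$workdir/demo.fastq"

# grep counts every line starting with "@", including that quality line.
grep_count=$(grep -c '^@' "$workdir/demo.fastq")

# Dividing the line count by 4 gives the record count for 4-line FASTQ.
record_count=$(( $(wc -l < "$workdir/demo.fastq") / 4 ))

echo "grep: $grep_count, wc/4: $record_count"
```

On real data the two numbers are often close, but only the line count divided by 4 is reliable.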

      Primer for awk:

      fq=...
      from=0
      to=4000000
      time cat $fq | awk "NR > $from && NR <= $to" >xaa
      cat xaa | grep -e "^@" |wc -l

      You could do this in a simple for loop in bash and submit each cat|awk to separate nodes of a cluster... but I doubt it's worth the hassle. Submit all your splits to a cluster and go grab a cup of coffee...

      Edit: time cat $fq | awk "{ if (NR <= $from) next; if (NR > $to) exit; print }" >xaa

      This exits once the wanted part has been extracted, which is much faster for large files (well, only for the first splits; for the last parts it still has to read through the file first).
      Last edited by pallevillesen; 12-07-2012, 12:48 AM. Reason: Added better awk solution
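Since each range extraction rereads the file from the start, one alternative is to let a single awk pass write all chunks. This is a sketch only, not what was timed above; `demo.fastq`, the `chunk_` prefix, and the chunk size are made-up values, and it assumes strict 4-line records.

```shell
workdir=$(mktemp -d)

# Tiny 10-read FASTQ file so the example is self-contained.
for i in $(seq 1 10); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > "$workdir/demo.fastq"

lines_per_chunk=16   # 4 reads x 4 lines; keep this a multiple of 4

awk -v n="$lines_per_chunk" -v prefix="$workdir/chunk_" '
  (NR - 1) % n == 0 {            # first line of a new chunk
    if (out != "") close(out)    # release the previous output file
    out = sprintf("%s%03d.fastq", prefix, int((NR - 1) / n))
  }
  { print > out }                # every line goes to the current chunk
' "$workdir/demo.fastq"

wc -l "$workdir"/chunk_*.fastq
```

The file is read once, and closing each finished chunk keeps the number of open files at one.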


      • lorendarith

        #4
        Originally posted by ehlin View Post
        I haven't tried this, but would it be faster if you specified a file size rather than a line number?
        Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?

        Originally posted by pallevillesen View Post
        I really don't see anything faster than split (unless you want to parallelize it and let each subroutine extract certain parts of the file) (using e.g. awk).
        Thanks! Though... 2mins on a 16 Gb file? Tried splitting a 32Gb file and it took HOURS! There must have been something seriously wrong with our file system server...
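On the truncation worry with compressed files: one way around it is to decompress on the fly and still split by lines, recompressing each chunk as it is written. A rough sketch, assuming a GNU coreutils `split` that supports `--filter`; `demo.fastq.gz` and the `chunk_` prefix are made-up names.

```shell
workdir=$(mktemp -d)

# Tiny gzipped 10-read FASTQ file so the example is self-contained.
for i in $(seq 1 10); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done | gzip > "$workdir/demo.fastq.gz"

# Decompress to a stream, split on record boundaries (16 lines = 4 reads),
# and recompress each chunk. split sets $FILE to the chunk name for --filter.
gzip -dc "$workdir/demo.fastq.gz" |
  split -l 16 -d --filter='gzip > "$FILE.fastq.gz"' - "$workdir/chunk_"

# Check the first chunk by decompressing it again.
first_chunk_lines=$(gzip -dc "$workdir/chunk_00.fastq.gz" | wc -l)
echo "lines in first chunk: $first_chunk_lines"
```

Because the split happens on the decompressed line stream, every chunk holds whole records and is a valid gzip file on its own.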


        • apredeus
          Senior Member
          • Jul 2012
          • 151

          #5
          Originally posted by lorendarith View Post

          Any recommendations for faster splitting? awk, sed?

          Thanks!
          Well, it's just reading the file and then writing it back, so it should be very fast. Yes, you can use awk. How big do you want your small files to be?

          you can do something like (bash syntax)

          Code:
          for i in `seq 1 10`
          do
            # input.fastq is a placeholder for your FASTQ file; 400000 lines = 100K reads
            awk -v v=$i '{if (NR>(v-1)*400000 && NR<=v*400000) print}' input.fastq > $i.fastq
          done
          That will break a 1M-read FASTQ file into ten 100K-read files.

          And it should be very quick, a few minutes even for very big files.

          PS sorry - you already got the question answered, I'm still asleep apparently
          Last edited by apredeus; 12-08-2012, 10:32 AM.
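If the goal is a fixed number of output files rather than a fixed chunk size, the records can also be dealt out round-robin in one pass, with no need to know the total line count first. A sketch under the assumption of strict 4-line records; `demo.fastq`, the `part_` prefix, and `n_files` are made-up values.

```shell
workdir=$(mktemp -d)

# Tiny 10-read FASTQ file so the example is self-contained.
for i in $(seq 1 10); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > "$workdir/demo.fastq"

n_files=3

awk -v n="$n_files" -v prefix="$workdir/part_" '{
  idx = int((NR - 1) / 4) % n      # record index modulo number of files
  print > (prefix idx ".fastq")    # all 4 lines of a record share one idx
}' "$workdir/demo.fastq"

wc -l "$workdir"/part_*.fastq
```

All output files stay open for the whole run, so this suits a modest number of parts; the reads in each part are interleaved rather than contiguous.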


          • lorendarith

            #6
            Originally posted by apredeus View Post
            PS sorry - you already got the question answered, I'm still asleep apparently
            ALL suggestions are welcomed and appreciated! Thanks


            • apredeus
              Senior Member
              • Jul 2012
              • 151

              #7
              You're welcome. I've just changed the code a bit, I messed up a variable name within awk.


              • pallevillesen
                Member
                • May 2012
                • 19

                #8
                Originally posted by lorendarith View Post
                Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?

                Thanks! Though... 2mins on a 16 Gb file? Tried splitting a 32Gb file and it took HOURS! There must have been something seriously wrong with our file system server...
                Well... Our cluster is brand new with 80 Gbit network between nodes and the fileserver - that may cause things to run extremely fast here...

                Anyway: your problem was solved.


                • sklages
                  Senior Member
                  • May 2008
                  • 628

                  #9
                  Originally posted by lorendarith View Post
                  Haven't tried it, but wouldn't this result in truncated FASTQ entries, especially if you are doing it on compressed files to save time?



                  Thanks! Though... 2mins on a 16 Gb file? Tried splitting a 32Gb file and it took HOURS! There must have been something seriously wrong with our file system server...
                  No local storage? NFS?


                  • gsgs
                    Senior Member
                    • Oct 2009
                    • 139

                    #10
                    there should be a solution to just change the directory list, file names,
                    file sizes, while keeping the data where it is


                    • sklages
                      Senior Member
                      • May 2008
                      • 628

                      #11
                      Originally posted by gsgs View Post
                      there should be a solution to just change the directory list, file names,
                      file sizes, while keeping the data where it is
                      Sure, but reading a 32G file and maybe rewriting it (in chunks) is terribly slow via NFS ...
