
  • split fastq file

    Hi,
    I have a single fastq file containing both mates of paired-end reads. I would like to split it into two files, each containing one mate of every pair. I have looked into Galaxy, but it needs the read pairs to be of equal size.

    Does anyone have a script for splitting a fastq file?

    Thank you.

  • #2
    Could you post the first few lines of the file? Is there something in the description line that indicates where to split them? Or are paired reads listed one right after the other?


    • #3
      The paired reads are listed as first mate read followed by second mate read.

      @HWI-ST945:92:d059facxx:8:1101:1567:2217 1:N:0:TGACCA
      @HWI-ST945:92:d059facxx:8:1101:1567:2217 2:N:0:TGACCA


      • #4
        Assuming you have the standard fastq file format with quality scores
        Code:
        @test1.1
        acgt
        +test1.1
        1234
        @test1.2
        acgt
        +test1.2
        1234
        Then this should work. It's quick and dirty and there may be more sophisticated solutions, but nevertheless:
        Code:
        sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' y.fastq > y_1.fastq
        sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' y.fastq > y_2.fastq
        If you only have the header lines as you posted them, it's simpler:
        Code:
        sed -ne '1~2p' x.fastq > x_1.fastq
        sed -ne '2~2p' x.fastq > x_2.fastq
        Both solutions assume that the reads are consecutive.
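
A small worked example may make the 8-line version easier to follow. This is only a sketch: demo.fastq and the read names in it are made up for illustration, and the first~step address form is a GNU sed extension, so GNU sed is assumed. The hold-space variant also keeps the whole output in memory until the last line, which may matter for very large files.

```shell
# Build a tiny interleaved FASTQ: two pairs, 4 lines per read (hypothetical data).
printf '%s\n' \
  '@r1/1' 'ACGT' '+' 'IIII' \
  '@r1/2' 'TGCA' '+' 'IIII' \
  '@r2/1' 'GGCC' '+' 'IIII' \
  '@r2/2' 'CCGG' '+' 'IIII' > demo.fastq

# Lines 1-4 of every 8-line block belong to the first mate, lines 5-8 to the second.
# H appends each selected line to the hold space; on the last line ($) the hold
# space is fetched (g), its leading newline is stripped, and the result is printed.
sed -ne '1~8H;2~8H;3~8H;4~8H;${g;s/^\n//;p}' demo.fastq > demo_1.fastq
sed -ne '5~8H;6~8H;7~8H;8~8H;${g;s/^\n//;p}' demo.fastq > demo_2.fastq

# List the read names that ended up in each output file.
grep '^@r' demo_1.fastq
grep '^@r' demo_2.fastq
```

Each output file should contain one complete 4-line record per pair, in the original order.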


        • #5
          You could also grep for the lines that contain the 1:N:0 pattern, together with the three lines following each match. You may have to get rid of the '--' separator lines that grep puts between the matches (though bwa and samtools don't seem to mind them).
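
One way the grep idea could look, as a sketch rather than a tested recipe for real data. The file demo.fastq and its read names are hypothetical. The pattern includes the leading space from the header comment; since a space is not a valid Phred+33 quality character, that pattern should not match a quality line. It does assume that the '+' line does not repeat the read name, and `grep -v '^--$'` would also drop any real line consisting of exactly two dashes.

```shell
# Tiny interleaved FASTQ with Illumina-style headers (hypothetical data).
printf '%s\n' \
  '@M1:1:FC:1:1101:1000:2000 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@M1:1:FC:1:1101:1000:2000 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@M1:1:FC:1:1101:1001:2001 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@M1:1:FC:1:1101:1001:2001 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > demo.fastq

# -A3 prints each matching header plus the three lines after it.
# grep inserts a "--" line between non-adjacent groups; the second grep removes it.
grep -A3 ' 1:N:0' demo.fastq | grep -v '^--$' > demo_1.fastq
grep -A3 ' 2:N:0' demo.fastq | grep -v '^--$' > demo_2.fastq

# Each file should have a multiple of 4 lines.
wc -l demo_1.fastq demo_2.fastq
```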


          • #6
            With one read per line, alternating between the two mates:

            awk '0 == (NR + 1) % 2' infile > end1 &
            awk '0 == (NR + 2) % 2' infile > end2 &
            Last edited by dcfargo; 08-31-2011, 09:03 AM.


            • #7
              Yet another solution. To add to dcfargo's solution, if the file (infile) is indeed in fastq format (4 lines per record, as shown by ocs), then this should work too

              awk '0 == ((NR+4) % 8)*((NR+5) % 8)*((NR+6) % 8)*((NR+7) %8)' infile > end1 &
              awk '0 == (NR % 8)*((NR+1) % 8)*((NR+2) % 8)*((NR+3) %8)' infile > end2
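
The products above are zero exactly when the line number falls in the wanted half of an 8-line block. The same selection can be written by testing the position within the block directly, which some may find easier to read. A minimal sketch, again on a made-up demo.fastq, run sequentially (no trailing &) so that both output files are complete before they are inspected:

```shell
# Tiny interleaved FASTQ: two pairs, 4 lines per read (hypothetical data).
printf '%s\n' \
  '@r1/1' 'ACGT' '+' 'IIII' \
  '@r1/2' 'TGCA' '+' 'IIII' \
  '@r2/1' 'GGCC' '+' 'IIII' \
  '@r2/2' 'CCGG' '+' 'IIII' > demo.fastq

# (NR - 1) % 8 is the 0-based position of a line inside its 8-line block:
# positions 0-3 are the first mate, positions 4-7 the second mate.
awk '(NR - 1) % 8 < 4'  demo.fastq > end1
awk '(NR - 1) % 8 >= 4' demo.fastq > end2

# Quick look at the first record of each output.
head -n 4 end1
head -n 4 end2
```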


              • #8
                Thank you all. Yes the file is fastq format with 4 lines per read. I was able to split my fastq file using both sed and awk commands.


                • #9
                  Just out of curiosity I tried a bit more with sed and came up with simpler solutions (for those who are interested). My initial solution is quite complicated.

                  This also works:
                  Code:
                  sed -ne '1~8p;2~8p;3~8p;4~8p' x.fastq > x_1.fastq
                  sed -ne '5~8p;6~8p;7~8p;8~8p' x.fastq > x_2.fastq
                  Even simpler:
                  Code:
                  sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                  sed -ne '4~8{N;N;N;p}' x.fastq > x_2.fastq
                  Also nice to see some awk solutions! Always exciting to see how things work in awk.


                  • #10
                    That's a very concise solution! However, I think that the commands should be:

                    Code:
                    sed -ne '1~8{N;N;N;p}' x.fastq > x_1.fastq
                    sed -ne '5~8{N;N;N;p}' x.fastq > x_2.fastq
                    For the second command I've replaced the 4 with a 5. This is because sed counts lines from 1, so the 4th line is actually the line at offset 3 (the quality line of the first mate), not the header of the second mate, which is on line 5.
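
A quick way to check which start line is right is to print only the lines that each address selects. The sketch below uses a made-up demo.fastq and assumes GNU sed, since first~step is a GNU extension:

```shell
# Tiny interleaved FASTQ: two pairs, 4 lines per read (hypothetical data).
printf '%s\n' \
  '@r1/1' 'ACGT' '+' 'IIII' \
  '@r1/2' 'TGCA' '+' 'IIII' \
  '@r2/1' 'GGCC' '+' 'IIII' \
  '@r2/2' 'CCGG' '+' 'IIII' > demo.fastq

# first~step matches lines first, first+step, first+2*step, ... (1-based).
sed -n '1~8p' demo.fastq   # headers of the first mates
sed -n '5~8p' demo.fastq   # headers of the second mates
sed -n '4~8p' demo.fastq   # quality lines of the first mates, not headers

# With the corrected start line, every record in x_2 begins with a header.
sed -ne '5~8{N;N;N;p}' demo.fastq > demo_2.fastq
```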


                    • #11
                      It is so helpful and effective! Great thanks!


                      • #12
                        I think grep is the easy option if read1 and read2 are not consecutive:

                        grep -A3 -P "1:N:" --no-group-separator in.fastq >in_1.fastq
                        grep -A3 -P "2:N:" --no-group-separator in.fastq >in_2.fastq

                        You can adapt the pattern to whatever your read names look like (/1, _1 or 1:N:#:#).
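
If --no-group-separator is not available (it is a GNU grep option), an awk variant of the same idea might look like the sketch below. It inspects only the header line of each 4-line record, so a quality string can never be mistaken for a header. The file names and read names are hypothetical, and the / 1:N:/ and / 2:N:/ patterns assume Illumina-style headers like the ones posted earlier in the thread.

```shell
# Non-interleaved order on purpose: both first mates, then both second mates.
printf '%s\n' \
  '@M1:1:FC:1:1101:1000:2000 1:N:0:TGACCA' 'ACGT' '+' 'IIII' \
  '@M1:1:FC:1:1101:1001:2001 1:N:0:TGACCA' 'GGCC' '+' 'IIII' \
  '@M1:1:FC:1:1101:1000:2000 2:N:0:TGACCA' 'TGCA' '+' 'IIII' \
  '@M1:1:FC:1:1101:1001:2001 2:N:0:TGACCA' 'CCGG' '+' 'IIII' > demo.fastq

awk -v out1=demo_1.fastq -v out2=demo_2.fastq '
  NR % 4 == 1 {                       # header line: decide where this record goes
    if ($0 ~ / 1:N:/)      out = out1
    else if ($0 ~ / 2:N:/) out = out2
    else                   out = ""   # unrecognised header: skip the whole record
  }
  out != "" { print > out }           # write all 4 lines of the record to that file
' demo.fastq

wc -l demo_1.fastq demo_2.fastq
```

Note that this only routes records by mate number; it does not check that the two output files list the pairs in the same order.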

                        Best,
