
  • For loop in bash terminal to concatenate fastq files

    Hi all,

    I am trying to write a bash for loop to concatenate fastq files that were split by lane after demultiplexing. For example, after demultiplexing, my non-concatenated fastq files are named like this:

    HC13_S13_L001_R1_001.fastq
    HC13_S13_L001_R2_001.fastq
    HC13_S13_L002_R1_001.fastq
    HC13_S13_L002_R2_001.fastq
    HC13_S13_L003_R1_001.fastq
    HC13_S13_L003_R2_001.fastq
    HC13_S13_L004_R1_001.fastq
    HC13_S13_L004_R2_001.fastq
    HC14_S14_L001_R1_001.fastq
    HC14_S14_L001_R2_001.fastq
    HC14_S14_L002_R1_001.fastq
    HC14_S14_L002_R2_001.fastq
    HC14_S14_L003_R1_001.fastq
    HC14_S14_L003_R2_001.fastq
    HC14_S14_L004_R1_001.fastq
    HC14_S14_L004_R2_001.fastq

    I would like to merge these into four files: HC13_R1.fastq, HC13_R2.fastq, HC14_R1.fastq, and HC14_R2.fastq.

    This is very easy to do without a loop; however, it is extremely time-consuming when I am dealing with 25+ samples at a time.

    cat HC13_S13_L00*_R1_001.fastq > HC13_R1.fq

    The loop I have tried is below. It successfully merges the different lanes together, but it does not create separate files for each sample, and I am not sure how to work that into my command.

    for SUFFIX in R1_001.fastq R2_001.fastq; do cat *L001_$SUFFIX *L002_$SUFFIX *L003_$SUFFIX *L004_$SUFFIX > samplename_cat_$SUFFIX; done

    Thanks!
    Last edited by rcook34; 12-09-2015, 05:14 PM.

  • #2
    Hi,

    I don't know if this is the best way as I'm not thinking correctly today, but this should get you started:

    #!/bin/bash
    SAMPLES="13 14"
    READS="R1 R2"

    for mysample in ${SAMPLES} ; do
        for myread in ${READS} ; do
            fn="HC${mysample}_${myread}.fastq"
            ## Might be a better idea to make the output file in another directory so that the "ls" won't catch it
            touch "${fn}"
            all_files=$(ls *${mysample}*${myread}*)
            for each_file in ${all_files} ; do
                printf "Add %s to %s here...\n" "${each_file}" "${fn}"
            done
        done
    done

    Basically, you need to nest two loops to generate each output file (i.e., the "touch"), then get a list of all the input files that match the sample and read, and append those to it.

    Note that the ORDER of the files from the "ls" command isn't guaranteed. As you probably want them all in the same order, rather than ls all the files, you might want to do a for loop that goes through 001, 002, ... .

    All that looks a bit messy...but hopefully it's enough to get you going, and maybe improve it.

    Ray
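
One way to act on the ordering concern above is to name the lanes explicitly rather than rely on a directory listing. This is only a minimal sketch: the function name `concat_lanes` and the default lane list are assumptions for illustration, and it assumes the `<sample>_S<n>_L00x_<read>_001.fastq` naming from the first post, with no underscore inside the sample name.

```shell
# concat_lanes SAMPLE_PREFIX READ [LANE...]   (hypothetical helper)
# Concatenate the per-lane files of one sample and read in a fixed lane order.
# SAMPLE_PREFIX is e.g. HC13_S13, READ is R1 or R2.
concat_lanes() {
    prefix=$1
    read_tag=$2
    shift 2
    # Default to the four lanes seen in this thread if none are given.
    [ "$#" -eq 0 ] && set -- L001 L002 L003 L004
    out="${prefix%%_*}_${read_tag}.fastq"   # HC13_S13 + R1 -> HC13_R1.fastq
    : > "$out"                              # start from an empty output file
    for lane in "$@"; do
        in_file="${prefix}_${lane}_${read_tag}_001.fastq"
        if [ -f "$in_file" ]; then
            cat "$in_file" >> "$out"
        else
            echo "warning: $in_file not found, skipping" >&2
        fi
    done
}
```

Calling `concat_lanes HC13_S13 R1` and then `concat_lanes HC13_S13 R2` would write HC13_R1.fastq and HC13_R2.fastq with the lanes in the same order in both files, which is what keeps the read pairs in sync.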

    • #3
      Code:
      for i in {13..14}; do for j in 1 2; do cat HC${i}_S${i}_*_R${j}_001.fastq >  HC${i}_S${i}_R${j}_001.fastq; done; done
      I do have a more sophisticated Python script that can handle cases where nothing is known about the filenames in advance, other than the position of the sample name and the lane number within the filename.
      Last edited by blancha; 12-09-2015, 08:49 PM. Reason: Improved the elegance of the code

      • #4
        Thanks for the help! I was successful with blancha's command, and I also want to try rwan's command when I get some time to look at this more next week.

        blancha: I do have a follow-up, if you don't mind. I would like to adapt this code to also work for samples where the sample name and sample number don't match (i.e., if the files were actually named like this):

        HC13_S5_L001_R1_001.fastq
        HC13_S5_L001_R2_001.fastq
        HC13_S5_L002_R1_001.fastq
        HC13_S5_L002_R2_001.fastq
        HC13_S5_L003_R1_001.fastq
        HC13_S5_L003_R2_001.fastq
        HC13_S5_L004_R1_001.fastq
        HC13_S5_L004_R2_001.fastq
        HC14_S19_L001_R1_001.fastq
        HC14_S19_L001_R2_001.fastq
        HC14_S19_L002_R1_001.fastq
        HC14_S19_L002_R2_001.fastq
        HC14_S19_L003_R1_001.fastq
        HC14_S19_L003_R2_001.fastq
        HC14_S19_L004_R1_001.fastq
        HC14_S19_L004_R2_001.fastq

        where the sample name could be anything and the sample number could be S<number from 1-24>.
        So I modified your code (below). It works, but it also creates empty files for HC13_S19... and HC14_S5... I imagine this is because the loop over the S sample number is nested inside the loop over the sample name, so every name is combined with every number. Not sure if there is a quick edit around this?

        for i in {13,14}; do for k in {5,19}; do for j in 1 2; do cat HC${i}_S${k}_*_R${j}_001.fastq > HC${i}_S${k}_R${j}_001.fq; done; done; done

        Much appreciated!!!

        • #5
          I find it quicker to just switch to Python when the scripts become too complicated.

          I have a Python script that splits each filename into sample name and lane number. It then groups the filenames that share a sample name and writes a bash command to concatenate the FASTQ files that have the same sample name but different lane numbers.

          If you want to continue using Bash, there are many possible solutions.
          The quickest is just to have two distinct loops, by copy-pasting the code for the loop a second time.
          I'm sure there could be more elegant code, but someone else will have to come up with it.

          It took me a while to figure out the few lines of Python code to parse the filename, but it is much more robust.
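
For what it's worth, the same grouping idea can be sketched in plain shell by taking the sample prefix from the filenames themselves instead of looping over every name and number combination. This is a rough sketch, not the Python script mentioned above: the function name `merge_all_samples` is made up for illustration, and it assumes every sample has an `_L001_R1_001.fastq` file and follows the naming shown in the previous post.

```shell
# merge_all_samples   (hypothetical helper)
# For every file ending in _L001_R1_001.fastq in the current directory,
# treat everything before the lane field as the sample prefix (e.g. HC13_S5),
# then concatenate all lanes of that prefix, once for R1 and once for R2.
merge_all_samples() {
    for first in *_L001_R1_001.fastq; do
        [ -e "$first" ] || continue           # the glob matched nothing
        prefix=${first%_L001_R1_001.fastq}    # HC13_S5_L001_R1_001.fastq -> HC13_S5
        for read_tag in R1 R2; do
            # The glob expands in sorted order, so lanes come out as L001, L002, ...
            cat "${prefix}"_L*_"${read_tag}"_001.fastq > "${prefix}_${read_tag}_001.fq"
        done
    done
}
```

Because only prefixes that actually exist on disk are visited, no empty HC13_S19... or HC14_S5... files should be created. Run it from the directory holding the fastq files with `merge_all_samples`.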

          • #6
            Hello, I think this is a for loop question, so I am posting it here.

            I am trying to run fastqc and trimmomatic in sequence.

            My 1st attempt
            #!/bin/bash

            sequence= home/guest/guest06/p04448013/other/raw/frag_1.fastq.gz; home/guest/guest06/p04448013/other/raw/frag_2.fastq.gz

            com1=fastqc $sequence
            com2=java -jar ~/bin/Trimmomatic-0.33/trimmomatic-0.33.jar PE -phred33 $sequence CROP:75

            for sequence in $sequence
            do
            com1 && com2
            done

            The error message came up saying "no such file or directory" but the files are there

            =======================================
            My 2nd attempt

            #!/bin/bash
            samples="frag"
            reads="1 2"

            for mysample in {frag}; do
            for myread in {1 2}; do
            for p in {p up}; do
            com1=fastqc home/guest/guest06/p04448013/other/raw/${mysample}_${myread}.fastq.gz
            com2=java -jar ~/bin/Trimmomatic-0.33/trimmomatic-0.33.jar PE -phred33 home/guest/guest06/p04448013/other/raw/${mysample}_${myread}.fastq.gz home/guest/guest06/p04448013/other/trimm/${mysample}_${myread}_${p}.fastq.gz CROP:75
            for
            do
            com1 && com2
            done
            done
            done
            done

            Now it says
            /home/guest/guest06/p04448013/other/practice/run.sh: line 14: syntax error near unexpected token `newline'
            /home/guest/guest06/p04448013/other/practice/run.sh: line 14: `for '

            Which attempt is better and how do I fix the problems?
            Thank you!
            Last edited by wyll; 02-04-2016, 07:44 PM.

            • #7
              You need something between "for" and "do" in the innermost loop (I think).
              Do you even need that loop?
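
A possible way to restructure the script, offered as a sketch under a few assumptions rather than a tested pipeline. In bash, a line like `com1=fastqc $sequence` does not store a command; it sets `com1` for the duration of one command and then tries to run whatever follows, which is a likely source of the "no such file or directory" message. The paths in the post also lack the leading "/". Shell functions avoid both problems. The names `run`, `qc_and_trim`, `RAW_DIR`, `TRIM_DIR`, `TRIM_JAR` and `DRY_RUN` are made up for illustration; the Trimmomatic call assumes paired-end mode, which takes both input files followed by four output files (paired and unpaired for each read).

```shell
# Hypothetical locations based on the paths in the post; adjust to your layout.
RAW_DIR=${RAW_DIR:-/home/guest/guest06/p04448013/other/raw}
TRIM_DIR=${TRIM_DIR:-/home/guest/guest06/p04448013/other/trimm}
TRIM_JAR=${TRIM_JAR:-$HOME/bin/Trimmomatic-0.33/trimmomatic-0.33.jar}

# run CMD [ARG...]: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# qc_and_trim SAMPLE: FastQC on both reads, then paired-end Trimmomatic.
# Trimmomatic only runs if FastQC succeeded (same idea as "com1 && com2").
qc_and_trim() {
    sample=$1
    r1="$RAW_DIR/${sample}_1.fastq.gz"
    r2="$RAW_DIR/${sample}_2.fastq.gz"
    run fastqc "$r1" "$r2" &&
    run java -jar "$TRIM_JAR" PE -phred33 "$r1" "$r2" \
        "$TRIM_DIR/${sample}_1_p.fastq.gz" "$TRIM_DIR/${sample}_1_up.fastq.gz" \
        "$TRIM_DIR/${sample}_2_p.fastq.gz" "$TRIM_DIR/${sample}_2_up.fastq.gz" \
        CROP:75
}

# Loop over sample names; "frag" is the only one in the post above.
# for sample in frag; do qc_and_trim "$sample"; done
```

Setting `DRY_RUN=1` before the loop prints the two commands per sample without running them, which is a cheap way to check the paths first. Only one loop (over samples) is needed here, because both reads go into a single paired-end call.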
