SEQanswers

Old 12-09-2015, 03:51 PM   #1
rcook34
Junior Member
 
Location: Vancouver, BC

Join Date: Sep 2012
Posts: 7
For loop in bash terminal to concatenate fastq files

Hi all,

I am trying to write a bash for loop that concatenates fastq files split by lane after demultiplexing. For example, after demultiplexing, my per-lane fastq files look like this:

HC13_S13_L001_R1_001.fastq
HC13_S13_L001_R2_001.fastq
HC13_S13_L002_R1_001.fastq
HC13_S13_L002_R2_001.fastq
HC13_S13_L003_R1_001.fastq
HC13_S13_L003_R2_001.fastq
HC13_S13_L004_R1_001.fastq
HC13_S13_L004_R2_001.fastq
HC14_S14_L001_R1_001.fastq
HC14_S14_L001_R2_001.fastq
HC14_S14_L002_R1_001.fastq
HC14_S14_L002_R2_001.fastq
HC14_S14_L003_R1_001.fastq
HC14_S14_L003_R2_001.fastq
HC14_S14_L004_R1_001.fastq
HC14_S14_L004_R2_001.fastq

I would like to merge these into four files: HC13_R1.fastq, HC13_R2.fastq, HC14_R1.fastq, and HC14_R2.fastq.

This is very easy to do one sample at a time without a loop, but it is extremely time-consuming when I am dealing with 25+ samples at a time.

cat HC13_S13_L00*_R1_001.fastq > HC13_R1.fq

The command I have tried to use to carry this out with a loop is below. It successfully merges the different lanes together, but does not create separate files for each sample, and I am not sure how to work that into my command.

for SUFFIX in R1_001.fastq R2_001.fastq; do cat *L001_$SUFFIX *L002_$SUFFIX *L003_$SUFFIX *L004_$SUFFIX > samplename_cat_$SUFFIX; done

Thanks!

Last edited by rcook34; 12-09-2015 at 04:14 PM.
Old 12-09-2015, 06:33 PM   #2
rwan
Member
 
Location: Hong Kong

Join Date: Feb 2013
Posts: 11

Hi,

I don't know if this is the best way as I'm not thinking correctly today, but this should get you started:

#!/bin/bash
SAMPLES="13 14"
READS="R1 R2"

for mysample in ${SAMPLES} ; do
    for myread in ${READS} ; do
        fn="HC${mysample}_${myread}.fastq"
        ## Might be a better idea to make the output file in another directory so that the "ls" won't catch it
        touch "${fn}"
        all_files=$(ls *${mysample}*${myread}*)
        for each_file in ${all_files} ; do
            printf "Add %s to %s here...\n" "${each_file}" "${fn}"
        done
    done
done

Basically, you need to nest two loops to generate each output file (i.e., the "touch"), then get a list of all the files that match that sample and read, and add them in.

Note that the ORDER of the files from the "ls" command isn't guaranteed. As you probably want them all in the same order, rather than ls all the files, you might want to do a for loop that goes through 001, 002, ... .
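
One way to sketch the explicit lane loop suggested above: go through the lanes in a fixed order and append each file to an output in a separate directory. The function name concat_by_lane, the "merged" output directory, and the fixed lane list L001 to L004 are assumptions for illustration, not part of the original script.

```bash
#!/bin/bash
# Sketch only: concat_by_lane, the lane list and the output directory
# are hypothetical choices. Run it from the directory holding the fastq files.
concat_by_lane() {
    local outdir=$1 sample mate lane f out
    shift                                   # remaining arguments are sample IDs
    mkdir -p "$outdir"
    for sample in "$@"; do
        for mate in R1 R2; do
            out="$outdir/HC${sample}_${mate}.fastq"
            : > "$out"                      # start from an empty output file
            for lane in L001 L002 L003 L004; do
                # Matches e.g. HC13_S13_L001_R1_001.fastq
                for f in HC"${sample}"_S*_"${lane}"_"${mate}"_001.fastq; do
                    [ -e "$f" ] || continue # the glob matched nothing
                    cat "$f" >> "$out"
                done
            done
        done
    done
}
```

It would be called as, for example, concat_by_lane merged 13 14. Because the output goes to its own directory, the input patterns never pick up a merged file.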

All that looks a bit messy...but hopefully it's enough to get you going, and maybe improve it.

Ray
Old 12-09-2015, 06:40 PM   #3
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367

Code:
for i in {13..14}; do for j in 1 2; do cat HC${i}_S${i}_*_R${j}_001.fastq > HC${i}_S${i}_R${j}_001.fastq; done; done
I do have a more sophisticated Python script that can handle cases where nothing about the filenames is known in advance, other than the position of the sample name and the lane number within the filename.

Last edited by blancha; 12-09-2015 at 07:49 PM. Reason: Improved the elegance of the code
Old 12-11-2015, 10:55 AM   #4
rcook34
Junior Member
 
Location: Vancouver, BC

Join Date: Sep 2012
Posts: 7

Thanks for the help! I was successful with blancha's command. I also want to try rwan's command when I get some time to look at this more next week.

blancha: I do have a follow-up question, if you don't mind. I am interested in adapting this code to also work for samples where the sample name and sample number don't match (i.e., if the files were actually named like this):

HC13_S5_L001_R1_001.fastq
HC13_S5_L001_R2_001.fastq
HC13_S5_L002_R1_001.fastq
HC13_S5_L002_R2_001.fastq
HC13_S5_L003_R1_001.fastq
HC13_S5_L003_R2_001.fastq
HC13_S5_L004_R1_001.fastq
HC13_S5_L004_R2_001.fastq
HC14_S19_L001_R1_001.fastq
HC14_S19_L001_R2_001.fastq
HC14_S19_L002_R1_001.fastq
HC14_S19_L002_R2_001.fastq
HC14_S19_L003_R1_001.fastq
HC14_S19_L003_R2_001.fastq
HC14_S19_L004_R1_001.fastq
HC14_S19_L004_R2_001.fastq

where the sample name could be anything and the sample number could be S followed by a number from 1 to 24.
So I modified your code (below). It works, but it also creates empty files for HC13_S19... and HC14_S5... I imagine this is because the loop over the sample number is nested inside the loop over the sample name. Is there a quick edit to get around this?

for i in {13,14}; do for k in {5,19}; do for j in 1 2; do cat HC${i}_S${k}_*_R${j}_001.fastq > HC${i}_S${k}_R${j}_001.fq; done; done; done

Much appreciated!!!
Old 12-11-2015, 01:34 PM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367

I find it quicker to just switch to Python when the scripts become too complicated.

I have a Python script that splits the filename into sample name and lane number. It then groups together the filenames with the same sample name, and writes a bash command to concatenate the FASTQ files with the same sample name but different lane numbers.

If you want to continue using Bash, there are many possible solutions.
The quickest is just to have two distinct loops, by copy-pasting the code for the loop a second time.
I'm sure there could be more elegant code, but someone else will have to come up with it.

It took me a while to figure out the few lines of Python code to parse the filename, but it is much more robust.
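
For anyone who wants to stay in Bash, here is a minimal sketch of the same filename-parsing idea (it is not the Python script mentioned above). It assumes every file is named <sample>_S<number>_L<lane>_R<read>_001.fastq and that every sample has a lane L001 file. The function name merge_lanes and its two directory arguments are made up for illustration.

```bash
#!/bin/bash
# Sketch only: merge_lanes and its arguments are hypothetical.
# Assumes names like HC13_S5_L001_R1_001.fastq and that each sample
# has an L001 file, which is used to discover the sample names.
merge_lanes() {
    local indir=$1 outdir=$2
    local f base sample mate
    mkdir -p "$outdir"
    for f in "$indir"/*_S*_L001_R[12]_001.fastq; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        base=${f##*/}                            # drop the directory part
        sample=${base%_S*_L001_R[12]_001.fastq}  # e.g. HC13
        mate=${base%_001.fastq}                  # e.g. HC13_S5_L001_R1
        mate=${mate##*_}                         # e.g. R1
        # The shell expands the glob in sorted order, so lanes are
        # concatenated as L001, L002, ...
        cat "$indir/$sample"_S*_L*_"$mate"_001.fastq > "$outdir/${sample}_${mate}.fastq"
    done
}
```

It would be called as, for example, merge_lanes . merged. Since the sample number is matched with a wildcard instead of a second loop, no empty files are created for combinations such as HC13_S19 that do not exist.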
Old 02-04-2016, 06:40 PM   #6
wyll
Junior Member
 
Location: TW

Join Date: Feb 2016
Posts: 6

Hello, I think this is a for loop question, so I am posting it here.

I am trying to run fastqc and trimmomatic in sequence.

My 1st attempt
#!/bin/bash

sequence= home/guest/guest06/p04448013/other/raw/frag_1.fastq.gz; home/guest/guest06/p04448013/other/raw/frag_2.fastq.gz

com1=fastqc $sequence
com2=java -jar ~/bin/Trimmomatic-0.33/trimmomatic-0.33.jar PE -phred33 $sequence CROP:75

for sequence in $sequence
do
com1 && com2
done

The error message came up saying "no such file or directory" but the files are there

=======================================
My 2nd attempt

#!/bin/bash
samples="frag"
reads="1 2"

for mysample in {frag}; do
for myread in {1 2}; do
for p in {p up}; do
com1=fastqc home/guest/guest06/p04448013/other/raw/${mysample}_${myread}.fastq.gz
com2=java -jar ~/bin/Trimmomatic-0.33/trimmomatic-0.33.jar PE -phred33 home/guest/guest06/p04448013/other/raw/${mysample}_${myread}.fastq.gz home/guest/guest06/p04448013/other/trimm/${mysample}_${myread}_${p}.fastq.gz CROP:75
for
do
com1 && com2
done
done
done
done

Now it says
/home/guest/guest06/p04448013/other/practice/run.sh: line 14: syntax error near unexpected token `newline'
/home/guest/guest06/p04448013/other/practice/run.sh: line 14: `for '

Which attempt is better and how do I fix the problems?
Thank you!

Last edited by wyll; 02-04-2016 at 06:44 PM.
Old 02-04-2016, 07:41 PM   #7
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

You need something between "for" and "do" in the innermost loop. [ I think ]
Do you even need it?
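
A sketch of how the second attempt might look without the empty inner loop. Two things to note: a line like com1=fastqc $sequence does not store a command (bash treats com1=fastqc as a temporary variable assignment and tries to run $sequence), and Trimmomatic in PE mode takes both read files in one call and writes four outputs, so no loop over the reads is needed. The function name qc_and_trim and the RUN dry-run switch are assumptions for illustration; the "p" and "up" suffixes are taken to mean paired and unpaired.

```bash
#!/bin/bash
# Sketch only. RUN is a hypothetical switch: with RUN=echo the commands
# are printed instead of executed (a dry run); leave it empty to run them.
RUN=${RUN:-}

qc_and_trim() {
    local raw=$1 out=$2 sample=$3
    local r1="$raw/${sample}_1.fastq.gz"
    local r2="$raw/${sample}_2.fastq.gz"

    # FastQC accepts several input files in one call.
    $RUN fastqc "$r1" "$r2" || return 1

    # Trimmomatic PE: two inputs, then paired (p) and unpaired (up)
    # outputs for each read, then the trimming step.
    $RUN java -jar "$HOME/bin/Trimmomatic-0.33/trimmomatic-0.33.jar" PE -phred33 \
        "$r1" "$r2" \
        "$out/${sample}_1_p.fastq.gz" "$out/${sample}_1_up.fastq.gz" \
        "$out/${sample}_2_p.fastq.gz" "$out/${sample}_2_up.fastq.gz" \
        CROP:75
}
```

To process several samples, loop over the sample names only, for example: for s in frag; do qc_and_trim raw trimm "$s"; done. Absolute paths need their leading slash (/home/..., not home/...).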