Hi,
With my rudimentary Linux skills I wanted to create a script that pre-processes my ddRAD data by extracting only paired-end sequences that have the correct restriction sites in the correct positions (as STACKS only checks for one).
The script I came up with (shown below) does work on my test data set; however, when I run it on my large FASTQ files, the process "grep -A 3 -E -f coordinates1.txt R2.fastq" peaks at ~40GB of RAM, and I killed it after about 30 minutes. Is there something inherently inefficient about the way grep uses a file of patterns as the search term? Bear in mind there are millions of search terms and millions of lines to search.
Would there be a more efficient way of doing this using a loop in Perl?
Any advice would be appreciated.
Cheers
##################################
##Gives cluster coordinates of sequences that have the CATGC restriction site (RS) directly after the first 5 bases (INDEX/BARCODE)
grep '@M00' -A 1 --no-group-separator R1.fastq | sed 's/^.....//' | grep '\<CATGC' -B 1 | grep 14:3 | cut -c 29-38 > coordinates1.txt
###Gives coordinates of PE sequences that contain CATGC in the correct position of R1 and the AATTC RS in the first 5 bases of R2
grep -A 3 -E -f coordinates1.txt R2.fastq | grep '\<AATTC' -A 2 -B 1 | grep M00 | cut -c 34-44 > coordinates2.txt
###De-multiplexes the raw fastq and pulls out R1 and R2 with correct RS
grep -A 3 -E -f coordinates2.txt R1.fastq > R1_correct_RS.fastq
grep -A 3 -E -f coordinates2.txt R2.fastq > R2_correct_RS.fastq
echo "Number of RS R1:"
grep -c '\<CATGC' R1_correct_RS.fastq
echo "Number of RS R2:"
grep -c '\<AATTC' R2_correct_RS.fastq
echo "These numbers should match!"
rm coordinates1.txt
rm coordinates2.txt
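For comparison, one way round the memory blow-up is to avoid grep's multi-pattern matching altogether: load the coordinates into a hash/set once, then stream the FASTQ a record (4 lines) at a time and test each header against the set, so memory scales with the coordinate list rather than with a compiled pattern automaton. A minimal sketch in Python, just to illustrate the idea (a Perl loop filling a %seen hash would be equivalent); the fixed column positions and file names are assumptions mirroring the `cut -c` calls above:

```python
# Sketch: hash-lookup replacement for `grep -A 3 -E -f coords.txt reads.fastq`.
# Assumes coords.txt holds one exact coordinate string per line, and that the
# same string sits at fixed columns of every FASTQ header (1-based, inclusive,
# like `cut -c start-end`). File names/columns here are placeholders.
import sys

def filter_fastq(coord_file, fastq_file, start, end, out=sys.stdout):
    with open(coord_file) as fh:
        coords = {line.strip() for line in fh}   # O(1) membership tests
    with open(fastq_file) as fh:
        while True:
            header = fh.readline()
            if not header:                       # end of file
                break
            # a FASTQ record is exactly 4 lines: header, seq, '+', quality
            record = header + fh.readline() + fh.readline() + fh.readline()
            if header[start - 1:end] in coords:  # slice = `cut -c start-end`
                out.write(record)

if __name__ == "__main__":
    # e.g. equivalent of: grep -A 3 -E -f coordinates2.txt R1.fastq
    filter_fastq("coordinates2.txt", "R1.fastq", 29, 38)
```

This makes a single pass over each FASTQ file regardless of how many coordinates there are. If the coordinates don't sit at fixed columns, a regex capture on the header (still followed by a set lookup) would serve the same purpose.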