View Single Post
Old 08-12-2011, 12:47 AM   #70
dawe
Senior Member
 
Location: 4530'25.22"N / 915'53.00"E

Join Date: Apr 2009
Posts: 258
Default

After some runs analyzed with CASAVA 1.8 I have the some considerations. I was a little skeptic about fastq in place of qseq, especially because the PF information was coded as a column (that I could easily filter with awk) while now is in the sequence ID. We've dropped any srf reference and decided to give fastq.gz a try. CASAVA official documents state I could filter QC-fails just like this:

Code:
for fastq in *.fastq.gz ; do zcat $fastq | grep
-A 4 '^@.* [^:]*:N:[^:]*:' > filtered_$fastq ; done
Unfortunately doesn't work... this does:

Code:
for fastq in *.fastq.gz ; do zgrep
-A 3 '^@.* [^:]*:N:[^:]*:' $fastq | grep -v -- '^--$' > filtered_$fastq ; done
This said, we've decide to align everything and skip the filtering. Once we have a SAM file, we use this simple awk command:

Code:
awk '{OFS="\t"; if(/:Y:/) $2=$2+512; print $0}'
to mark the QC failing reads just in the alignment file.
I should say we use bwa (and not ELAND) for alignments. Unfortunately bwa reads sequence ID in fastq as words and retains only the first one. This trims the QC info (because Y and N are just after a white space). This is a minor issue: we typically pipe fastq to bwa, now we just add a pipe module that translates spaces to underscores:

Code:
bwa aln GENOME <(zcat FILE.fastq.gz | sed -e "s/ /_/")
I was also skeptic about chunked fastq, especially when I had to deliver data. On the other side I found them very useful when running on a cluster: I can easily spawn multiple alignments to nodes without any additional preprocessing.
dawe is offline   Reply With Quote