#1 |
Junior Member
Location: AZ Join Date: Sep 2011
Posts: 4
Quick question regarding the output files after CASAVA converts them to fastq (from BCL).
We have data from several lanes, and each lane has its own folder (e.g. Project_FC/Sample_lane1 followed by Sample_lane2, etc...). In each folder, there are several files. The data is paired reads, so we have R1 and R2 for read 1 and read 2 respectively. However, CASAVA splits each read into separate fastq files numbered sequentially, e.g.:
lane1_NoIndex_L001_R1_001.fastq
lane1_NoIndex_L001_R1_002.fastq
lane1_NoIndex_L001_R1_003.fastq
lane1_NoIndex_L001_R1_004.fastq
lane1_NoIndex_L001_R1_005.fastq
lane1_NoIndex_L001_R1_006.fastq
lane1_NoIndex_L001_R1_007.fastq
lane1_NoIndex_L001_R1_008.fastq
lane1_NoIndex_L001_R1_009.fastq
So in this case, going from 001 to 009. In other lanes, it might go from 001 to 011, and so on.
Do you know why the fastq files were split up? Is it purely due to the large file size (and if so, can they simply be "cat" together)?
Thanks!
#2 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Yes, CASAVA splits them up solely to limit the size of the files. It is primarily done because Illumina's alignment program, Eland, can't deal with larger data sets well. And yes, you can simply cat them together. It is possible to cat together the gzipped files directly; some programs will work fine with those but others won't. To be completely safe you should unzip | concatenate | gzip. You can do this all in a single Unix pipe:
Code:
zcat lane1_NoIndex_L001_R1_00?.fastq.gz | gzip > lane1_NoIndex_L001_R1.fastq.gz
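Note that the `00?` pattern in the one-liner above only matches chunks 001 to 009, so a lane that runs to 011 would lose its last two chunks. A minimal sketch of a more general version follows, assuming every chunk set in a folder starts at `_001` and that chunk numbers are always three characters wide. The function name `merge_chunks` is made up for illustration, and `gzip -dc` is used in place of `zcat` only because `zcat` behaves differently on some systems.

```shell
#!/bin/sh
# merge_chunks DIR  (hypothetical helper, not part of CASAVA)
# For every chunk set PREFIX_001.fastq.gz, PREFIX_002.fastq.gz, ... in DIR,
# write a single PREFIX.fastq.gz holding all chunks in numeric order.
merge_chunks() {
    dir=$1
    for first in "$dir"/*_001.fastq.gz; do
        # Skip the unexpanded pattern when DIR holds no chunk files.
        [ -e "$first" ] || continue
        prefix=${first%_001.fastq.gz}
        # ??? matches any three-character chunk number (001 ... 011 ...).
        # The shell expands the pattern in sorted order, so the chunks
        # are decompressed and re-compressed in sequence.
        gzip -dc "$prefix"_???.fastq.gz | gzip > "$prefix.fastq.gz"
    done
}
```

Called as `merge_chunks Project_FC/Sample_lane1`, this would handle the R1 and R2 chunk sets in that folder in one pass, since each has its own `_001` file.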
#3 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
If it's any help to anyone, the script below can be used to concatenate and filter all of the fastq files from a Casava 1.8 run in one go. You just pass it the full list of all fastq files:
e.g.: combine_fastq [run_folder]/Unaligned/Project*/Sample*/*fastq.gz
Code:
#!/usr/bin/perl
use warnings;
use strict;

my @files = @ARGV;

# Group the chunk files by the name they share once the _NNN suffix is removed
my @groups = group_files(@files);

foreach my $group (@groups) {

    warn "Writing to ".$group->{name}."\n";
    open OUT, '>', $group->{name} or die "Can't write to ".$group->{name}.": $!";

    foreach my $file (@{$group->{files}}) {
        warn "Filtering $file\n";
        open (IN, "zcat $file |") or die "Can't read from $file: $!";

        while (<IN>) {
            if (/:Y:/) {
                # Header is flagged Y (read was filtered): skip the other three lines
                $_ = <IN>;
                $_ = <IN>;
                $_ = <IN>;
            }
            else {
                # Keep the header and the three lines that follow it
                print OUT;
                print OUT scalar <IN>;
                print OUT scalar <IN>;
                print OUT scalar <IN>;
            }
        }
        close IN;
    }

    close OUT or die "Failed to write to ".$group->{name}.": $!";

    warn "Compressing ".$group->{name}."\n";
    system("gzip ".$group->{name}) == 0 or die "Failed to compress ".$group->{name}."\n";
}

sub group_files {
    my @files = @_;
    my %groups;

    foreach my $file (@files) {
        my $basename = $file;
        # Strip the chunk number: foo_001.fastq.gz becomes foo.fastq
        $basename =~ s/_\d{3}\.fastq\.gz$/.fastq/;

        if ($basename eq $file) {
            warn "'$file' didn't look like a casava file\n";
        }

        unless (exists $groups{$basename}) {
            $groups{$basename} = {name => $basename};
        }
        push @{$groups{$basename}->{files}}, $file;
    }

    return values %groups;
}
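For a single file, the filtering step of the script above can be sketched as a shell filter. This is only an illustration under the same assumptions the script makes: records are exactly four lines each, and a header containing `:Y:` marks a read to drop. The function name `filter_pf` is made up for this example.

```shell
#!/bin/sh
# filter_pf  (hypothetical helper)
# Reads FASTQ on stdin and writes only the records whose header line
# does not contain ":Y:". Assumes strict four-line records.
filter_pf() {
    # Only header lines (line 1 of each group of 4) are tested, so a
    # quality string that happens to contain ":Y:" cannot drop a read.
    awk 'NR % 4 == 1 { keep = ($0 !~ /:Y:/) } keep'
}
```

It would be used in a pipe, e.g. `gzip -dc some_chunk.fastq.gz | filter_pf | gzip > filtered.fastq.gz`, where both file names are placeholders.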
#4 |
Member
Location: Boston, MA Join Date: Feb 2010
Posts: 10
Related warning: using a large value of --fastq-cluster-count to avoid having to combine files causes CASAVA 1.8 BCL conversion/demultiplexing -- not just ELAND -- to silently and unpredictably lose data. So I use simonandrew's approach of filtering the files together. Remember to include the undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
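Since the worry here is reads going missing without any error, one cheap sanity check is to count records before and after combining. The sketch below only verifies a plain merge step (it says nothing about what BCL conversion itself produced, and the counts will differ if reads were filtered on the way). `count_reads` is a made-up helper name and it assumes strict four-line FASTQ records.

```shell
#!/bin/sh
# count_reads FILE.gz [FILE.gz ...]  (hypothetical helper)
# Prints the total number of FASTQ records across the given gzipped files.
count_reads() {
    # gzip -dc writes all named files to stdout back to back;
    # each record is four lines, so reads = lines / 4.
    lines=$(gzip -dc "$@" | wc -l)
    echo $((lines / 4))
}
```

For an unfiltered merge, the count over all the `_NNN` chunk files should equal the count for the merged file.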
#5 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Quote: