
**kmcarr** · 09-26-2011, 07:43 PM

Yes, CASAVA splits them up solely to limit the size of the files. It is primarily done because Illumina's alignment program, Eland, can't deal with larger data sets well. And yes, you can simply cat them together. It is possible to cat the gzipped files together directly; some programs will work fine with the result but others won't. To be completely safe you should unzip | concatenate | gzip. You can do this all in a single Unix pipe:

Code:

zcat lane1_NoIndex_L001_R1_00?.fastq.gz | gzip > lane1_NoIndex_L001_R1.fastq.gz

Do likewise for each read from each lane.
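"Do likewise for each read from each lane" can be scripted as a loop. The following is only a minimal sketch: `merge_chunks` is a hypothetical helper name, it assumes every set of chunks starts at `_001` and follows the `<prefix>_NNN.fastq.gz` naming shown above, and it uses `gzip -dc` in place of `zcat` (the two are equivalent with GNU gzip).

```shell
# merge_chunks DIR
# For every <prefix>_001.fastq.gz in DIR, decompress all <prefix>_NNN.fastq.gz
# chunks in order and recompress them into a single <prefix>.fastq.gz.
# Hypothetical helper; assumes each chunk set begins with _001.
merge_chunks() {
    dir=$1
    for first in "$dir"/*_001.fastq.gz; do
        # If the glob matched nothing it is left unexpanded; skip it.
        [ -e "$first" ] || continue
        prefix=${first%_001.fastq.gz}
        # The glob expands in sorted order, so chunks are joined 001, 002, ...
        gzip -dc "$prefix"_[0-9][0-9][0-9].fastq.gz | gzip > "$prefix.fastq.gz"
    done
}
```

Calling `merge_chunks /path/to/Sample_dir` would then leave one merged file per lane and read next to the original chunks, which are not deleted.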

**simonandrews** · 09-26-2011, 11:34 PM

If it's any help to anyone, the script below can be used to concatenate and filter all of the fastq files from a Casava 1.8 run in one go. You just pass it the full list of all fastq files:

eg: combine_fastq [run_folder]/Unaligned/Project*/Sample*/*fastq.gz

Code:

#!/usr/bin/perl
use warnings;
use strict;

my @files = @ARGV;

my @groups = group_files(@files);

foreach my $group (@groups) {

    warn "Writing to ".$group->{name}."\n";

    open OUT, '>', $group->{name} or die "Can't write to ".$group->{name}.": $!";

    foreach my $file (@{$group->{files}}) {

	warn "Filtering $file\n";
	
	open (IN,"zcat $file |") or die "Can't read from $file: $!";

	while (<IN>) {
	    # A header containing :Y: marks a read that failed the filter; skip its whole record
	    if (/:Y:/) {
		$_ = <IN>;
		$_ = <IN>;
		$_ = <IN>;
	    }
	    else {
		print OUT;
		print OUT scalar <IN>;
		print OUT scalar <IN>;
		print OUT scalar <IN>;
	    }
	}
    }

    close OUT or die "Failed to write to ".$group->{name}.":$!";

    warn "Compressing ".$group->{name}."\n";

    system("gzip ".$group->{name}) == 0 or die "Failed to compress ".$group->{name}."\n";
}

sub group_files {
    my @files = @_;

    my %groups;

    foreach my $file (@files) {

	my $basename = $file;

	# Strip the _NNN chunk suffix; all chunks of one lane/read share the result
	$basename =~ s/_\d{3}\.fastq\.gz$/.fastq/;

	if ($basename eq $file) {
	    warn "'$file' didn't look like a casava file\n";
	}

	unless (exists $groups{$basename}) {
	    $groups{$basename} = {name => $basename};
	}

	push @{$groups{$basename}->{files}},$file;

    }

    return values %groups;

}
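One way to sanity-check the combined output is to compare read counts before and after. The sketch below is an illustration rather than part of the script above: `count_reads` and `count_filtered` are hypothetical helper names, and both assume strict 4-line FASTQ records with the filter flag appearing as `:Y:` in the header line, the same assumption the Perl script makes.

```shell
# count_reads FILE...
# Number of FASTQ records across the given gzipped files,
# assuming exactly four lines per record.
count_reads() {
    gzip -dc "$@" | awk 'END { print NR / 4 }'
}

# count_filtered FILE...
# Number of records whose header line (lines 1, 5, 9, ...) contains :Y:,
# i.e. the reads the Perl script above would drop.
count_filtered() {
    gzip -dc "$@" | awk 'NR % 4 == 1 && /:Y:/ { n++ } END { print n + 0 }'
}
```

If the merge went well, `count_reads` on the combined file should equal `count_reads` on the original chunks minus `count_filtered` on the same chunks.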

**Howie Goodell** · 10-02-2011, 04:47 PM

Related warning: trying to avoid the need to combine files by setting a large value of --fastq-cluster-count causes CASAVA 1.8 BCL conversion/demultiplexing (not just ELAND) to silently and unpredictably lose data. So I use simonandrews's approach of filtering the files together. Remember to include the undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).

**kmcarr** · 10-03-2011, 05:54 AM

Originally posted by Howie Goodell

Related warning: trying to avoid the need to combine files by setting a large value of --fastq-cluster-count causes CASAVA 1.8 BCL conversion/demultiplexing (not just ELAND) to silently and unpredictably lose data. So I use simonandrews's approach of filtering the files together. Remember to include the undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).

Could you please explain this a little more? How did you discover that data was lost?

Question regarding CASAVA output files