SEQanswers

09-26-2011, 06:17 PM   #1
seqmonkey
Junior Member
 
Location: AZ

Join Date: Sep 2011
Posts: 4
Question regarding CASAVA output files

Quick question regarding the output files CASAVA produces when it converts BCL to fastq.

We have data from several lanes, and each lane has its own folder (e.g. Project_FC/Sample_lane1 followed by Sample_lane2, etc...).

In each folder, there are several files. The data is paired reads, so we have R1 and R2 for read 1 and read 2 respectively. However, CASAVA splits each read into separate fastq files numbered sequentially, e.g.:
lane1_NoIndex_L001_R1_001.fastq
lane1_NoIndex_L001_R1_002.fastq
lane1_NoIndex_L001_R1_003.fastq
lane1_NoIndex_L001_R1_004.fastq
lane1_NoIndex_L001_R1_005.fastq
lane1_NoIndex_L001_R1_006.fastq
lane1_NoIndex_L001_R1_007.fastq
lane1_NoIndex_L001_R1_008.fastq
lane1_NoIndex_L001_R1_009.fastq

In this case the files run from 001 to 009. In other lanes, they might run from 001 to 011, and so on. Do you know why the fastq files were split up? Is it purely due to the large file size (and if so, can they simply be "cat" together)?

Thanks!
09-26-2011, 08:43 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,153

Yes, CASAVA splits them up solely to limit the size of the files, primarily because Illumina's alignment program, Eland, doesn't deal well with larger data sets. And yes, you can simply cat them together. You can even cat the gzipped files directly, since concatenated gzip files are themselves a valid gzip file; some programs will work fine with those but others won't. To be completely safe you should unzip | concatenate | gzip. You can do this all in a single unix pipe:

Code:
# [0-9][0-9][0-9] rather than 00? so that chunks 010 and above are included
zcat lane1_NoIndex_L001_R1_[0-9][0-9][0-9].fastq.gz | gzip > lane1_NoIndex_L001_R1.fastq.gz
Do likewise for each read from each lane.
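
As a small illustration of the two options above, the sketch below builds two tiny gzipped chunks (the demo file names and reads are made up, not CASAVA output) and compares a plain cat of the .gz files against decompress | concatenate | recompress. gzip itself reads both results and yields identical text; the only difference is that the first is a multi-member gzip file, which is presumably what trips up the programs that "won't" work.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical demo chunks standing in for CASAVA's split files.
printf '@read1\nACGT\n+\nIIII\n' | gzip > demo_R1_001.fastq.gz
printf '@read2\nTTGA\n+\nIIII\n' | gzip > demo_R1_002.fastq.gz

# Option 1: concatenate the compressed files directly (multi-member gzip).
cat demo_R1_[0-9][0-9][0-9].fastq.gz > demo_R1.cat.fastq.gz

# Option 2: decompress, concatenate, recompress (single-member gzip).
gzip -dc demo_R1_[0-9][0-9][0-9].fastq.gz | gzip > demo_R1.fastq.gz

# Decompress both results and compare them byte for byte.
gzip -dc demo_R1.cat.fastq.gz > a.fastq
gzip -dc demo_R1.fastq.gz > b.fastq
if cmp -s a.fastq b.fastq; then same=yes; else same=no; fi

echo "lines: $(wc -l < a.fastq | tr -d ' '), identical: $same"
```

Because the chunk numbers are zero-padded, the shell expands the glob in numeric order, so the reads stay in the order CASAVA wrote them.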
09-27-2011, 12:34 AM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870

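If it's any help to anyone, the script below can be used to concatenate and filter all of the fastq files from a Casava 1.8 run in one go. Files that differ only in the trailing _NNN chunk number are merged into one output, and reads whose header carries the Y (filtered) flag are dropped. You just pass it the full list of all fastq files: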

eg: combine_fastq [run_folder]/Unaligned/Project*/Sample*/*fastq.gz

Code:
#!/usr/bin/perl
use warnings;
use strict;

my @files = @ARGV;

my @groups = group_files(@files);

foreach my $group (@groups) {

    warn "Writing to ".$group->{name}."\n";

    open OUT, '>', $group->{name} or die "Can't write to ".$group->{name}.": $!";

    foreach my $file (@{$group->{files}}) {

	warn "Filtering $file\n";
	
	# List-form pipe open avoids the shell, so odd characters in $file are safe
	open (IN, '-|', 'zcat', $file) or die "Can't read from $file: $!";

	while (<IN>) {
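	    # A ":Y:" flag in the Casava 1.8 header marks a read that failed
	    # Illumina's filter; discard the remaining three lines of its record.
	    if (/:Y:/) {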
		$_ = <IN>;
		$_ = <IN>;
		$_ = <IN>;
	    }
	    else {
		print OUT;
		print OUT scalar <IN>;
		print OUT scalar <IN>;
		print OUT scalar <IN>;
	    }
	}
    }

    close OUT or die "Failed to write to ".$group->{name}.":$!";

    warn "Compressing ".$group->{name}."\n";

    system('gzip', $group->{name}) == 0 or die "Failed to compress ".$group->{name}."\n";
}

sub group_files {
    my @files = @_;

    my %groups;

    foreach my $file (@files) {

	my $basename = $file;

	$basename =~ s/_\d{3}\.fastq\.gz$/.fastq/;

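	if ($basename eq $file) {
	    # Skip it: otherwise the output name would equal the input name
	    # and the input file would be overwritten.
	    warn "'$file' didn't look like a casava file, skipping\n";
	    next;
	}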

	unless (exists $groups{$basename}) {
	    $groups{$basename} = {name => $basename};
	}

	push @{$groups{$basename}->{files}},$file;

    }

    return values %groups;

}
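
For a one-off shell pipeline, the filtering step can be approximated with awk. This is only a sketch and not the script above: it assumes strictly four-line records and Casava 1.8 style headers, and the demo reads are invented.

```shell
#!/bin/sh
set -eu

# Invented demo records: the second read carries the Y (filtered) flag.
demo=$(mktemp)
printf '%s\n' \
  '@M:1:FC:1:1:10:10 1:N:0:' 'ACGT' '+' 'IIII' \
  '@M:1:FC:1:1:10:11 1:Y:0:' 'TTGA' '+' 'IIII' \
  '@M:1:FC:1:1:10:12 1:N:0:' 'GGCC' '+' 'IIII' > "$demo"

# On each header line (line 1, 5, 9, ...) decide whether to keep the record;
# the decision then carries over to the following three lines.
kept=$(awk 'NR % 4 == 1 { keep = ($0 !~ /:Y:/) } keep' "$demo")

printf '%s\n' "$kept"
rm -f "$demo"
```

On real data the awk command would sit between the decompress and gzip steps of the pipe instead of reading a demo file.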
10-02-2011, 05:47 PM   #4
Howie Goodell
Member
 
Location: Boston, MA

Join Date: Feb 2010
Posts: 10

Related warning: setting --fastq-cluster-count to a large value to avoid the need to combine files causes CASAVA 1.8 BCL conversion/demultiplexing (not just ELAND) to silently and unpredictably lose data. So I use simonandrews's approach of filtering and concatenating the split files. Remember to include undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
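
The post does not say how the loss was detected. One cheap check after merging, sketched here with hypothetical file names and assuming four-line FASTQ records, is to confirm that the R1 and R2 files for a lane hold the same number of records, since every pair contributes one read to each. It cannot catch clusters that are missing from both reads.

```shell
#!/bin/sh
set -eu

# count_records FILE.gz: number of FASTQ records, assuming 4 lines per record.
count_records() {
    lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
    echo $((lines / 4))
}

# Hypothetical demo files standing in for merged R1/R2 output.
workdir=$(mktemp -d)
printf '@p1/1\nACGT\n+\nIIII\n@p2/1\nTTGA\n+\nIIII\n' | gzip > "$workdir/demo_R1.fastq.gz"
printf '@p1/2\nCCGT\n+\nIIII\n@p2/2\nGTGA\n+\nIIII\n' | gzip > "$workdir/demo_R2.fastq.gz"

r1=$(count_records "$workdir/demo_R1.fastq.gz")
r2=$(count_records "$workdir/demo_R2.fastq.gz")

if [ "$r1" -eq "$r2" ]; then
    echo "OK: $r1 records in each read file"
else
    echo "MISMATCH: R1=$r1 R2=$r2" >&2
fi
```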
10-03-2011, 06:54 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,153

Quote:
Originally Posted by Howie Goodell
Related warning: setting --fastq-cluster-count to a large value to avoid the need to combine files causes CASAVA 1.8 BCL conversion/demultiplexing (not just ELAND) to silently and unpredictably lose data. So I use simonandrews's approach of filtering and concatenating the split files. Remember to include undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
Could you please explain this a little more? How did you discover that data was lost?