  • Question regarding CASAVA output files

    Quick question regarding the output files after CASAVA converts them to fastq (from BCL).

    We have data from several lanes, and each lane has its own folder (e.g. Project_FC/Sample_lane1 followed by Sample_lane2, etc...).

    In each folder, there are several files. The data is paired reads, so we have R1 and R2 for read 1 and read 2 respectively. However, CASAVA splits each read into separate fastq files numbered sequentially, e.g.:
    lane1_NoIndex_L001_R1_001.fastq
    lane1_NoIndex_L001_R1_002.fastq
    lane1_NoIndex_L001_R1_003.fastq
    lane1_NoIndex_L001_R1_004.fastq
    lane1_NoIndex_L001_R1_005.fastq
    lane1_NoIndex_L001_R1_006.fastq
    lane1_NoIndex_L001_R1_007.fastq
    lane1_NoIndex_L001_R1_008.fastq
    lane1_NoIndex_L001_R1_009.fastq

    So in this case, going from 001 to 009. In other lanes, it might go from 001 to 011, and so on. Do you know why the fastq files were split up? Is it purely due to the large file size (and if so, can they simply be "cat" together)?

    Thanks!

  • #2
    Yes, CASAVA splits them up solely to limit file size, mainly because Illumina's alignment program, ELAND, doesn't handle larger data sets well. And yes, you can simply cat them together. You can even cat the gzipped files directly; some programs read the result fine, but others won't. To be completely safe, unzip, concatenate, and re-gzip, which you can do in a single Unix pipe:

    Code:
    zcat lane1_NoIndex_L001_R1_00?.fastq.gz | gzip > lane1_NoIndex_L001_R1.fastq.gz
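    Do likewise for each read from each lane. Note that the 00? pattern only matches chunks 001 to 009; for a lane that runs to 011, widen it to 0??.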
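
To repeat this for every read of every lane, a small wrapper along these lines could help. It's only a sketch: merge_chunks is a made-up name, and it assumes the chunks follow the _NNN.fastq.gz naming shown in the question. It uses gzip -dc rather than zcat because zcat on some systems (macOS, for one) expects a .Z suffix.

```shell
# merge_chunks PREFIX  (hypothetical helper, not part of CASAVA)
# Decompress PREFIX_001.fastq.gz, PREFIX_002.fastq.gz, ... in order
# and write them as one gzip stream to PREFIX.fastq.gz.
merge_chunks() {
    prefix=$1
    # The glob expands in lexical order; the zero padding keeps
    # 009 ahead of 010, so chunk order is preserved.
    set -- "$prefix"_[0-9][0-9][0-9].fastq.gz
    # With no match the pattern stays unexpanded, so test the first word.
    if [ ! -e "$1" ]; then
        echo "no chunks found for $prefix" >&2
        return 1
    fi
    gzip -dc "$@" | gzip > "$prefix.fastq.gz"
}

# Possible usage, one call per read per lane:
#   for first in Project_FC/Sample_*/*_001.fastq.gz; do
#       merge_chunks "${first%_001.fastq.gz}"
#   done
```

The output name has no _NNN suffix, so rerunning the function won't pick up its own output as a chunk.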


    • #3
      If it's any help to anyone, the script below concatenates and filters all of the fastq files from a CASAVA 1.8 run in one go. You just pass it the full list of fastq files:

      eg: combine_fastq [run_folder]/Unaligned/Project*/Sample*/*fastq.gz

      Code:
      #!/usr/bin/perl
      use warnings;
      use strict;
      
      my @files = @ARGV;
      
      my @groups = group_files(@files);
      
      foreach my $group (@groups) {
      
          warn "Writing to ".$group->{name}."\n";
      
          open OUT, '>', $group->{name} or die "Can't write to ".$group->{name}.": $!";
      
          foreach my $file (@{$group->{files}}) {
      
      	warn "Filtering $file\n";
      	
      	open (IN,"zcat $file |") or die "Can't read from $file: $!";
      
      	while (<IN>) {
      	    if (/:Y:/) {
      		$_ = <IN>;
      		$_ = <IN>;
      		$_ = <IN>;
      	    }
      	    else {
      		print OUT;
      		print OUT scalar <IN>;
      		print OUT scalar <IN>;
      		print OUT scalar <IN>;
      	    }
      	}
          }
      
          close OUT or die "Failed to write to ".$group->{name}.":$!";
      
          warn "Compressing ".$group->{name}."\n";
      
          system("gzip ".$group->{name}) == 0 or die "Failed to compress ".$group->{name}."\n";
      }
      
      sub group_files {
          my @files = @_;
      
          my %groups;
      
          foreach my $file (@files) {
      
      	my $basename = $file;
      
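      	$basename =~ s/_\d{3}\.fastq\.gz$/.fastq/;
      
      	if ($basename eq $file) {
      	    warn "'$file' didn't look like a casava file\n";
      	    # Skip it: otherwise the output name equals the input name
      	    # and the input would be overwritten.
      	    next;
      	}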
      
      	unless (exists $groups{$basename}) {
      	    $groups{$basename} = {name => $basename};
      	}
      
      	push @{$groups{$basename}->{files}},$file;
      
          }
      
          return values %groups;
      
      }
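
For a single lane, the same filtering step can be sketched as a shell pipeline. This is an illustration rather than a replacement for the script: drop_filtered is a made-up name, and it assumes strict four-line FASTQ records with CASAVA 1.8 headers, where :Y: in the header marks a read that failed the filter.

```shell
# drop_filtered  (hypothetical helper)
# Reads FASTQ on stdin and prints only the records whose header line
# does not contain ":Y:". Only every fourth line, starting with the
# first, is tested, so a quality string that happens to contain ":Y:"
# cannot trigger the filter.
drop_filtered() {
    awk 'NR % 4 == 1 { keep = ($0 !~ /:Y:/) } keep'
}

# Possible usage:
#   gzip -dc lane1_NoIndex_L001_R1_0??.fastq.gz | drop_filtered \
#       | gzip > lane1_NoIndex_L001_R1.fastq.gz
```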


      • #4
        Related warning: trying to avoid the need to combine files with a large value of --fastq-cluster-count causes CASAVA 1.8 BCL conversion/demultiplexing -- not just ELAND -- to silently and unpredictably lose data. So I use simonandrew's approach of filtering files together. Remember to include undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
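
One cheap sanity check after combining files is to compare read counts. The sketch below is an assumption about what a useful check looks like, not the method used to find the data loss described here; count_reads and check_pair are made-up names, and both assume strict four-line FASTQ records.

```shell
# count_reads FILE...  (hypothetical helper)
# Prints the total number of FASTQ records in the given gzipped files.
count_reads() {
    echo $(( $(gzip -dc "$@" | wc -l) / 4 ))
}

# check_pair R1_FILE R2_FILE  (hypothetical helper)
# Fails if the two files of a pair hold different numbers of reads.
check_pair() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -ne "$r2" ]; then
        echo "read count mismatch: $1 has $r1, $2 has $r2" >&2
        return 1
    fi
    echo "$r1 read pairs"
}

# Possible usage:
#   check_pair lane1_NoIndex_L001_R1.fastq.gz lane1_NoIndex_L001_R2.fastq.gz
#   count_reads lane1_NoIndex_L001_R1_0??.fastq.gz   # total across chunks
```

A matching count only shows that R1 and R2 are consistent with each other; reads dropped from both files equally would go unnoticed.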


        • #5
          Originally posted by Howie Goodell
          Related warning: trying to avoid the need to combine files with a large value of --fastq-cluster-count causes CASAVA 1.8 BCL conversion/demultiplexing -- not just ELAND -- to silently and unpredictably lose data. So I use simonandrew's approach of filtering files together. Remember to include undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
          Could you please explain this a little more? How did you discover that data was lost?
