  • Question regarding CASAVA output files

    Quick question regarding the output files after CASAVA converts them to fastq (from BCL).

    We have data from several lanes, and each lane has its own folder (e.g. Project_FC/Sample_lane1 followed by Sample_lane2, etc...).

    In each folder, there are several files. The data is paired reads, so we have R1 and R2 for read 1 and read 2 respectively. However, CASAVA splits each read into separate fastq files numbered sequentially, e.g.:
    lane1_NoIndex_L001_R1_001.fastq
    lane1_NoIndex_L001_R1_002.fastq
    lane1_NoIndex_L001_R1_003.fastq
    lane1_NoIndex_L001_R1_004.fastq
    lane1_NoIndex_L001_R1_005.fastq
    lane1_NoIndex_L001_R1_006.fastq
    lane1_NoIndex_L001_R1_007.fastq
    lane1_NoIndex_L001_R1_008.fastq
    lane1_NoIndex_L001_R1_009.fastq

    So in this case, going from 001 to 009. In other lanes, it might go from 001 to 011, and so on. Do you know why the fastq files were split up? Is it purely due to the large file size (and if so, can they simply be "cat" together)?

    Thanks!

  • #2
    Yes, CASAVA splits them up solely to limit the size of the files. This is done primarily because Illumina's alignment program, ELAND, can't deal well with larger data sets. And yes, you can simply cat them together. It is also possible to cat together the gzipped files directly; some programs will work fine with those but others won't. To be completely safe you should unzip | concatenate | gzip. You can do this all in a single Unix pipe:

    Code:
    zcat lane1_NoIndex_L001_R1_00?.fastq.gz | gzip > lane1_NoIndex_L001_R1.fastq.gz
    Do likewise for each read from each lane.
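
To repeat that one-liner for every read in a sample directory, a loop along the following lines may help. This is only a sketch: the function name `merge_chunks` is made up for illustration, and it assumes chunk files end in `_NNN.fastq.gz` with numbering that starts at `_001`, as in the listing in the question. It uses `gzip -dc` instead of `zcat` because `zcat` behaves differently on some systems.

```shell
#!/bin/sh
# Sketch: merge CASAVA chunk files (..._001.fastq.gz, ..._002.fastq.gz, ...)
# into a single gzipped FASTQ per read, for every read found in a directory.
# merge_chunks is a hypothetical helper name, not part of CASAVA.
merge_chunks() (
    dir=$1
    for first in "$dir"/*_001.fastq.gz; do
        # An unmatched glob stays literal, so skip it if nothing exists.
        [ -e "$first" ] || continue
        # lane1_NoIndex_L001_R1_001.fastq.gz -> lane1_NoIndex_L001_R1
        prefix=${first%_001.fastq.gz}
        # Check every chunk is intact before merging.
        gzip -t "$prefix"_[0-9][0-9][0-9].fastq.gz || exit 1
        # The glob expands in sorted order, so chunks stay in sequence.
        # Decompress, concatenate and recompress in a single pipe.
        gzip -dc "$prefix"_[0-9][0-9][0-9].fastq.gz | gzip > "$prefix.fastq.gz"
    done
)

# Merge every directory named on the command line (does nothing with no arguments).
for sample_dir in "$@"; do
    merge_chunks "$sample_dir"
done
```

Saved as a script, it would be called with the sample folders as arguments, e.g. `sh merge_chunks.sh Project_FC/Sample_lane1 Project_FC/Sample_lane2`. The original chunk files are left in place.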


    • #3
      If it's any help to anyone, the script below can be used to concatenate and filter all of the fastq files from a CASAVA 1.8 run in one go. You just pass it the full list of all fastq files:

      eg: combine_fastq [run_folder]/Unaligned/Project*/Sample*/*fastq.gz

      Code:
      #!/usr/bin/perl
      use warnings;
      use strict;
      
      my @files = @ARGV;
      
      my @groups = group_files(@files);
      
      foreach my $group (@groups) {
      
          warn "Writing to ".$group->{name}."\n";
      
          open OUT, '>', $group->{name} or die "Can't write to ".$group->{name}.": $!";
      
          foreach my $file (@{$group->{files}}) {
      
      	warn "Filtering $file\n";
      	
      	open (IN,"zcat $file |") or die "Can't read from $file: $!";
      
      	while (<IN>) {
      	    if (/:Y:/) {
      		$_ = <IN>;
      		$_ = <IN>;
      		$_ = <IN>;
      	    }
      	    else {
      		print OUT;
      		print OUT scalar <IN>;
      		print OUT scalar <IN>;
      		print OUT scalar <IN>;
      	    }
      	}
          }
      
          close OUT or die "Failed to write to ".$group->{name}.":$!";
      
          warn "Compressing ".$group->{name}."\n";
      
          system("gzip ".$group->{name}) == 0 or die "Failed to compress ".$group->{name}."\n";
      }
      
      sub group_files {
          my @files = @_;
      
          my %groups;
      
          foreach my $file (@files) {
      
      	my $basename = $file;
      
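      	# Strip the chunk number and .gz suffix to get the merged output name
      	$basename =~ s/_\d{3}\.fastq\.gz$/.fastq/;
      
      	if ($basename eq $file) {
      	    warn "'$file' didn't look like a casava file, skipping it\n";
      	    # Skip it: otherwise the output name equals the input file and it would be overwritten
      	    next;
      	}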
      
      	unless (exists $groups{$basename}) {
      	    $groups{$basename} = {name => $basename};
      	}
      
      	push @{$groups{$basename}->{files}},$file;
      
          }
      
          return values %groups;
      
      }
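
For anyone who prefers to keep the filtering step inside a shell pipeline, the same four-lines-per-record logic can be sketched with awk. This is an illustration rather than a replacement for the script above: `filter_failed_reads` is a made-up name, and the pattern assumes CASAVA 1.8 style headers in which the filter flag follows the read number after a space (e.g. `... 1:Y:0:ACGT`). It is slightly stricter than the plain `:Y:` match used in the Perl code.

```shell
#!/bin/sh
# Sketch: drop FASTQ records whose header carries the "filtered" flag (Y).
# Reads uncompressed FASTQ on stdin and writes the kept records to stdout.
# filter_failed_reads is a hypothetical helper name.
filter_failed_reads() {
    awk '
        # Every fourth line starting at line 1 is a record header.
        # Decide once per record whether to keep it.
        NR % 4 == 1 { keep = ($0 !~ / [0-9]+:Y:/) }
        # Print the header and the three lines after it only if kept.
        keep
    '
}
```

It would sit between the decompress and recompress steps, e.g. `gzip -dc lane1_NoIndex_L001_R1_0??.fastq.gz | filter_failed_reads | gzip > lane1_NoIndex_L001_R1.fastq.gz`.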


      • #4
        Related warning: trying to avoid the need to combine files with a large value of --fastq-cluster-count causes CASAVA 1.8 BCL conversion/demultiplexing -- not just ELAND -- to silently and unpredictably lose data. So I use simonandrew's approach of filtering files together. Remember to include undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
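
One simple sanity check for missing reads is to count the records in each set of files and compare the totals, for example R1 against R2 for the same lane, or the chunk files against the merged file when no filtering was applied. The sketch below is only that; `count_reads` is a made-up name, and it assumes well-formed FASTQ with exactly four lines per record. It cannot say how many clusters the run should have produced, so that number has to come from elsewhere.

```shell
#!/bin/sh
# Sketch: count FASTQ records across one or more gzipped files.
# count_reads is a hypothetical helper name.
count_reads() (
    # Decompress all files given as arguments and count the lines.
    lines=$(gzip -dc "$@" | wc -l)
    # Four lines per record in well-formed FASTQ.
    echo $((lines / 4))
)
```

For paired data, `count_reads lane1_NoIndex_L001_R1_0??.fastq.gz` and `count_reads lane1_NoIndex_L001_R2_0??.fastq.gz` should print the same number. A mismatch suggests that something went wrong upstream.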


        • #5
          Originally posted by Howie Goodell
          Related warning: trying to avoid the need to combine files with a large value of --fastq-cluster-count causes CASAVA 1.8 BCL conversion/demultiplexing -- not just ELAND -- to silently and unpredictably lose data. So I use simonandrew's approach of filtering files together. Remember to include undetermined_indices files from lanes that weren't actually multiplexed (or omit such indexes from SampleSheet.csv).
          Could you please explain this a little more? How did you discover that data was lost?
