11-11-2010, 11:26 PM   #8
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

If it's useful to anyone, this is a small script I knocked up when we had to process some fastq files that were corrupted during an FTP transfer. You can pipe data through it and it runs some basic sanity checks on each four-line record to make sure it looks like valid fastq data. Entries that look broken are dropped with a warning on STDERR, leaving just the good stuff on STDOUT.

Code:
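#!/usr/bin/perl
use warnings;
use strict;

# Reads fastq data from STDIN (or from files named on the command line),
# prints the records that pass every check to STDOUT and reports the
# rejected ones on STDERR.

while (<>) {

  # A record must start with an @ header. Skipping line by line here is
  # also what gets us back in step after a broken record.
  unless (/^\@/) {
    chomp;
    warn "'$_' should have had an \@ at the start and it didn't\n";
    next;
  }
  my $id1 = $_;

  # Stop at a record cut short by the end of the input. Reading <> again
  # after it has returned undef would start reading from STDIN.
  my ($seq, $id2, $qual);
  unless (defined($seq = <>) and defined($id2 = <>) and defined($qual = <>)) {
    chomp $id1;
    warn "Record '$id1' was cut short by the end of the input\n";
    last;
  }

  # Newline-free copies for the messages and the length comparison.
  my ($seq_text, $id2_text, $qual_text) = ($seq, $id2, $qual);
  chomp($seq_text, $id2_text, $qual_text);

  if ($seq =~ /^[\@+]/) {
    warn "Sequence '$seq_text' looked like an id";
    next;
  }
  # Note: a valid Phred+33 quality string can begin with @ or +, so this
  # check can also reject the occasional good record.
  if ($qual =~ /^[\@+]/) {
    warn "Quality '$qual_text' looked like an id";
    next;
  }
  if ($id2 !~ /^\+/) {
    warn "Midline '$id2_text' didn't start with a +";
    next;
  }

  # A long run of base letters in the quality line suggests the lines
  # have slipped out of step.
  if ($qual =~ /[GATCN]{20,}/) {
    warn "Quality '$qual_text' looked like sequence";
    next;
  }

  if (length($seq_text) != length($qual_text)) {
    warn "Seq $seq_text and Qual $qual_text weren't the same length";
    next;
  }

  print $id1,$seq,$id2,$qual;

}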
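
For anyone unsure how to pipe data through it, here is a rough sketch of one way to do it from the shell. It is only an illustration under a few assumptions: fastq_check.pl is a made-up file name for the script above, assumed to be saved in the current directory and made executable with chmod +x, and clean_fastq and the file names in the example call are hypothetical too.

```shell
#!/bin/sh
# Sketch only: fastq_check.pl is a hypothetical name for the script above,
# assumed to be in the current directory and executable (chmod +x).

# clean_fastq INPUT OUTPUT LOG [CHECKER]
# Streams INPUT (plain or gzip-compressed, chosen by file extension) through
# the checker. Records that pass go to OUTPUT, the warnings go to LOG.
clean_fastq() {
  input=$1
  output=$2
  log=$3
  checker=${4:-./fastq_check.pl}

  case "$input" in
    *.gz) gzip -dc "$input" ;;   # decompress to stdout
    *)    cat "$input" ;;        # plain text, pass through as is
  esac | "$checker" > "$output" 2> "$log"
}

# Example call (file names are placeholders):
#   clean_fastq reads.fastq.gz reads.clean.fastq reads.rejected.log
```

Keeping the warnings in a separate log file means the cleaned output contains only fastq records, and the log can be checked afterwards to see how many entries were thrown away and why.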