Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Farhat
    Member
    • Apr 2008
    • 21

    Illumina's PRB file to FastQ

    Does anyone know of any programs that could convert a prb file that records quality scores to a fastQ file.

    Thanks,
    Farhat Habib
  • ECO
    --Site Admin--
    • Oct 2007
    • 1360

    #2
    I haven't gotten it running yet on my system, but Maq seems to include such a functionality:

    Will SolexaPipeline generate the FASTQ file Maq needs? If not, what should I do?
    The Gerald module of the SolexaPipeline generates a FASTQ-like format, but it is not a FASTQ in fact. You should convert the format with "maq sol2sanger". This page gives more explanation. The latest SolexaPipeline-0.3.0 generates a new file in the "Export" format. Read sequences, qualities and most of alignment information is stored in this file
    From here...http://maq.sourceforge.net/qual.shtml

    Comment

    • Farhat
      Member
      • Apr 2008
      • 21

      #3
      Thanks for the information, I will check it out. In the meantime I got a working script from the Sanger Center to do the conversion.
      Farhat Habib

      Comment

      • ECO
        --Site Admin--
        • Oct 2007
        • 1360

        #4
        If you have permission, post it!

        Comment

        • Farhat
          Member
          • Apr 2008
          • 21

          #5
          Originally posted by ECO View Post
          If you have permission, post it!
          Sure.

          There seems to be some problem with my posting it. I have tried several times since yesterday but I get a server unavailable message every time.
          Service Temporarily Unavailable

          The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
          Last edited by Farhat; 05-23-2008, 07:04 AM.
          Farhat Habib

          Comment

          • ECO
            --Site Admin--
            • Oct 2007
            • 1360

            #6
            Hi Farhat,

            There is a bug in vBulletin and/or my server configuration that crops up with some characters. Try putting it within [ code ] [ / code ] tags before posting it, otherwise you can email it to me. I will PM my email address.

            Comment

            • ECO
              --Site Admin--
              • Oct 2007
              • 1360

              #7
              Script to Convert Illumina to FASTQ

              Code:
              #!/usr/bin/perl -w
              #######################################################################
              # This software has been created by Genome Research Limited (GRL).    # 
              #                                                                     #
              # GRL hereby grants permission to use, copy, modify and distribute    # 
              # this software and its documentation for non-commercial purposes     # 
              # without fee at the user's own risk on the basis set out below.      #
              #                                                                     #
              # GRL neither undertakes nor accepts any duty whether contractual or  # 
              # otherwise in connection with the software, its use or the use of    # 
              # any derivative, and makes no representations or warranties, express #
              # or implied, concerning the software, its suitability, fitness for   #
              # a particular purpose or non-infringement.                           #
              #                                                                     #
              # In no event shall the authors of the software or GRL be responsible # 
              # or liable for any loss or damage whatsoever arising in any way      # 
              # directly or indirectly out of the use of this software or its       # 
              # derivatives, even if advised of the possibility of such damage.     #
              #                                                                     #
              # Our software can be freely distributed under the conditions set out # 
              # above, and must contain this copyright notice.                      #
              #######################################################################
              
              # Author: James Bonfield, March 2006
              #
              # Reads _seq.txt files and _prb.txt/_qcal.txt files in the main
              # Bustard directory to produce a fastq output file. Sequence names are
              # generated based on the input filename and their position in the
              # file. Optionally an additional 'command' may be supplied to run as a
              # filter to modify the name, eg to make the names unique across runs
              #
              # FH 5.21.08 Modified 'directory/filename hackery' to remove underscores 
              
              use strict;
              use Cwd;
              use Getopt::Long;
              
              # Argument parsing
              my %opts = ("trim" => 0, "calibrated" => 0, "quiet" => 0);
              
              $opts{gerald_dir} = [sort glob("GERALD*")]->[-1];
              
              exit unless GetOptions(\%opts, "trim=i", "name_sub=s", "calibrated|c",
                             "gerald_dir=s", "quiet|q");
              
              # Compute log-odds to fastq-scale mapping
              my @rescale;
              for (my $i = -100; $i < 100; $i++) {
                  $rescale[$i+100] = chr (041+int(10*log(1+10**($i/10))/log(10)+.499));
              }
              my %hmap = ('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3, '.' => 0);
              
              # Loop through all aligned sequence records;
              foreach my $fn (@ARGV) {
                  if (-d $fn) {
                  foreach (glob("$fn/*seq.txt")) {
                      process_file($_);
                  }
                  } else {
                  process_file($fn);
                  }
              }
              
              # Processes a single _seq.txt (the filename) and it's corresponding _prb.txt
              # quality file.
              sub process_file {
                  my ($in_fn) = @_;
                  local $"="";
              
                  print STDERR "processing $in_fn\n" unless $opts{quiet};
              
                  # Directory/filename hackery
                  $in_fn = getcwd() . "/$in_fn" if ($in_fn !~ /^\//);
                  my ($dir, $fn) = ($in_fn =~ m:(.*)/(.*):);
              
                  if ($fn !~ /(.*)seq.txt$/) {
                  print STDERR "Filename '$fn' is not a *seq.txt file\n";
                  return;
                  }
              
                  my $base = $1;
              
                  # Default 'op' is based on directory name, or $opts{name_sub}
                  my $old_dir = getcwd();
                  chdir($dir);
                  my @dirs = split('/', getcwd());
              
                  chdir($old_dir);
                  my ($machine, $run) = ("unknown", "unknown");
                  if (exists($dirs[-4]) &&
                  $dirs[-4] =~ m:^[0-9]+_slxa-([^-_]+).*[-_]([0-9]+)$:) {
                  ($machine, $run) = ($1, $2+0);
                  }    
                  my $op;
                  if (exists($opts{name_sub})) {
                  $op = $opts{name_sub};
                  } else {
                  $op = "\"IL${machine}_${run}_\${lane}_\${tile}_\${x}_\${y}\"";
                  }
              
                  # Open the files
                  my $qfn;
                  if ($opts{calibrated}) {
                  $qfn = "$opts{gerald_dir}/${base}_qcal.txt";
                  } else {
                  $qfn = "${base}prb.txt";
                  }
                  my $count = 0;
              
                  open (my $qualfh,  "<", "$dir/$qfn")
                  || die "Couldn't open prb file '$qfn': $@";
                  open (my $seqfh,   "<", "$dir/$fn") 
                  || die "Couldn't open seq file '$fn': $@";
              
                  # Format of seq is <lane> <tile> <xpos> <ypos> <sequence>
                  while (<$seqfh>) {
                  my $q = (<$qualfh>);
              
                  # Decode quality into 1 per called base
                  my ($lane,$tile,$x,$y,$seq) = split(/\s+/,$_);
                  my $i = 0;
                  my @qual;
                  if ($opts{calibrated}) {
                      chomp($q);
                      @qual = map {ord($_) - 64} split("", $q);
                  } else {
                      my @qa = split('\t', $q);
                      @qual = map {[split()]->[$hmap{substr($seq, $i++, 1)}]} @qa;
                  }
              
                  # Map from log-odds to Phred scale
                  @qual = map { $rescale[$_+100] } @qual;
              
                  ++$count;
                  $_ = eval $op;
                  die if $@;
              
                  $seq =~ s/[^ACGT]/N/gi;
                  print "\@$_\n" . substr($seq, $opts{trim}) .
                       "\n+\n" . substr("@qual", $opts{trim}) . "\n";
                  }
              
                  close($seqfh);
                  close($qualfh);
              }
              
              __END__
              
              =head1 NAME
              
              solexa2fastq - converts Bustard seq and prb files into fastq format
              
              =head1 SYNPOSIS
              
               solexa2fastq [options] s_1_1_seq.txt ...
               solexa2fastq [options] bustard_dir ...
              
              =head1 DESCRIPTION
              
              This converts a Bustard output files (I<*_seq.txt> and I<*_prb.txt>,
              and optionally GERALD*/*_qcal.txt) into Fastq format. Only the
              I<seq.txt> files need to be specified on the command line as the
              program will automatically read either the associated I<prb.txt> or
              I<qcal.txt> file. Alternatively specifying a directory acts as a
              synonym for reading all I<*_seq.txt> files from within that
              directory. Any number of files or directories may be specified, with
              all translations being concatenated and written to stdout.
              
              =head1 OPTIONS
              
              =over
              
              =item --trim <integer>
              
              This specifies how many bases to trim from the start of the
              sequence. This now defaults this is "0", but older runs will likely
              require "1". (Should this be specifying use-bases and an
              nYYYYY.. string instead to allow trimming off the other end too?)
              
              =item --name_sub <perlop>
              
              This is a perl expression applied to the generate the reading name.
              It may use the perl variables $machine, $run, $lane, $tile, $x, $y and
              $count with the the first two being computed from the current working
              directory and the latter being an auto-incremented counter. By default
              it uses "IL\${machine}_\${run}_\${lane}_\${tile}_\${x}_\${y}".
              
              For example to use the older solexa2fastq default of
              s_<lane>_<tile>_<counter> you'd specified:
              
               --name_sub '"s_${lane}_${tile}_${count}"'
              
              A more complex example could be:
              
               --name_sub 'sprintf("s_%d_%04d_%X_%X", $run, $tile, $x, $y)'
              
              =item --calibrated
              
              This boolean flag indicates that the quality values should be fetched
              from the *_qcal.txt files in the GERALD directory.
              
              =item --gerald_dir <directory name>
              
              This specifies the GERALD output directory to use with the --calibrated flag.
              Without it the last (sorted by name, not time) one found in the
              directory is used instead. Note that there is an implicit assumption that
              the Bustard directory is 3 levels lower down than our current working
              directory. (This is assumed by when writing the output files to "../../..")
              
              =back
              
              =head1 AUTHOR
              
              James Bonfield
              
              =cut
              Thanks Farhat, got it. Via trial and error that error you were getting was due to the "chr(" on line 49. Adding a space got rid of the error, weird. I've attached the original script to this post also. Please report if it works for you.
              Attached Files

              Comment

              • biocc
                Member
                • Jul 2008
                • 36

                #8
                my result

                Originally posted by Farhat View Post
                Thanks for the information, I will check it out. In the meantime I got a working script from the Sanger Center to do the conversion.
                HI,using your script, I got a result like:
                @ILunknown_unknown_1_1_84_918
                TACTTGTTGATCTATATGTTACTTAATATAATGAAG
                +
                IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII0II%
                @ILunknown_unknown_1_1_334_568
                AGAGCGTGATAGTGCCATGCACAGACCTGTAGATAT
                +
                6EI+-$/I,%>E'8",I(IIB$I&5I,.<&%I&)4"
                I am puzled.and my prb.txt is like this:
                -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40
                and seq.txt is
                1 1 119 395 GAAGAGGAGATAAATAAAACTCAAAATACAGCTGAA

                Comment

                • kmcarr
                  Senior Member
                  • May 2008
                  • 1181

                  #9
                  Besides the solexa2fastq.pl script you can also use the script fq_all2std.pl. This script is part of the standard MAQ distribution (look in the scripts/ subdirectory). It handles conversion of just about every Solexa file format to FASTA/FASTQ, including the .seq and .prb files.

                  Comment

                  • swbarnes2
                    Senior Member
                    • May 2008
                    • 910

                    #10
                    Originally posted by biocc View Post
                    HI,using your script, I got a result like:
                    @ILunknown_unknown_1_1_84_918
                    TACTTGTTGATCTATATGTTACTTAATATAATGAAG
                    +
                    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII0II%
                    @ILunknown_unknown_1_1_334_568
                    AGAGCGTGATAGTGCCATGCACAGACCTGTAGATAT
                    +
                    6EI+-$/I,%>E'8",I(IIB$I&5I,.<&%I&)4"
                    I am puzled.and my prb.txt is like this:
                    -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40
                    and seq.txt is
                    1 1 119 395 GAAGAGGAGATAAATAAAACTCAAAATACAGCTGAA
                    That's all correct.

                    @name_of_sequence
                    sequence
                    +name_of _sequence, or left blank to keep the file smaller
                    quality of each letter

                    Is the fastq format. The highest number of the group of 4 is the quality score of that base (its position in the 4 tells you which letter it is). Then an ASCII letter stands for that number. From the letters you've got there, ! means a quality of 0, and I is 40, the best quality.

                    The unknowns are the machine and the date, but the rest of the name is the lane, the tile, and the X + Y coordiantes of the cluster.

                    Comment

                    • jkbonfield
                      Senior Member
                      • Jul 2008
                      • 146

                      #11
                      The original doesn't seem to have that space after chr so it's likely some problem with mailing it or some daft word wrapping thing in a mail client.

                      If you're in the run folder then the automatically generated name should work, but it checks pwd to figure that out so copying the files elsewhere would break that. There's also some hideous hackery of supplying an anonymous function to generate the sequence names. When this was written Illumina hadn't produced any fastq files at all and even eland output didn't have sequence nanes, so we were a bit in the dark as to what names would look like. Hence the overcomplicated method in there :-)

                      James

                      Comment

                      • olus
                        Member
                        • Aug 2008
                        • 22

                        #12
                        Perhaps I've missed it, but nobody cited the 'ShortRead' package in Bioconductor.
                        Thought I didn't use it yet (hope to do so very soon ;-)) the manual claims that:
                        "ShortRead includes additional functions to facilitate input. For instance, read-
                        Prb reads Solexa _prb.txt files. These files contain base-specific quality in-
                        formation, and readPrb returns an SFastqQuality-class object representing the
                        fastq-encoded base-speci c quality scores of all reads."

                        Cheers
                        gabriele bucci

                        Comment

                        • hannat
                          Member
                          • Jan 2009
                          • 16

                          #13
                          seq.txt
                          .................................................. ......................
                          6 1 914 893 GCTACTGCCGTGACCTCATTTCTCTTA
                          6 1 898 905 GAAAAAGAGAAAGTTTAGGAGATCGAT
                          .................................................. ...................................
                          prob.txt
                          .................................................. ...................................
                          -30 -30 30 -30 -30 30 -30 -30 -30 -30 -30 30 30 -30 -30
                          -30 -30 30 -30 -30 -30 -30 -30 30 -30 -30 30 -30 -30 3
                          0 -30 -30 -30 30 -30 -30 -30 -30 30 -30 -30 -30 -30 30
                          -30 -30 30 -30 30 -30 -30 -30 -30 30 -30 -30 -30 30 -30
                          -30 -30 -30 -30...
                          .................................................. ...........................

                          but when i run
                          fq_all2std.pl seqprb2std seq.txt prb.txt
                          The output is following,
                          ...........................................
                          @6:1:914:893
                          GCTACTGCCGTGACCTCATTTCTCTTA
                          +
                          ???????????????????????????
                          ..................................................

                          but i have lots warning messages, like following:

                          Use of uninitialized value in array element at solexa2fastq.pl line 128, <$qualfh> line 6609.
                          Argument "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..." isn't numeric in addition (+) at solexa2fastq.pl line 132, <$qualfh> line 6609.

                          i wonder if any of you had met similar problems before, if so, where do you think the problem is?
                          now i am checking prb.txt, hope i can find out whats wrong with my input.
                          Last edited by hannat; 01-26-2009, 04:02 AM.

                          Comment

                          • robsyme
                            Junior Member
                            • Jan 2009
                            • 6

                            #14
                            fq_all2std.pl Errors.

                            UPDATE: My simple error, the prb had somehow become truncated. Should have checked before posting, my apologies.

                            I tried maq's fq_all2std.pl script, but having initial problems

                            Executing
                            Code:
                            fq_all2std.pl seqprb2std foo.seq.txt bar.prb.txt > sn4.fastq
                            seems to run well for a while, but then a get a flood of
                            Code:
                            ..
                            Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
                            Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
                            Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
                            ..
                            I'm pretty sure those blank lines are supposed to filled with quality information.
                            Last edited by robsyme; 05-23-2009, 03:49 AM. Reason: My error - file had become truncated

                            Comment

                            • Oxygen81
                              Member
                              • Aug 2009
                              • 11

                              #15
                              Hello everybody,

                              like some others here I have some problems to convert my Solexa reads into an appropriate format for Maq.
                              I tried the scipts fq_all2std.pl and the Script of Farhad (#7) but nothing is working. Here is an example of my files:
                              *.seq.txt
                              >KN-964_02-2424_07-03179_1_1_1_1001_230
                              GTGCTAA ... (36 sites)

                              *.qual.txt
                              >KN-964_02-2424_07-03179_1_1_1_1001_230
                              32 32 32 32 32 32 32 ...

                              The sequencing was done in 2007. The command seqprb2std is not contained in fq_all2std.pl, is it? We did an other sequencing (paired-end) in 2008/2009 where I got a new format:
                              >KN-1413_07-00010:6:1:727:174/1: GTGTG...:40 40 40 40 40

                              First I thought this is the new Illumina 1.3 format and tried to convert it with a script but it doesn't work, too.

                              So what can I do? Could someone give me an example of a working input and tell me what I have to prompt in the command line exactly.

                              Thank you very much and a nice weekend to all!
                              Janina

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 08:59 AM
                              0 responses
                              8 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              30 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...