Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Illumina's PRB file to FastQ

    Does anyone know of any programs that could convert a prb file that records quality scores to a fastQ file.

    Thanks,
    Farhat Habib

  • #2
    I haven't gotten it running yet on my system, but Maq seems to include such a functionality:

    Will SolexaPipeline generate the FASTQ file Maq needs? If not, what should I do?
    The Gerald module of the SolexaPipeline generates a FASTQ-like format, but it is not a FASTQ in fact. You should convert the format with "maq sol2sanger". This page gives more explanation. The latest SolexaPipeline-0.3.0 generates a new file in the "Export" format. Read sequences, qualities and most of alignment information is stored in this file
    From here...http://maq.sourceforge.net/qual.shtml

    Comment


    • #3
      Thanks for the information, I will check it out. In the meantime I got a working script from the Sanger Center to do the conversion.
      Farhat Habib

      Comment


      • #4
        If you have permission, post it!

        Comment


        • #5
          Originally posted by ECO View Post
          If you have permission, post it!
          Sure.

          There seems to be some problem with my posting it. I have tried several times since yesterday but I get a server unavailable message every time.
          Service Temporarily Unavailable

          The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
          Last edited by Farhat; 05-23-2008, 07:04 AM.
          Farhat Habib

          Comment


          • #6
            Hi Farhat,

            There is a bug in vBulletin and/or my server configuration that crops up with some characters. Try putting it within [ code ] [ / code ] tags before posting it, otherwise you can email it to me. I will PM my email address.

            Comment


            • #7
              Script to Convert Illumina to FASTQ

              Code:
              #!/usr/bin/perl -w
              #######################################################################
              # This software has been created by Genome Research Limited (GRL).    # 
              #                                                                     #
              # GRL hereby grants permission to use, copy, modify and distribute    # 
              # this software and its documentation for non-commercial purposes     # 
              # without fee at the user's own risk on the basis set out below.      #
              #                                                                     #
              # GRL neither undertakes nor accepts any duty whether contractual or  # 
              # otherwise in connection with the software, its use or the use of    # 
              # any derivative, and makes no representations or warranties, express #
              # or implied, concerning the software, its suitability, fitness for   #
              # a particular purpose or non-infringement.                           #
              #                                                                     #
              # In no event shall the authors of the software or GRL be responsible # 
              # or liable for any loss or damage whatsoever arising in any way      # 
              # directly or indirectly out of the use of this software or its       # 
              # derivatives, even if advised of the possibility of such damage.     #
              #                                                                     #
              # Our software can be freely distributed under the conditions set out # 
              # above, and must contain this copyright notice.                      #
              #######################################################################
              
              # Author: James Bonfield, March 2006
              #
              # Reads _seq.txt files and _prb.txt/_qcal.txt files in the main
              # Bustard directory to produce a fastq output file. Sequence names are
              # generated based on the input filename and their position in the
              # file. Optionally an additional 'command' may be supplied to run as a
              # filter to modify the name, eg to make the names unique across runs
              #
              # FH 5.21.08 Modified 'directory/filename hackery' to remove underscores 
              
              use strict;
              use Cwd;
              use Getopt::Long;
              
              # Argument parsing
              my %opts = ("trim" => 0, "calibrated" => 0, "quiet" => 0);
              
              $opts{gerald_dir} = [sort glob("GERALD*")]->[-1];
              
              exit unless GetOptions(\%opts, "trim=i", "name_sub=s", "calibrated|c",
                             "gerald_dir=s", "quiet|q");
              
              # Compute log-odds to fastq-scale mapping
              my @rescale;
              for (my $i = -100; $i < 100; $i++) {
                  $rescale[$i+100] = chr (041+int(10*log(1+10**($i/10))/log(10)+.499));
              }
              my %hmap = ('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3, '.' => 0);
              
              # Loop through all aligned sequence records;
              foreach my $fn (@ARGV) {
                  if (-d $fn) {
                  foreach (glob("$fn/*seq.txt")) {
                      process_file($_);
                  }
                  } else {
                  process_file($fn);
                  }
              }
              
              # Processes a single _seq.txt (the filename) and it's corresponding _prb.txt
              # quality file.
              sub process_file {
                  my ($in_fn) = @_;
                  local $"="";
              
                  print STDERR "processing $in_fn\n" unless $opts{quiet};
              
                  # Directory/filename hackery
                  $in_fn = getcwd() . "/$in_fn" if ($in_fn !~ /^\//);
                  my ($dir, $fn) = ($in_fn =~ m:(.*)/(.*):);
              
                  if ($fn !~ /(.*)seq.txt$/) {
                  print STDERR "Filename '$fn' is not a *seq.txt file\n";
                  return;
                  }
              
                  my $base = $1;
              
                  # Default 'op' is based on directory name, or $opts{name_sub}
                  my $old_dir = getcwd();
                  chdir($dir);
                  my @dirs = split('/', getcwd());
              
                  chdir($old_dir);
                  my ($machine, $run) = ("unknown", "unknown");
                  if (exists($dirs[-4]) &&
                  $dirs[-4] =~ m:^[0-9]+_slxa-([^-_]+).*[-_]([0-9]+)$:) {
                  ($machine, $run) = ($1, $2+0);
                  }    
                  my $op;
                  if (exists($opts{name_sub})) {
                  $op = $opts{name_sub};
                  } else {
                  $op = "\"IL${machine}_${run}_\${lane}_\${tile}_\${x}_\${y}\"";
                  }
              
                  # Open the files
                  my $qfn;
                  if ($opts{calibrated}) {
                  $qfn = "$opts{gerald_dir}/${base}_qcal.txt";
                  } else {
                  $qfn = "${base}prb.txt";
                  }
                  my $count = 0;
              
                  open (my $qualfh,  "<", "$dir/$qfn")
                  || die "Couldn't open prb file '$qfn': $@";
                  open (my $seqfh,   "<", "$dir/$fn") 
                  || die "Couldn't open seq file '$fn': $@";
              
                  # Format of seq is <lane> <tile> <xpos> <ypos> <sequence>
                  while (<$seqfh>) {
                  my $q = (<$qualfh>);
              
                  # Decode quality into 1 per called base
                  my ($lane,$tile,$x,$y,$seq) = split(/\s+/,$_);
                  my $i = 0;
                  my @qual;
                  if ($opts{calibrated}) {
                      chomp($q);
                      @qual = map {ord($_) - 64} split("", $q);
                  } else {
                      my @qa = split('\t', $q);
                      @qual = map {[split()]->[$hmap{substr($seq, $i++, 1)}]} @qa;
                  }
              
                  # Map from log-odds to Phred scale
                  @qual = map { $rescale[$_+100] } @qual;
              
                  ++$count;
                  $_ = eval $op;
                  die if $@;
              
                  $seq =~ s/[^ACGT]/N/gi;
                  print "\@$_\n" . substr($seq, $opts{trim}) .
                       "\n+\n" . substr("@qual", $opts{trim}) . "\n";
                  }
              
                  close($seqfh);
                  close($qualfh);
              }
              
              __END__
              
              =head1 NAME
              
              solexa2fastq - converts Bustard seq and prb files into fastq format
              
              =head1 SYNPOSIS
              
               solexa2fastq [options] s_1_1_seq.txt ...
               solexa2fastq [options] bustard_dir ...
              
              =head1 DESCRIPTION
              
              This converts a Bustard output files (I<*_seq.txt> and I<*_prb.txt>,
              and optionally GERALD*/*_qcal.txt) into Fastq format. Only the
              I<seq.txt> files need to be specified on the command line as the
              program will automatically read either the associated I<prb.txt> or
              I<qcal.txt> file. Alternatively specifying a directory acts as a
              synonym for reading all I<*_seq.txt> files from within that
              directory. Any number of files or directories may be specified, with
              all translations being concatenated and written to stdout.
              
              =head1 OPTIONS
              
              =over
              
              =item --trim <integer>
              
              This specifies how many bases to trim from the start of the
              sequence. This now defaults this is "0", but older runs will likely
              require "1". (Should this be specifying use-bases and an
              nYYYYY.. string instead to allow trimming off the other end too?)
              
              =item --name_sub <perlop>
              
              This is a perl expression applied to the generate the reading name.
              It may use the perl variables $machine, $run, $lane, $tile, $x, $y and
              $count with the the first two being computed from the current working
              directory and the latter being an auto-incremented counter. By default
              it uses "IL\${machine}_\${run}_\${lane}_\${tile}_\${x}_\${y}".
              
              For example to use the older solexa2fastq default of
              s_<lane>_<tile>_<counter> you'd specified:
              
               --name_sub '"s_${lane}_${tile}_${count}"'
              
              A more complex example could be:
              
               --name_sub 'sprintf("s_%d_%04d_%X_%X", $run, $tile, $x, $y)'
              
              =item --calibrated
              
              This boolean flag indicates that the quality values should be fetched
              from the *_qcal.txt files in the GERALD directory.
              
              =item --gerald_dir <directory name>
              
              This specifies the GERALD output directory to use with the --calibrated flag.
              Without it the last (sorted by name, not time) one found in the
              directory is used instead. Note that there is an implicit assumption that
              the Bustard directory is 3 levels lower down than our current working
              directory. (This is assumed by when writing the output files to "../../..")
              
              =back
              
              =head1 AUTHOR
              
              James Bonfield
              
              =cut
              Thanks Farhat, got it. Via trial and error that error you were getting was due to the "chr(" on line 49. Adding a space got rid of the error, weird. I've attached the original script to this post also. Please report if it works for you.
              Attached Files

              Comment


              • #8
                my result

                Originally posted by Farhat View Post
                Thanks for the information, I will check it out. In the meantime I got a working script from the Sanger Center to do the conversion.
                HI,using your script, I got a result like:
                @ILunknown_unknown_1_1_84_918
                TACTTGTTGATCTATATGTTACTTAATATAATGAAG
                +
                IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII0II%
                @ILunknown_unknown_1_1_334_568
                AGAGCGTGATAGTGCCATGCACAGACCTGTAGATAT
                +
                6EI+-$/I,%>E'8",I(IIB$I&5I,.<&%I&)4"
                I am puzled.and my prb.txt is like this:
                -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40
                and seq.txt is
                1 1 119 395 GAAGAGGAGATAAATAAAACTCAAAATACAGCTGAA

                Comment


                • #9
                  Besides the solexa2fastq.pl script you can also use the script fq_all2std.pl. This script is part of the standard MAQ distribution (look in the scripts/ subdirectory). It handles conversion of just about every Solexa file format to FASTA/FASTQ, including the .seq and .prb files.

                  Comment


                  • #10
                    Originally posted by biocc View Post
                    HI,using your script, I got a result like:
                    @ILunknown_unknown_1_1_84_918
                    TACTTGTTGATCTATATGTTACTTAATATAATGAAG
                    +
                    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII0II%
                    @ILunknown_unknown_1_1_334_568
                    AGAGCGTGATAGTGCCATGCACAGACCTGTAGATAT
                    +
                    6EI+-$/I,%>E'8",I(IIB$I&5I,.<&%I&)4"
                    I am puzled.and my prb.txt is like this:
                    -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40
                    and seq.txt is
                    1 1 119 395 GAAGAGGAGATAAATAAAACTCAAAATACAGCTGAA
                    That's all correct.

                    @name_of_sequence
                    sequence
                    +name_of _sequence, or left blank to keep the file smaller
                    quality of each letter

                    Is the fastq format. The highest number of the group of 4 is the quality score of that base (its position in the 4 tells you which letter it is). Then an ASCII letter stands for that number. From the letters you've got there, ! means a quality of 0, and I is 40, the best quality.

                    The unknowns are the machine and the date, but the rest of the name is the lane, the tile, and the X + Y coordiantes of the cluster.

                    Comment


                    • #11
                      The original doesn't seem to have that space after chr so it's likely some problem with mailing it or some daft word wrapping thing in a mail client.

                      If you're in the run folder then the automatically generated name should work, but it checks pwd to figure that out so copying the files elsewhere would break that. There's also some hideous hackery of supplying an anonymous function to generate the sequence names. When this was written Illumina hadn't produced any fastq files at all and even eland output didn't have sequence nanes, so we were a bit in the dark as to what names would look like. Hence the overcomplicated method in there :-)

                      James

                      Comment


                      • #12
                        Perhaps I've missed it, but nobody cited the 'ShortRead' package in Bioconductor.
                        Thought I didn't use it yet (hope to do so very soon ;-)) the manual claims that:
                        "ShortRead includes additional functions to facilitate input. For instance, read-
                        Prb reads Solexa _prb.txt files. These files contain base-specific quality in-
                        formation, and readPrb returns an SFastqQuality-class object representing the
                        fastq-encoded base-speci c quality scores of all reads."

                        Cheers
                        gabriele bucci

                        Comment


                        • #13
                          seq.txt
                          .................................................. ......................
                          6 1 914 893 GCTACTGCCGTGACCTCATTTCTCTTA
                          6 1 898 905 GAAAAAGAGAAAGTTTAGGAGATCGAT
                          .................................................. ...................................
                          prob.txt
                          .................................................. ...................................
                          -30 -30 30 -30 -30 30 -30 -30 -30 -30 -30 30 30 -30 -30
                          -30 -30 30 -30 -30 -30 -30 -30 30 -30 -30 30 -30 -30 3
                          0 -30 -30 -30 30 -30 -30 -30 -30 30 -30 -30 -30 -30 30
                          -30 -30 30 -30 30 -30 -30 -30 -30 30 -30 -30 -30 30 -30
                          -30 -30 -30 -30...
                          .................................................. ...........................

                          but when i run
                          fq_all2std.pl seqprb2std seq.txt prb.txt
                          The output is following,
                          ...........................................
                          @6:1:914:893
                          GCTACTGCCGTGACCTCATTTCTCTTA
                          +
                          ???????????????????????????
                          ..................................................

                          but i have lots warning messages, like following:

                          Use of uninitialized value in array element at solexa2fastq.pl line 128, <$qualfh> line 6609.
                          Argument "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..." isn't numeric in addition (+) at solexa2fastq.pl line 132, <$qualfh> line 6609.

                          i wonder if any of you had met similar problems before, if so, where do you think the problem is?
                          now i am checking prb.txt, hope i can find out whats wrong with my input.
                          Last edited by hannat; 01-26-2009, 04:02 AM.

                          Comment


                          • #14
                            fq_all2std.pl Errors.

                            UPDATE: My simple error, the prb had somehow become truncated. Should have checked before posting, my apologies.

                            I tried maq's fq_all2std.pl script, but having initial problems

                            Executing
                            Code:
                            fq_all2std.pl seqprb2std foo.seq.txt bar.prb.txt > sn4.fastq
                            seems to run well for a while, but then a get a flood of
                            Code:
                            ..
                            Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
                            Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
                            Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
                            ..
                            I'm pretty sure those blank lines are supposed to filled with quality information.
                            Last edited by robsyme; 05-23-2009, 03:49 AM. Reason: My error - file had become truncated

                            Comment


                            • #15
                              Hello everybody,

                              like some others here I have some problems to convert my Solexa reads into an appropriate format for Maq.
                              I tried the scipts fq_all2std.pl and the Script of Farhad (#7) but nothing is working. Here is an example of my files:
                              *.seq.txt
                              >KN-964_02-2424_07-03179_1_1_1_1001_230
                              GTGCTAA ... (36 sites)

                              *.qual.txt
                              >KN-964_02-2424_07-03179_1_1_1_1001_230
                              32 32 32 32 32 32 32 ...

                              The sequencing was done in 2007. The command seqprb2std is not contained in fq_all2std.pl, is it? We did an other sequencing (paired-end) in 2008/2009 where I got a new format:
                              >KN-1413_07-00010:6:1:727:174/1: GTGTG...:40 40 40 40 40

                              First I thought this is the new Illumina 1.3 format and tried to convert it with a script but it doesn't work, too.

                              So what can I do? Could someone give me an example of a working input and tell me what I have to prompt in the command line exactly.

                              Thank you very much and a nice weekend to all!
                              Janina

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X