Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Eugeni
    Junior Member
    • May 2009
    • 4

    FASTQ sequence converter

    Hi everyone
    Does anybody knows any program to convert a sff file from 454 data to fastq format incorporating information of sequence quality??; i have tried fq_all2std.pl from MAQ but does not incorporate information about sequence quality.
    Thanks
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    The MAQ fq_all2std.pl script does not support SFF files.

    The Roche off instrument applications are an "official" way, see:
    Pyrosequencing in picotiter plates, custom arrays for enrichment/decomplexing. (Roche)


    The open source Python tool sff_extract will do it too, it is used in MIRA:


    And in future I expect Biopython will offer this too.

    Update: Biopython 1.54 onwards (April 2010) can read and write SFF files, and this includes converting them to FASTQ, FASTA, QUAL, etc.
    Last edited by maubp; 11-29-2011, 03:45 PM.

    Comment

    • BaCh
      Member
      • May 2008
      • 81

      #3
      Originally posted by maubp View Post
      The open source Python tool sff_extract will do it too, it is used in MIRA:
      Did Jose add a FASTQ option to sff_extract? Last version of sff_extract exported as FASTA + FASTA quality file (which could then be converted into a FASTQ using convert_project from the MIRA package).

      I think I'll quickly add a FASTQ option to sff_extract if it's not already in ... I already played around with the idea for quite a while.

      B.

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Originally posted by BaCh View Post
        Did Jose add a FASTQ option to sff_extract? Last version of sff_extract exported as FASTA + FASTA quality file (which could then be converted into a FASTQ using convert_project from the MIRA package).

        I think I'll quickly add a FASTQ option to sff_extract if it's not already in ... I already played around with the idea for quite a while.

        B.
        Oh right - I assumed it was all done by sff_extract. But yes, adding FASTQ output would be great.

        Comment

        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.

          Code:
          #!/usr/bin/perl
          
          use warnings;
          use strict;
          use File::Basename;
          
          my $inFasta = $ARGV[0];
          my $baseName = basename($inFasta, qw/.fasta .fna/);
          my $inQual = $baseName . ".qual";
          my $outFastq = $baseName . ".fastq";
          
          my %seqs;
          
          $/ = ">";
          
          open (FASTA, "<$inFasta");
          my $junk = (<FASTA>);
          
          while (my $frecord = <FASTA>) {
          	chomp $frecord;
          	my ($fdef, @seqLines) = split /\n/, $frecord;
          	my $seq = join '', @seqLines;
          	$seqs{$fdef} = $seq;
          }
          
          close FASTA;
          
          open (QUAL, "<$inQual");
          $junk = <QUAL>;
          open (FASTQ, ">$outFastq");
          
          while (my $qrecord = <QUAL>) {
          	chomp $qrecord;
          	my ($qdef, @qualLines) = split /\n/, $qrecord;
          	my $qualString = join ' ', @qualLines;
          	my @quals = split / /, $qualString;
          	print FASTQ "@","$qdef\n";
          	print FASTQ "$seqs{$qdef}\n";
          	print FASTQ "+\n";
          	foreach my $qual (@quals) {
          		print FASTQ chr($qual + 33);
          	}
          	print FASTQ "\n";
          }
          
          close QUAL;
          close FASTQ;
          Usage notes:

          - Run the program just pass it the name of the fasta sequence file, e.g.

          Code:
          %> fastaQual2fastq.pl foo.fasta
          (assuming you saved the above code with the name 'fastaQual2fastq.pl')

          - The fasta filename must end in either .fasta or .fna

          - The quality filename must have the same basename as the fasta file and end with .qual. For example, if your sequence file is "foo.fna" then the quality file must be named "foo.qual".

          NOTE: Look downthread. There is a small bug in this script; a patched version is posted. Thanks to drio for finding the bug.
          Last edited by kmcarr; 10-22-2009, 07:24 PM. Reason: Bug Warning.

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by kmcarr View Post
            Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.
            If I'm reading that right (and I am not a Perl coder), you store the entire FASTA file in memory (as sequences keyed by ID), and then loop over the QUAL file. That isn't a great idea with large files.

            I can provide a Python script using Biopython 1.51+ which would be memory efficient for comparison if you like

            But anyway - having SFF to FASTQ handled by sff_extract would be very welcome

            Comment

            • kmcarr
              Senior Member
              • May 2008
              • 1181

              #7
              It was a conscious decision to use the hash in memory for the sequences. I understand that it would be much more efficient, memory wise, to stream the FASTA and QUAL files but there is big assumption to that design -- the order of the sequences in the FASTA and QUAL files must be identical. I wrote the script the way I did because I had a situation where the FASTA and QUAL files were not in the same order. By removing the assumption of identical order this design is more flexible.

              Yes the hash takes memory but by todays standards not that much. It takes ~1.3X the size of your FASTA file. I tested it on a FASTA file from a full 454 Titanium run which contained just under 900,000 sequences. The FASTA file is 296 MB; the memory usage by the script topped out at ~ 400MB. 400MB of RAM is a pittance these days.

              Comment

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Originally posted by kmcarr View Post
                It was a conscious decision to use the hash in memory for the sequences. I understand that it would be much more efficient, memory wise, to stream the FASTA and QUAL files but there is big assumption to that design -- the order of the sequences in the FASTA and QUAL files must be identical. I wrote the script the way I did because I had a situation where the FASTA and QUAL files were not in the same order. By removing the assumption of identical order this design is more flexible.
                Good point - although in the case of SFF output, this is a safe assumption to make. In fact, I would have the script check this assumption and verify the FASTA and QUAL files match up (same entries in same order with same sequence lengths).

                How did you get FASTA and QUAL files which were out of sync?

                Originally posted by kmcarr View Post
                Yes the hash takes memory but by todays standards not that much. It takes ~1.3X the size of your FASTA file. I tested it on a FASTA file from a full 454 Titanium run which contained just under 900,000 sequences. The FASTA file is 296 MB; the memory usage by the script topped out at ~ 400MB. 400MB of RAM is a pittance these days.
                Fair enough - assuming the machine isn't doing anything else memory demanding at the same time.

                Comment

                • kmcarr
                  Senior Member
                  • May 2008
                  • 1181

                  #9
                  Originally posted by maubp View Post
                  Good point - although in the case of SFF output, this is a safe assumption to make.
                  Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.
                  How did you get FASTA and QUAL files which were out of sync?
                  The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.

                  Comment

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Originally posted by kmcarr View Post
                    Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.
                    This is also a safe assumption for the Roche FASTA + QUAL files.
                    Originally posted by kmcarr View Post
                    The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.
                    Interesting - I wonder why they do that, and if it would be easy to fix their pipeline...

                    Comment

                    • kmcarr
                      Senior Member
                      • May 2008
                      • 1181

                      #11
                      Originally posted by maubp View Post
                      Interesting - I wonder why they do that, and if it would be easy to fix their pipeline...
                      The pipeline script (seqclean) is written in Perl so you could download it from the link above and check it out.
                      Last edited by kmcarr; 10-07-2009, 09:27 AM. Reason: Removed message text after discovering the cln2qual is perl, not binary.

                      Comment

                      • Eugeni
                        Junior Member
                        • May 2009
                        • 4

                        #12
                        Originally posted by kmcarr View Post
                        Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.

                        Code:
                        #!/usr/bin/perl
                        
                        use warnings;
                        use strict;
                        use File::Basename;
                        
                        my $inFasta = $ARGV[0];
                        my $baseName = basename($inFasta, qw/.fasta .fna/);
                        my $inQual = $baseName . ".qual";
                        my $outFastq = $baseName . ".fastq";
                        
                        my %seqs;
                        
                        $/ = ">";
                        
                        open (FASTA, "<$inFasta");
                        my $junk = (<FASTA>);
                        
                        while (my $frecord = <FASTA>) {
                        	chomp $frecord;
                        	my ($fdef, @seqLines) = split /\n/, $frecord;
                        	my $seq = join '', @seqLines;
                        	$seqs{$fdef} = $seq;
                        }
                        
                        close FASTA;
                        
                        open (QUAL, "<$inQual");
                        $junk = <QUAL>;
                        open (FASTQ, ">$outFastq");
                        
                        while (my $qrecord = <QUAL>) {
                        	chomp $qrecord;
                        	my ($qdef, @qualLines) = split /\n/, $qrecord;
                        	my $qualString = join ' ', @qualLines;
                        	my @quals = split / /, $qualString;
                        	print FASTQ "@","$qdef\n";
                        	print FASTQ "$seqs{$qdef}\n";
                        	print FASTQ "+\n";
                        	foreach my $qual (@quals) {
                        		print FASTQ chr($qual + 33);
                        	}
                        	print FASTQ "\n";
                        }
                        
                        close QUAL;
                        close FASTQ;
                        Usage notes:

                        - Run the program just pass it the name of the fasta sequence file, e.g.

                        Code:
                        %> fastaQual2fastq.pl foo.fasta
                        (assuming you saved the above code with the name 'fastaQual2fastq.pl')

                        - The fasta filename must end in either .fasta or .fna

                        - The quality filename must have the same basename as the fasta file and end with .qual. For example, if your sequence file is "foo.fna" then the quality file must be named "foo.qual".
                        Hi, kmcarr
                        Thanks for you help, the script has been worked wery well, has generated the fastq file in the sanger format, although in the stdout of the script gives this message:
                        Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
                        Dou you know what happens, if it is important?
                        Thanks a lot

                        Comment

                        • kmcarr
                          Senior Member
                          • May 2008
                          • 1181

                          #13
                          Originally posted by Eugeni View Post
                          Hi, kmcarr
                          Thanks for you help, the script has been worked wery well, has generated the fastq file in the sanger format, although in the stdout of the script gives this message:
                          Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
                          Dou you know what happens, if it is important?
                          Thanks a lot
                          Does the warning only appear once? How many entries are in your FASTA/QUAL files?

                          Comment

                          • Eugeni
                            Junior Member
                            • May 2009
                            • 4

                            #14
                            Originally posted by kmcarr View Post
                            Does the warning only appear once? How many entries are in your FASTA/QUAL files?
                            The warning appears associated to all sequences; i have 380185 fasta/qual entries

                            Comment

                            • maubp
                              Peter (Biopython etc)
                              • Jul 2009
                              • 1544

                              #15
                              Just a guess, but you could check your line endings (DOS/Windows versus Unix).

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                Yesterday, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...