SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
SAM to FASTQ converter - Picard sdm Bioinformatics 14 03-19-2013 11:19 AM
Problems with the illumina .fastq sequence data annotation tractorsazi Bioinformatics 3 01-30-2012 07:50 AM
Sorting fastq by primers, then searching by sequence (with mismatches) jme Bioinformatics 0 01-18-2012 10:25 AM
dna sequence to protein converter wijerasa Bioinformatics 1 01-14-2011 12:40 AM
Converting Solexa FASTQ file to unique sequence tags DrD2009 Bioinformatics 16 08-09-2010 12:30 AM

Reply
 
Thread Tools
Old 10-07-2009, 04:03 AM   #1
Eugeni
Junior Member
 
Location: spain

Join Date: May 2009
Posts: 4
Default FASTQ sequence converter

Hi everyone
Does anybody knows any program to convert a sff file from 454 data to fastq format incorporating information of sequence quality??; i have tried fq_all2std.pl from MAQ but does not incorporate information about sequence quality.
Thanks
Eugeni is offline   Reply With Quote
Old 10-07-2009, 05:27 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

The MAQ fq_all2std.pl script does not support SFF files.

The Roche off instrument applications are an "official" way, see:
http://seqanswers.com/forums/showthread.php?t=114

The open source Python tool sff_extract will do it too, it is used in MIRA:
http://bioinf.comav.upv.es/sff_extract/index.html

And in future I expect Biopython will offer this too.

Update: Biopython 1.54 onwards (April 2010) can read and write SFF files, and this includes converting them to FASTQ, FASTA, QUAL, etc.
http://news.open-bio.org/news/2010/0...n-release-154/

Last edited by maubp; 11-29-2011 at 03:45 PM.
maubp is offline   Reply With Quote
Old 10-07-2009, 05:41 AM   #3
BaCh
Member
 
Location: Germany

Join Date: May 2008
Posts: 79
Default

Quote:
Originally Posted by maubp View Post
The open source Python tool sff_extract will do it too, it is used in MIRA:
http://bioinf.comav.upv.es/sff_extract/index.html
Did Jose add a FASTQ option to sff_extract? Last version of sff_extract exported as FASTA + FASTA quality file (which could then be converted into a FASTQ using convert_project from the MIRA package).

I think I'll quickly add a FASTQ option to sff_extract if it's not already in ... I already played around with the idea for quite a while.

B.
BaCh is offline   Reply With Quote
Old 10-07-2009, 05:46 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by BaCh View Post
Did Jose add a FASTQ option to sff_extract? Last version of sff_extract exported as FASTA + FASTA quality file (which could then be converted into a FASTQ using convert_project from the MIRA package).

I think I'll quickly add a FASTQ option to sff_extract if it's not already in ... I already played around with the idea for quite a while.

B.
Oh right - I assumed it was all done by sff_extract. But yes, adding FASTQ output would be great.
maubp is offline   Reply With Quote
Old 10-07-2009, 05:55 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.

Code:
#!/usr/bin/perl

use warnings;
use strict;
use File::Basename;

my $inFasta = $ARGV[0];
my $baseName = basename($inFasta, qw/.fasta .fna/);
my $inQual = $baseName . ".qual";
my $outFastq = $baseName . ".fastq";

my %seqs;

$/ = ">";

open (FASTA, "<$inFasta");
my $junk = (<FASTA>);

while (my $frecord = <FASTA>) {
	chomp $frecord;
	my ($fdef, @seqLines) = split /\n/, $frecord;
	my $seq = join '', @seqLines;
	$seqs{$fdef} = $seq;
}

close FASTA;

open (QUAL, "<$inQual");
$junk = <QUAL>;
open (FASTQ, ">$outFastq");

while (my $qrecord = <QUAL>) {
	chomp $qrecord;
	my ($qdef, @qualLines) = split /\n/, $qrecord;
	my $qualString = join ' ', @qualLines;
	my @quals = split / /, $qualString;
	print FASTQ "@","$qdef\n";
	print FASTQ "$seqs{$qdef}\n";
	print FASTQ "+\n";
	foreach my $qual (@quals) {
		print FASTQ chr($qual + 33);
	}
	print FASTQ "\n";
}

close QUAL;
close FASTQ;
Usage notes:

- Run the program just pass it the name of the fasta sequence file, e.g.

Code:
%> fastaQual2fastq.pl foo.fasta
(assuming you saved the above code with the name 'fastaQual2fastq.pl')

- The fasta filename must end in either .fasta or .fna

- The quality filename must have the same basename as the fasta file and end with .qual. For example, if your sequence file is "foo.fna" then the quality file must be named "foo.qual".

NOTE: Look downthread. There is a small bug in this script; a patched version is posted. Thanks to drio for finding the bug.

Last edited by kmcarr; 10-22-2009 at 08:24 PM. Reason: Bug Warning.
kmcarr is offline   Reply With Quote
Old 10-07-2009, 06:05 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by kmcarr View Post
Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.
If I'm reading that right (and I am not a Perl coder), you store the entire FASTA file in memory (as sequences keyed by ID), and then loop over the QUAL file. That isn't a great idea with large files.

I can provide a Python script using Biopython 1.51+ which would be memory efficient for comparison if you like

But anyway - having SFF to FASTQ handled by sff_extract would be very welcome
maubp is offline   Reply With Quote
Old 10-07-2009, 06:44 AM   #7
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

It was a conscious decision to use the hash in memory for the sequences. I understand that it would be much more efficient, memory wise, to stream the FASTA and QUAL files but there is big assumption to that design -- the order of the sequences in the FASTA and QUAL files must be identical. I wrote the script the way I did because I had a situation where the FASTA and QUAL files were not in the same order. By removing the assumption of identical order this design is more flexible.

Yes the hash takes memory but by todays standards not that much. It takes ~1.3X the size of your FASTA file. I tested it on a FASTA file from a full 454 Titanium run which contained just under 900,000 sequences. The FASTA file is 296 MB; the memory usage by the script topped out at ~ 400MB. 400MB of RAM is a pittance these days.
kmcarr is offline   Reply With Quote
Old 10-07-2009, 06:54 AM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by kmcarr View Post
It was a conscious decision to use the hash in memory for the sequences. I understand that it would be much more efficient, memory wise, to stream the FASTA and QUAL files but there is big assumption to that design -- the order of the sequences in the FASTA and QUAL files must be identical. I wrote the script the way I did because I had a situation where the FASTA and QUAL files were not in the same order. By removing the assumption of identical order this design is more flexible.
Good point - although in the case of SFF output, this is a safe assumption to make. In fact, I would have the script check this assumption and verify the FASTA and QUAL files match up (same entries in same order with same sequence lengths).

How did you get FASTA and QUAL files which were out of sync?

Quote:
Originally Posted by kmcarr View Post
Yes the hash takes memory but by todays standards not that much. It takes ~1.3X the size of your FASTA file. I tested it on a FASTA file from a full 454 Titanium run which contained just under 900,000 sequences. The FASTA file is 296 MB; the memory usage by the script topped out at ~ 400MB. 400MB of RAM is a pittance these days.
Fair enough - assuming the machine isn't doing anything else memory demanding at the same time.
maubp is offline   Reply With Quote
Old 10-07-2009, 09:42 AM   #9
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Quote:
Originally Posted by maubp View Post
Good point - although in the case of SFF output, this is a safe assumption to make.
Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.
Quote:
How did you get FASTA and QUAL files which were out of sync?
The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.
kmcarr is offline   Reply With Quote
Old 10-07-2009, 09:46 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by kmcarr View Post
Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.
This is also a safe assumption for the Roche FASTA + QUAL files.
Quote:
Originally Posted by kmcarr View Post
The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.
Interesting - I wonder why they do that, and if it would be easy to fix their pipeline...
maubp is offline   Reply With Quote
Old 10-07-2009, 10:03 AM   #11
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Quote:
Originally Posted by maubp View Post
Interesting - I wonder why they do that, and if it would be easy to fix their pipeline...
The pipeline script (seqclean) is written in Perl so you could download it from the link above and check it out.

Last edited by kmcarr; 10-07-2009 at 10:27 AM. Reason: Removed message text after discovering the cln2qual is perl, not binary.
kmcarr is offline   Reply With Quote
Old 10-08-2009, 01:37 AM   #12
Eugeni
Junior Member
 
Location: spain

Join Date: May 2009
Posts: 4
Default

Quote:
Originally Posted by kmcarr View Post
Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.

Code:
#!/usr/bin/perl

use warnings;
use strict;
use File::Basename;

my $inFasta = $ARGV[0];
my $baseName = basename($inFasta, qw/.fasta .fna/);
my $inQual = $baseName . ".qual";
my $outFastq = $baseName . ".fastq";

my %seqs;

$/ = ">";

open (FASTA, "<$inFasta");
my $junk = (<FASTA>);

while (my $frecord = <FASTA>) {
	chomp $frecord;
	my ($fdef, @seqLines) = split /\n/, $frecord;
	my $seq = join '', @seqLines;
	$seqs{$fdef} = $seq;
}

close FASTA;

open (QUAL, "<$inQual");
$junk = <QUAL>;
open (FASTQ, ">$outFastq");

while (my $qrecord = <QUAL>) {
	chomp $qrecord;
	my ($qdef, @qualLines) = split /\n/, $qrecord;
	my $qualString = join ' ', @qualLines;
	my @quals = split / /, $qualString;
	print FASTQ "@","$qdef\n";
	print FASTQ "$seqs{$qdef}\n";
	print FASTQ "+\n";
	foreach my $qual (@quals) {
		print FASTQ chr($qual + 33);
	}
	print FASTQ "\n";
}

close QUAL;
close FASTQ;
Usage notes:

- Run the program just pass it the name of the fasta sequence file, e.g.

Code:
%> fastaQual2fastq.pl foo.fasta
(assuming you saved the above code with the name 'fastaQual2fastq.pl')

- The fasta filename must end in either .fasta or .fna

- The quality filename must have the same basename as the fasta file and end with .qual. For example, if your sequence file is "foo.fna" then the quality file must be named "foo.qual".
Hi, kmcarr
Thanks for you help, the script has been worked wery well, has generated the fastq file in the sanger format, although in the stdout of the script gives this message:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Dou you know what happens, if it is important?
Thanks a lot
Eugeni is offline   Reply With Quote
Old 10-08-2009, 05:17 AM   #13
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Quote:
Originally Posted by Eugeni View Post
Hi, kmcarr
Thanks for you help, the script has been worked wery well, has generated the fastq file in the sanger format, although in the stdout of the script gives this message:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Dou you know what happens, if it is important?
Thanks a lot
Does the warning only appear once? How many entries are in your FASTA/QUAL files?
kmcarr is offline   Reply With Quote
Old 10-08-2009, 07:11 AM   #14
Eugeni
Junior Member
 
Location: spain

Join Date: May 2009
Posts: 4
Default

Quote:
Originally Posted by kmcarr View Post
Does the warning only appear once? How many entries are in your FASTA/QUAL files?
The warning appears associated to all sequences; i have 380185 fasta/qual entries
Eugeni is offline   Reply With Quote
Old 10-08-2009, 07:31 AM   #15
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Just a guess, but you could check your line endings (DOS/Windows versus Unix).
maubp is offline   Reply With Quote
Old 10-22-2009, 08:06 PM   #16
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Quote:
Originally Posted by Eugeni View Post
Hi, kmcarr
Thanks for you help, the script has been worked wery well, has generated the fastq file in the sanger format, although in the stdout of the script gives this message:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Dou you know what happens, if it is important?
Thanks a lot
Some of the quality values have extra spaces depending the number of digits. We just have to make sure there is exactly 1 space between
them:

--- fastaQual2fastaq.pl.orig 2009-10-22 22:05:24.000000000 -0500
+++ fastaQual2fastaq.pl 2009-10-22 22:04:54.000000000 -0500
@@ -33,6 +33,7 @@
chomp $qrecord;
my ($qdef, @qualLines) = split /\n/, $qrecord;
my $qualString = join ' ', @qualLines;
+ $qualString =~ s/\s+/ /g;
my @quals = split / /, $qualString;
print FASTQ "@","$qdef\n";
print FASTQ "$seqs{$qdef}\n";
drio is offline   Reply With Quote
Old 10-22-2009, 08:18 PM   #17
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.

Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.
Attached Files
File Type: pl fastaQual2fastq.pl (926 Bytes, 1228 views)

Last edited by kmcarr; 10-22-2009 at 08:22 PM.
kmcarr is offline   Reply With Quote
Old 10-23-2009, 02:13 AM   #18
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Seeing as the thread has shifted from SFF to FASTQ, to the easier task of FASTA+QUAL to FASTQ, here is a Biopython solution which will work on Biopython 1.51 or later:

Code:
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
handle = open("temp.fastq", "w") #w=write
records = PairedFastaQualIterator(open("example.fasta"), open("example.qual"))
count = SeqIO.write(records, handle, "fastq")
handle.close()
print "Converted %i records" % count
This example will be included in the next edition of the Biopython Tutorial. Adding simple command line parsing using sys.argv is left as an exercise for the reader

A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous

Peter
maubp is offline   Reply With Quote
Old 03-26-2010, 03:22 PM   #19
idas
Junior Member
 
Location: St. Louis, MO USA

Join Date: Mar 2010
Posts: 3
Default sff2fastq

To Whomever That Maybe Interested:

I have recently release a program called 'sff2fastq' onto github that does a direct SFF to FASTQ format conversion. 'sff2fastq' is implemented in the C language and should compile on *NIX type operating systems (Linux, BSD-type, & Mac OS X).

The FASTQ output produced is of the Sanger FASTQ format.

The source code & compilation instructions are available via the following github url:

http://github.com/indraniel/sff2fastq

If the git version control software is not available on your system please visit the following link for installation instructions:

http://help.github.com/git-installation-redirect

Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.

Sincerely,
Indraniel Das

The Genome Center at Washington University
idas is offline   Reply With Quote
Old 03-27-2010, 07:33 AM   #20
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539
Default

Quote:
Originally Posted by maubp View Post
A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous
This will be in Biopython 1.54 due out shortly (probably April 2010), and can be tested no if you install the latest Biopython from the repository. A simple Biopython script for SFF to FASTQ would be just:
Code:
from Bio import SeqIO
SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")
Or:
Code:
from Bio import SeqIO
SeqIO.convert("example.sff", "sff-trim", "trimmed.fastq", "fastq")
Note this does not handle paired end SFF files which requires the reads be analysed to look for the linker sequence. You can use sff_extract for that.
maubp is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:56 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO