SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
For MAQ: Is there a Tool to convert sanger-format fastq file to illumina-fotmat fastq byb121 Bioinformatics 6 12-20-2013 01:26 AM
Maq - sol2sanger problem - different sizes for the pair? cliff Bioinformatics 20 01-13-2010 04:48 AM
Maq v Pipeline vasvale Bioinformatics 1 09-18-2009 09:37 AM
Maq sol2sanger and solexa paired end reads Rick Bioinformatics 3 05-19-2009 03:14 AM

Reply
 
Thread Tools
Old 04-01-2009, 06:33 AM   #1
Rick
Junior Member
 
Location: UK

Join Date: Aug 2008
Posts: 2
Default Illumina pipeline 1.3 fastq and Maq sol2sanger

Hi,

I see that the Maq sol2sanger (v0.7.1) has not yet been updated to handle the new pipeline fastq files (now has Phred scores) from pipeline 1.3

Does anyone have a handy script / advice for formatting the illumina fastq files for Maq??

thanks folks

Rick
Rick is offline   Reply With Quote
Old 04-01-2009, 01:28 PM   #2
lparsons
Member
 
Location: NJ

Join Date: Nov 2008
Posts: 28
Default

It's a relatively simple change, since the format is now (phred+64) and the standard is (phred+33).

So, I added a function to the fq_all2std.pl script in the MAQ scripts subdirectory:

Code:
sub sol2std2 {
	my $max = 0;
	while (<>) {
		if (/^@/) {
			print;
			$_ = <>;
			print;
			$_ = <>;
			$_ = <>;

			# Added to eliminate carriage return conversion
			chomp;
			my @t = split( '', $_ );
			my $qual = '';
			$qual .= chr(ord($_) - 31) for (@t);
			print "+\n$qual\n";
		}
	}
}
Then just add it as a valid command by adding

Code:
sol2std2    => \&sol2std2,
to the my %cmd_hash line.
lparsons is offline   Reply With Quote
Old 04-03-2009, 08:13 AM   #3
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

The main problem with this new format is that it's now nigh on impossible to tell the difference between phred+64 and logodds+64 formats without resorting to a large amount of statistical analysis on the file contents.

It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!

James
jkbonfield is offline   Reply With Quote
Old 04-06-2009, 08:47 AM   #4
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

I totally second that thought. Mapping algorithms that expect some form of Quality values, given others, still give you mapped reads! But the accuracy and efficiency can be very different..
bioinfosm is offline   Reply With Quote
Old 04-06-2009, 12:19 PM   #5
TylerBackman
Member
 
Location: Riverside, CA

Join Date: Oct 2008
Posts: 13
Default

Quote:
Originally Posted by jkbonfield View Post
It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!
Fastq just needs to be standardized. It looks to me like everyone is eventually moving to Sanger/Phred scores for all fastq files; hopefully the next Illumina pipeline version will produce this as well.
TylerBackman is offline   Reply With Quote
Old 05-19-2009, 05:11 AM   #6
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
Default sol2std2 function

Hello lparsons

Trying to convert my solexa phred 64 qualities in ascii format to phred 33 I realised I was unable to use ./maq sol2sanger. I came across your method and added this command into the fq_all2std.pl script in maq. However when executing, I get an error
./fq_all2std.pl test.txt one.fastq
** Unrecognized command test.txt at ./fq_all2std.pl line 45.

#line 45 in the script is die("** Unrecognized command $cmd");

I added sol2std2 => \&sol2std2 into the my %cmd_hash line and also the script before the sub instruction { command.

Any help would be much appreciated

Cheers

L
Layla is offline   Reply With Quote
Old 05-19-2009, 05:24 AM   #7
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
Default command missed out

oops i was missing specifying the sol2std2 command

However is there anywhere where I can understand the meaning of the #,!@ etc etc symbols?

Cheers
L
Layla is offline   Reply With Quote
Old 05-19-2009, 07:40 AM   #8
lparsons
Member
 
Location: NJ

Join Date: Nov 2008
Posts: 28
Default

It sounds like you were able to get things working. Let me know if you are still having trouble.

As for understanding the meaning of the symbols, do you mean you would like to get the corresponding numerical qualities? If so, you could modify the script to output the numeric qualities or just look at an ASCII table and subtract the appropriate value.

If you would like an explanation of what the numbers mean, you could start here: http://maq.sourceforge.net/qual.shtml
lparsons is offline   Reply With Quote
Old 05-19-2009, 11:30 AM   #9
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

I used maq to call SNPs on a dataset. Using sol2sanger I get 800 odd SNPs reported after the recommended filtering. However, not using sol2sanger gives a whooping 11000 odd SNP calls, al other pipeline remaining same!

These are solexa v1.3 generated reads .. and I am not sure why this huge difference, and which one to trust
bioinfosm is offline   Reply With Quote
Old 05-20-2009, 01:27 AM   #10
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
Default

hi bioinfosm

I can try and help you but someone correct me if I am wrong .

Solexa v1.3 reads are phred 64 probability scores instead of absolute base values. These need to be converted to phred 33 probabilities.

The sol2sanger is ok for converting the absolute base values to phred 33. but not suitable for converting phred 64 to phred 33 unless you adjust the fq_all2std.pl script using lparsons which method worked for me.

Phred scores probability scores of how correct the nucleotide is that has been added and you would need to adjust the v1.3 probability scores to this standard sanger format before using maq.

HTH
L
Layla is offline   Reply With Quote
Old 05-20-2009, 02:59 AM   #11
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
Default

thanx lparsons, I stumbled upon a pdf table showing what the symbols means, it was a pdf i found online. Do you have any idea how maq handles N's? I have reads with many N's and was thinking to eliminate reads where N=>20 from the raw solexa data before I do any conversions with maq.....

Cheers
L
Layla is offline   Reply With Quote
Old 05-20-2009, 04:58 AM   #12
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by bioinfosm View Post
I used maq to call SNPs on a dataset. Using sol2sanger I get 800 odd SNPs reported after the recommended filtering. However, not using sol2sanger gives a whooping 11000 odd SNP calls, al other pipeline remaining same!

These are solexa v1.3 generated reads .. and I am not sure why this huge difference, and which one to trust
Trust the first one, using the sol2sanger conversion. The pipeline 1.3 scores are represented as ASCII(phred+64). Maq is expecting the qualities to be represented in the Sanger manner of ASCII(phred+33). If you do not first run sol2sanger, then when Maq encounters, for example, a 'D' (ASCII=68) in the quality string it will subtract 33 from this and give it a phred score of 35, which is pretty darn good. But since the file was still in Illumina FASTQ format the true phred score is 4 (68-64) which is pretty darn bad. By not running the file through sol2sanger you have essentially added 31 to the phred score of each and every base. Since Maq believes every mismatch it sees are from high quality base calls it will call them as SNPs but they are really just sequencing errors.
kmcarr is offline   Reply With Quote
Old 05-20-2009, 08:27 AM   #13
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Thanks kmcarr.. the one follow-up query is, what were the pipeline 1.1 scores then? I heard that there has been a change in solexa's fastq qualities..
bioinfosm is offline   Reply With Quote
Old 05-20-2009, 09:28 AM   #14
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

If we call the current (pipeline 1.3.2) Q(phred)+64 then the previous version could be called Q(solexa)+64. The difference between Phred and Solexa qualities has been well described by Heng Li in the documentation of his Maq package (http://maq.sourceforge.net/qual.shtml). These differ most significantly at the low end, with Q(solexa) allowing negative numbers. At Q scores above ~11 the two are essentially identical.

Technically the sol2sanger conversion is meant to convert Q(solexa)+64 into Q(phred)+33. There will be slight errors in the scores assigned for low quality bases. I actually added a new command and subroutine to the fq_all2std.pl script to deal with Solexa FASTQ from v1.3.2.

Add a new command named "solP2std" to the %cmd_hash:

solP2std=>\&solP2std,

Add the following to create a hash to convert from Q(phred)+64 to Q(phred)+33.

--

my %solP2stdP;
for (64..126) {
$solP2stdP{chr($_)} = chr($_-31);
}

--

Add the following subroutine to do the conversion:

--

sub solP2std {
while (<>) {
if (/^@/) {
print;
$_ = <>; print; $_ = <>; $_ = <>;
chomp;
my @t = split('', $_);
my $qual = '';
$qual .= $solP2stdP{$_} for (@t);
print "+\n$qual\n";
}
}
}

--
[Arrg! Stupid whitespace stripping messing up my code.]

To use this on a fastq produced by the v1.3.2 pipeline:

fq_all2std solP2std mySolexa_1.3.2_File.fastq > myStandardSanger_File.fastq

Last edited by kmcarr; 05-20-2009 at 09:32 AM.
kmcarr is offline   Reply With Quote
Old 05-20-2009, 11:05 AM   #15
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

Quote:
Originally Posted by Layla View Post
hi bioinfosm

I can try and help you but someone correct me if I am wrong .

Solexa v1.3 reads are phred 64 probability scores instead of absolute base values. These need to be converted to phred 33 probabilities.

The sol2sanger is ok for converting the absolute base values to phred 33. but not suitable for converting phred 64 to phred 33 unless you adjust the fq_all2std.pl script using lparsons which method worked for me.

Phred scores probability scores of how correct the nucleotide is that has been added and you would need to adjust the v1.3 probability scores to this standard sanger format before using maq.

HTH
L
Thanks Layla. As I understand, after converting phred 64 to phred 33, there is no need to run sol2sanger, and one can directly convert the reads to bfq and run maq map...
bioinfosm is offline   Reply With Quote
Old 05-20-2009, 11:41 AM   #16
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

my previous workflow to go from solexa's sequence.txt to maq mapping went like this

maq sol2sanger s_1_sequence.txt w
maq fastq2bfq w reads-1.bfq 2> /dev/null
maq map -u lane1.unmapped reads-1.map maq.bfa reads-1.bfq 2> reads-1.map.log

and i would think that sol2sts2 would run on sequence.txt and the output would directly go to maq map..
bioinfosm is offline   Reply With Quote
Old 05-21-2009, 12:08 AM   #17
seq_GA
Senior Member
 
Location: Asiana

Join Date: Feb 2009
Posts: 124
Default

Can anyone clearly say that with solexa pipeline 1.3.4 output, do we need to convert 'maq sol2sanger s_1_sequence.txt s_1_sequence.fastq' and then proceed to 'maq fastq2bfq' command?

Thanks.
seq_GA is offline   Reply With Quote
Old 05-21-2009, 01:47 AM   #18
maising
Junior Member
 
Location: UK

Join Date: Mar 2009
Posts: 1
Default

Quote:
Originally Posted by jkbonfield View Post
The main problem with this new format is that it's now nigh on impossible to tell the difference between phred+64 and logodds+64 formats without resorting to a large amount of statistical analysis on the file contents.

It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!

James
The rationale for moving to phred+64 was that
a) the phred scale is more standard and the logodds don't add much benefit for single quality scores (instead of 4 scores per base)
b) the offset of 64 is what previous versions of the pipeline produced, so the idea was to follow the principle of least surprise.

Also the assumption was that for practical purposes, there is very little difference between logodds and phred when applied to a single quality value, so existing conversion scripts could easily be modified to accommodate qseq files.

Replacing fastq with something better would certainly help.

Cheers
Klaus

Disclaimer: I work at Illumina.
maising is offline   Reply With Quote
Old 05-22-2009, 02:58 PM   #19
caddymob
Member
 
Location: USA

Join Date: Apr 2009
Posts: 36
Default

Thanks for your help on this lparsons and everyone! Kinda sucks I have to go back and rerun some analysis, but thats the beauty of making bash scripts I guess.

EDIT: Just to clarify. Once I do sol2std2 my data is now in standard/Sanger FASTQ format -- I SHOULD NOT do sol2sanger. Right? Think I'm on board...

Last edited by caddymob; 05-22-2009 at 03:07 PM.
caddymob is offline   Reply With Quote
Old 09-11-2009, 02:06 AM   #20
Youri
Junior Member
 
Location: holland

Join Date: Sep 2009
Posts: 2
Default

Ok I'm pretty new to this, and I cant get the code to work (both Iparsons and kmcarr's). I've edited the fq_all2std.pl correctly, but cant get the command right. So what is the exact command that has to be used?

Youri.
Youri is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:37 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO