
**lparsons** · 04-01-2009, 01:28 PM

It's a relatively simple change, since the Illumina format is now ASCII(phred+64) and the standard is ASCII(phred+33): the conversion just subtracts 31 from each quality character.

So, I added a function to the fq_all2std.pl script in the MAQ scripts subdirectory:

Code:

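sub sol2std2 {
    while (<>) {
        if (/^@/) {
            print;              # read name line, unchanged
            $_ = <>;
            print;              # sequence line, unchanged
            $_ = <>;            # '+' separator line, discarded
            $_ = <>;            # quality line, ASCII(phred+64)

            # Strip the trailing newline so it is not shifted along with the qualities
            chomp;
            my @t = split( '', $_ );
            my $qual = '';
            # 64 - 33 = 31: shift each character from phred+64 to phred+33
            $qual .= chr(ord($_) - 31) for (@t);
            print "+\n$qual\n";
        }
    }
}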

Then just add it as a valid command by adding

Code:

sol2std2    => \&sol2std2,

to the my %cmd_hash line.
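For readers who are not working inside the MAQ Perl script, the same shift can be sketched in Python. This is only an illustration of the idea above, not part of fq_all2std.pl: the function names `phred64_to_phred33` and `convert_fastq` are made up for this example, and it assumes strict four-line FASTQ records with no blank lines.

```python
import io

OFFSET_SHIFT = 64 - 33  # 31: distance between the Illumina 1.3 and Sanger offsets


def phred64_to_phred33(qual: str) -> str:
    """Shift every quality character down by 31 (phred+64 -> phred+33)."""
    for ch in qual:
        if ord(ch) < 64:
            # A character below '@' cannot be a phred+64 score.
            raise ValueError(f"{ch!r} is not a valid phred+64 quality character")
    return "".join(chr(ord(ch) - OFFSET_SHIFT) for ch in qual)


def convert_fastq(src, dst) -> None:
    """Copy 4-line FASTQ records from src to dst, re-encoding the quality line."""
    while True:
        header = src.readline()
        if not header:
            break  # end of input
        seq = src.readline()
        src.readline()  # '+' separator line; rewritten below as a bare '+'
        qual = src.readline().rstrip("\r\n")
        dst.write(header)
        dst.write(seq)
        dst.write("+\n" + phred64_to_phred33(qual) + "\n")


# Small in-memory demonstration with a made-up record.
record = "@read1\nACGT\n+read1\nhhD@\n"
out = io.StringIO()
convert_fastq(io.StringIO(record), out)
print(out.getvalue(), end="")
```

Unlike the Perl version, this sketch refuses characters below '@', which is one cheap way to notice that a file was already in phred+33 before shifting it a second time.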

**jkbonfield** · 04-03-2009, 08:13 AM

The main problem with this new format is that it's now nigh on impossible to tell the difference between phred+64 and logodds+64 formats without resorting to a large amount of statistical analysis on the file contents.
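To make the ambiguity concrete, here is a rough Python sketch of the kind of range check people use to guess the encoding. It is a heuristic only: the function name `guess_quality_encoding` is invented for this example, and it assumes Solexa scores never go below -5, so character 59 (';') is the lowest a +64 file can contain. Note that the last branch cannot separate the two +64 formats, which is exactly the problem described above.

```python
def guess_quality_encoding(quality_lines):
    """Return a best guess at the encoding from the range of characters seen."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        # Below ';' (59 = -5 + 64), so neither +64 variant can produce it.
        return "phred+33 (Sanger)"
    if lowest < 64:
        # ';' to '?' would be negative scores at offset 64, which only Solexa allows.
        return "probably solexa+64 (or phred+33 with uniformly high scores)"
    # Everything is '@' or above: the two +64 formats look the same here
    # (a phred+33 file with only very high scores would also land here).
    return "ambiguous: phred+64 or solexa+64"


print(guess_quality_encoding(["II%!"]))
print(guess_quality_encoding([";;hh"]))
print(guess_quality_encoding(["hhD@"]))
```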

It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!

James

**bioinfosm** · 04-06-2009, 08:47 AM

I totally second that thought. Mapping algorithms that expect one form of quality values will still give you mapped reads when given another, but the accuracy and efficiency can be very different.

**TylerBackman** · 04-06-2009, 12:19 PM

Originally posted by jkbonfield:

It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!

Fastq just needs to be standardized. It looks to me like everyone is eventually moving to Sanger/Phred scores for all fastq files; hopefully the next Illumina pipeline version will produce this as well.

**Layla** · 05-19-2009, 05:11 AM

sol2std2 function

Hello lparsons

While trying to convert my Solexa phred+64 ASCII qualities to phred+33, I realised I was unable to use ./maq sol2sanger. I came across your method and added this command to the fq_all2std.pl script in maq. However, when executing it, I get an error:

./fq_all2std.pl test.txt one.fastq
** Unrecognized command test.txt at ./fq_all2std.pl line 45.

#line 45 in the script is die("** Unrecognized command $cmd");

I added sol2std2 => \&sol2std2 to the my %cmd_hash line, and also added the subroutine to the script before the sub instruction { line.

Any help would be much appreciated

Cheers

L

**Layla** · 05-19-2009, 05:24 AM

command missed out

Oops, I had left out the sol2std2 command.

However, is there anywhere I can find out what the symbols (#, !, @, etc.) mean?

Cheers
L

**lparsons** · 05-19-2009, 07:40 AM

It sounds like you were able to get things working. Let me know if you are still having trouble.

As for understanding the meaning of the symbols, do you mean you would like to get the corresponding numerical qualities? If so, you could modify the script to output the numeric qualities or just look at an ASCII table and subtract the appropriate value.

If you would like an explanation of what the numbers mean, you could start here: http://maq.sourceforge.net/qual.shtml
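As a small illustration of "look at an ASCII table and subtract the appropriate value", here is a Python sketch (Python rather than the Perl used elsewhere in the thread) that turns quality symbols into numbers. The function name `decode_qualities` is made up for this example, and the offset must be chosen by the reader: 33 for Sanger-style files, 64 for pipeline 1.3 files.

```python
def decode_qualities(qual: str, offset: int = 33):
    """Return (character, score, error probability) for each quality character."""
    rows = []
    for ch in qual:
        score = ord(ch) - offset          # ASCII code minus the encoding offset
        error_prob = 10 ** (-score / 10)  # phred definition: Q = -10 * log10(P)
        rows.append((ch, score, error_prob))
    return rows


# Decode a few symbols as if they came from a phred+33 file.
for ch, score, p in decode_qualities("#!@I", offset=33):
    print(f"{ch!r}: Q{score}, P(error) = {p:.4g}")
```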

**bioinfosm** · 05-19-2009, 11:30 AM

I used maq to call SNPs on a dataset. Using sol2sanger I get 800-odd SNPs reported after the recommended filtering. However, not using sol2sanger gives a whopping 11000-odd SNP calls, with the rest of the pipeline remaining the same!

These are Solexa v1.3 generated reads, and I am not sure why there is this huge difference, or which one to trust.

**Layla** · 05-20-2009, 01:27 AM

hi bioinfosm

I can try to help you, but someone correct me if I am wrong.

Solexa v1.3 reads are phred 64 probability scores instead of absolute base values. These need to be converted to phred 33 probabilities.

The sol2sanger command is OK for converting the absolute base values to phred 33, but not suitable for converting phred 64 to phred 33 unless you adjust the fq_all2std.pl script using lparsons' method, which worked for me.

Phred scores are probability scores of how likely it is that the called nucleotide is correct, and you would need to adjust the v1.3 probability scores to this standard Sanger format before using maq.

HTH
L

**Layla** · 05-20-2009, 02:59 AM

Thanks lparsons. I stumbled upon a PDF table online showing what the symbols mean. Do you have any idea how maq handles N's? I have reads with many N's and was thinking of eliminating reads with 20 or more N's from the raw Solexa data before I do any conversions with maq.

Cheers
L

**kmcarr** · 05-20-2009, 04:58 AM

Originally posted by bioinfosm:

I used maq to call SNPs on a dataset. Using sol2sanger I get 800-odd SNPs reported after the recommended filtering. However, not using sol2sanger gives a whopping 11000-odd SNP calls, with the rest of the pipeline remaining the same!

These are Solexa v1.3 generated reads, and I am not sure why there is this huge difference, or which one to trust.

Trust the first one, using the sol2sanger conversion. The pipeline 1.3 scores are represented as ASCII(phred+64). Maq is expecting the qualities to be represented in the Sanger manner of ASCII(phred+33). If you do not first run sol2sanger, then when Maq encounters, for example, a 'D' (ASCII=68) in the quality string, it will subtract 33 from this and give it a phred score of 35, which is pretty darn good. But since the file was still in Illumina FASTQ format, the true phred score is 4 (68-64), which is pretty darn bad. By not running the file through sol2sanger you have essentially added 31 to the phred score of each and every base. Because Maq believes every mismatch it sees comes from a high-quality base call, it will call them as SNPs, but they are really just sequencing errors.
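The arithmetic for the 'D' example can be checked with a few lines of Python. This is just a worked illustration; the helper name `phred_error_probability` is invented here, and the error probabilities follow from the standard phred definition Q = -10 * log10(P).

```python
def phred_error_probability(q: float) -> float:
    """Probability that a base call is wrong, from its phred score."""
    return 10 ** (-q / 10)


ch = "D"
as_sanger = ord(ch) - 33    # how Maq reads it when the file is not converted
as_illumina = ord(ch) - 64  # what the pipeline 1.3 file actually meant

print(f"read as phred+33: Q{as_sanger}, P(error) = {phred_error_probability(as_sanger):.2g}")
print(f"read as phred+64: Q{as_illumina}, P(error) = {phred_error_probability(as_illumina):.2g}")
print(f"inflation per base: {as_sanger - as_illumina} phred units")
```

A base that is wrong roughly 40% of the time is treated as one that is wrong about 3 times in 10,000, which is consistent with the jump in SNP calls reported above.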

**bioinfosm** · 05-20-2009, 08:27 AM

Thanks kmcarr. One follow-up query: what were the pipeline 1.1 scores, then? I heard that there has been a change in Solexa's FASTQ qualities.

**kmcarr** · 05-20-2009, 09:28 AM

If we call the current (pipeline 1.3.2) Q(phred)+64 then the previous version could be called Q(solexa)+64. The difference between Phred and Solexa qualities has been well described by Heng Li in the documentation of his Maq package (http://maq.sourceforge.net/qual.shtml). These differ most significantly at the low end, with Q(solexa) allowing negative numbers. At Q scores above ~11 the two are essentially identical.
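The relationship between the two scales can be sketched numerically. The formula below is the Solexa-to-phred conversion as I understand it from the Maq documentation linked above, Q(phred) = 10 * log10(10^(Q(solexa)/10) + 1); the function name `solexa_to_phred` is made up for this Python example.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa-scaled quality to the equivalent phred quality."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# Show how the gap between the two scales shrinks as quality rises.
for q in (-5, 0, 5, 10, 11, 20, 40):
    phred = solexa_to_phred(q)
    print(f"Q(solexa) = {q:>3}  ->  Q(phred) = {phred:6.2f}  (difference {phred - q:.2f})")
```

The difference is several units for negative Solexa scores and falls below half a unit by the low teens, which is why treating the two as interchangeable mainly affects low-quality bases.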

Technically, the sol2sanger conversion is meant to convert Q(solexa)+64 into Q(phred)+33. Applied to pipeline 1.3 data, which is already Q(phred)+64, it will introduce slight errors in the scores assigned to low-quality bases. I actually added a new command and subroutine to the fq_all2std.pl script to deal with Solexa FASTQ from v1.3.2.

Add a new command named "solP2std" to the %cmd_hash:

solP2std=>\&solP2std,

Add the following to create a hash to convert from Q(phred)+64 to Q(phred)+33.

--

my %solP2stdP;
for (64..126) {
    $solP2stdP{chr($_)} = chr($_-31);
}

--

Add the following subroutine to do the conversion:

--

sub solP2std {
    while (<>) {
        if (/^@/) {
            print;          # read name line
            $_ = <>;
            print;          # sequence line
            $_ = <>;        # '+' separator line, discarded
            $_ = <>;        # quality line
            chomp;
            my @t = split('', $_);
            my $qual = '';
            $qual .= $solP2stdP{$_} for (@t);
            print "+\n$qual\n";
        }
    }
}

--
[Arrg! Stupid whitespace stripping messing up my code.]

To use this on a fastq produced by the v1.3.2 pipeline:

fq_all2std.pl solP2std mySolexa_1.3.2_File.fastq > myStandardSanger_File.fastq

**bioinfosm** · 05-20-2009, 11:05 AM

Originally posted by Layla:

hi bioinfosm

I can try to help you, but someone correct me if I am wrong.

Solexa v1.3 reads are phred 64 probability scores instead of absolute base values. These need to be converted to phred 33 probabilities.

The sol2sanger command is OK for converting the absolute base values to phred 33, but not suitable for converting phred 64 to phred 33 unless you adjust the fq_all2std.pl script using lparsons' method, which worked for me.

Phred scores are probability scores of how likely it is that the called nucleotide is correct, and you would need to adjust the v1.3 probability scores to this standard Sanger format before using maq.

HTH
L

Thanks Layla. As I understand it, after converting phred 64 to phred 33 there is no need to run sol2sanger, and one can directly convert the reads to bfq and run maq map.


Illumina pipeline 1.3 fastq and Maq sol2sanger
