SEQanswers

Old 06-17-2008, 05:48 AM   #1
baohua100
Senior Member
 
Location: Canada

Join Date: Jun 2008
Posts: 103
Question Questions about solexa quality score!

reads.fq file:

@4:1:518:715
GATACCATAAAAGCTGGATCCTTCTTCAAGCATAA
+4:1:518:715
hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdhhP

1. How do I convert a quality character (like 'e' or 'h') to a quality score?

2. What does this score mean, and how is it computed (what is the formula)?
Old 06-17-2008, 06:26 AM   #2
Farhat
Member
 
Location: Pune, India

Join Date: Apr 2008
Posts: 21
Default

For a Fastq file, if the quality character is $q the corresponding Phred quality can be calculated with the following Perl code:

$Q = ord($q) - 33;
Old 06-17-2008, 07:04 AM   #3
SoupDragon
Junior Member
 
Location: UK

Join Date: Jun 2008
Posts: 1
Default

This is correct if you are using quality scores encoded in "fastq" format. I believe the Illumina pipeline uses a different ASCII offset (64), according to their pipeline documentation. A value of zero = ASCII 64 ('@'). The ASCII value for a qv is therefore qv+64. So "h" = 104 - 64 = 40.
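
As a rough illustration of the two offsets discussed here, the following Python sketch (the thread itself uses Perl; `decode_quality` is a made-up helper name) decodes the quality line from the first post both ways. It only assumes that a score is the character's ASCII code minus the offset, as stated above.

```python
def decode_quality(qual, offset):
    """Turn each quality character into an integer by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in qual]

# Quality line copied from the first post in this thread.
qual = "hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdhhP"

illumina_scores = decode_quality(qual, 64)  # offset 64, as described for the Illumina pipeline
sanger_scores = decode_quality(qual, 33)    # offset 33, as in the Perl one-liner above

print("offset 64:", min(illumina_scores), "to", max(illumina_scores))
print("offset 33:", min(sanger_scores), "to", max(sanger_scores))
```

With offset 64 the values stay within the 40 ceiling mentioned in this thread, while offset 33 pushes 'h' to 71, which is why the second reading looks implausible for this file.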
Old 06-17-2008, 07:10 AM   #5
Farhat
Member
 
Location: Pune, India

Join Date: Apr 2008
Posts: 21
Default

Quote:
Originally Posted by SoupDragon View Post
This is correct if you are using quality scores encoded in "fastq" format. I believe the Illumina pipeline uses a different ASCII offset (64), according to their pipeline documentation. A value of zero = ASCII 64 ('@'). The ASCII value for a qv is therefore qv+64. So "h" = 104 - 64 = 40.
You are right. 'h' would make the quality way beyond 40 by my calculation (104 - 33 = 71).
Old 06-17-2008, 05:02 PM   #6
baohua100
Senior Member
 
Location: Canada

Join Date: Jun 2008
Posts: 103
Default

Thanks.

What's the range of this score? (0 to 40?)

What's the meaning of this score?
Old 06-17-2008, 05:07 PM   #7
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default Solexa Quality Score

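The range is from -5 to 40.

If P is the probability that the called base is correct, then the Solexa quality is 10 * log10(P / (1 - P)).

A quality of -5 corresponds to P = 0.25 (10 * log10(0.25 / 0.75) is about -4.77, which rounds to -5).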
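
A small Python sketch of that formula, assuming P means the probability that the called base is correct (the function name `solexa_quality` is just for illustration):

```python
import math

def solexa_quality(p_correct):
    """Solexa quality 10*log10(P/(1-P)) for 0 < P < 1."""
    return 10 * math.log10(p_correct / (1 - p_correct))

# P = 0.25 is the floor (a coin toss among four bases); P = 0.9999 is near the ceiling.
for p in (0.25, 0.5, 0.9999):
    print(p, round(solexa_quality(p), 2))
```

P = 0.5 gives exactly 0, so negative scores mean the call is more likely wrong than right.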
Old 06-18-2008, 07:36 AM   #8
Farhat
Member
 
Location: Pune, India

Join Date: Apr 2008
Posts: 21
Default

Quote:
Originally Posted by sparks View Post
The range is from -5 to 40

If P is the probability that the called base is correct, then the Solexa quality is 10 * log10(P / (1 - P)).

A quality of -5 corresponds to P = 0.25
In my datasets the range has been from -40 to 40.
Old 06-18-2008, 05:37 PM   #9
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default Quality Score Range

Farhat's right for the Solexa prb files from the base caller, but for the fastq-format files the OP asked about, the range should be -5 to 40.
Old 06-19-2008, 10:15 AM   #10
Farhat
Member
 
Location: Pune, India

Join Date: Apr 2008
Posts: 21
Default

Quote:
Originally Posted by sparks View Post
Farhat's right for the Solexa prb files from the base caller, but for the fastq-format files the OP asked about, the range should be -5 to 40.
Yes, that's right, because in a Solexa PRB file the probability of A, C, G or T is given separately and can be really low, whereas in fastq only the called base is scored, so the lowest probability is 0.25, which implies equal probability for any nucleotide.
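
To see why the two ranges differ, here is a hedged Python sketch. The four probabilities are invented for illustration and are not taken from any real prb file; the only assumptions are the Solexa formula quoted earlier and that the called base is the most probable one.

```python
import math

def solexa_quality(p):
    """Solexa quality 10*log10(p/(1-p)) for a base with probability p."""
    return 10 * math.log10(p / (1 - p))

# Hypothetical probabilities for one cycle; they sum to 1.
probs = {"A": 0.9997, "C": 0.0001, "G": 0.0001, "T": 0.0001}

# A prb-style view scores every base, so unlikely bases get very negative values.
per_base = {base: round(solexa_quality(p)) for base, p in probs.items()}

# A fastq-style view keeps only the called (most probable) base.
called = max(probs, key=probs.get)

# The called base can never have p below 0.25, which bounds its score from below.
worst_case = round(solexa_quality(0.25))

print(per_base)
print(called, per_base[called])
print(worst_case)
```

The uncalled bases land at -40 here, while the score that would reach a fastq file cannot drop below the P = 0.25 floor.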
Old 07-18-2008, 11:29 PM   #11
baohua100
Senior Member
 
Location: Canada

Join Date: Jun 2008
Posts: 103
Default

$sQ = -10 * log($e / (1 - $e)) / log(10);   # Perl's log() is the natural log, so divide by log(10)

when $sQ = 40, $e = 0.0001

when $sQ = 0, $e = 0.5

0.5 > 0.25

when $sQ = -4, $e = 0.72

What's the probability of error?
Old 07-23-2008, 12:25 AM   #12
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

If you are talking about fastq format and have a quality of -4, then the probability of the called base is 0.28 and the probability that it is any one of the other 3 bases is 0.72.

If you see a -4 in a prb format file then the probability of the base is 0.28 and the other bases will each have their own prb/qual value.
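
The numbers in this exchange can be checked by inverting the Solexa formula. Solving Q = 10*log10((1-e)/e) for the error probability e gives e = 1 / (1 + 10^(Q/10)). A minimal Python sketch, with `error_probability` as an illustrative name:

```python
def error_probability(solexa_q):
    """Probability that the called base is wrong, from e = 1 / (1 + 10**(Q/10))."""
    return 1 / (1 + 10 ** (solexa_q / 10))

for q in (40, 0, -4):
    e = error_probability(q)
    # Print the score, the error probability, and the probability the call is right.
    print(q, round(e, 4), round(1 - e, 4))
```

For Q = -4 this reproduces the 0.72 and 0.28 quoted above.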
Old 09-22-2008, 02:05 PM   #13
mikertesz
Junior Member
 
Location: Israel

Join Date: Sep 2008
Posts: 1
Default "quala" files

The output of a Solexa run generated a "quala" file of the following format:

>sequence_0
40 40 40 19 7 40 40 40 40 40 31 40 40 40 40 40 40 40 40 40 40 11 40 40 40
36 40 12 40 21 39 1 4 40 40 15 40 40 4 40 40 10 40 40 40 40 40 2 4 10
1

>sequence_1
40 40 8 13 12 40 40 40 40 17 27 40 25 17 4 40 40 40 21 40 40 37 40 40 37
4 40 33 40 25 40 3 20 40 40 20 40 40 4 40 8 7 40 40 15 4 10 1 5 20
1

etc...

Does anybody know what those numbers mean? Are those simply the Solexa quality scores per base? The range seems to be 1-40. Why isn't it -5 to 40 as in fastq?
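
For anyone who wants to check the range in their own file, here is a hedged Python sketch that reads the layout shown above: a '>' header line followed by whitespace-separated integers that may wrap over several lines. `parse_quala` is a hypothetical helper and the sample is shortened from the post; nothing here says what the numbers mean.

```python
def parse_quala(text):
    """Map each '>name' header to the list of integer scores that follow it."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].extend(int(tok) for tok in line.split())
    return records

# Shortened sample in the same layout as the post.
sample = """>sequence_0
40 40 40 19 7
36 40 12 40 21

>sequence_1
40 40 8 13 12
"""

records = parse_quala(sample)
for name, scores in records.items():
    print(name, len(scores), min(scores), max(scores))
```

Running this over a whole file and looking at the overall minimum would show whether values below 1 ever occur.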
Old 10-07-2008, 08:57 AM   #14
vruotti
Member
 
Location: US

Join Date: Feb 2008
Posts: 13
Default Fastq file outside of GERALD

Hi,
Does anyone know an easy way or an existing program to convert all the .prb files from one particular lane into one fastq file? Similar to the s_1_sequence.txt file but with no filters applied?
We have tried hacking around the Perl scripts within GERALD, but it looks like you need an intermediate seqpre.tmp file, which I think gets deleted after GERALD completes.

We know this is possible by just running GERALD with the fastq parameter. However, we would like to generate a fastq file that is not affected by GERALD's filters. That way we can set up our own quality filters.

Any ideas?
Do I go ahead and write one?

Thanks,
Victor
Old 10-07-2008, 09:11 AM   #15
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

I made my own very simple script, but there's also a script of James Bonfield's here:

http://seqanswers.com/forums/showthread.php?t=282

The only problem that I see is this line

foreach (glob("$fn/*seq.txt")) {

which is going to get every single seq.txt file in the directory, not just the ones from a single lane. So you'll have to fix that.
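
One way to narrow the match to a single lane is to put the lane number into the pattern. The sketch below is in Python rather than the Perl of the original script, and it assumes the files are named like s_<lane>_<tile>_seq.txt; that naming, the helper `lane_seq_files`, and the demo file names are assumptions, so check them against your own run folder.

```python
import tempfile
from pathlib import Path

def lane_seq_files(directory, lane):
    """Return the seq.txt files for one lane, assuming names like s_<lane>_<tile>_seq.txt."""
    return sorted(Path(directory).glob(f"s_{lane}_*_seq.txt"))

# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("s_1_0001_seq.txt", "s_1_0002_seq.txt", "s_2_0001_seq.txt"):
        (Path(tmp) / fname).touch()
    lane1_names = [p.name for p in lane_seq_files(tmp, 1)]

print(lane1_names)
```

The same idea in the Perl script would be to change the glob pattern so it includes the lane prefix instead of matching every seq.txt file.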
Old 10-07-2008, 09:38 AM   #16
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,169
Default

Victor,

Run GERALD including the following line in the GERALD configuration file:

QF_PARAMS '(1==1)'

This is a conditional which is true 100% of the time; in other words, GERALD passes every read.

(This technique comes from the Pipeline User Guide)

Last edited by kmcarr; 10-07-2008 at 09:44 AM. Reason: To correct line and add attribution.
Old 10-10-2008, 08:05 AM   #17
ShaunMahony
Member
 
Location: University Park, PA

Join Date: Apr 2008
Posts: 27
Default

As said earlier in this thread (and also in the Solexa documentation), Solexa quality scores in their fastq-like format are given by 10*log_10(P/(1-P)), where P is the probability that the base call is correct. I thought it might be useful for some people if I posted a lookup table based on this. Note that I'm giving the probability that a base is erroneous (1-P), rounded to four decimal places. Please post a reply if you think this table is an incorrect translation:

Char ASCII Char-64 P(error)
; 59 -5 0.7597
< 60 -4 0.7153
= 61 -3 0.6661
> 62 -2 0.6131
? 63 -1 0.5573
@ 64 0 0.5000
A 65 1 0.4427
B 66 2 0.3869
C 67 3 0.3339
D 68 4 0.2847
E 69 5 0.2403
F 70 6 0.2008
G 71 7 0.1663
H 72 8 0.1368
I 73 9 0.1118
J 74 10 0.0909
K 75 11 0.0736
L 76 12 0.0594
M 77 13 0.0477
N 78 14 0.0383
O 79 15 0.0307
P 80 16 0.0245
Q 81 17 0.0196
R 82 18 0.0156
S 83 19 0.0124
T 84 20 0.0099
U 85 21 0.0079
V 86 22 0.0063
W 87 23 0.0050
X 88 24 0.0040
Y 89 25 0.0032
Z 90 26 0.0025
[ 91 27 0.0020
\ 92 28 0.0016
] 93 29 0.0013
^ 94 30 0.0010
_ 95 31 0.0008
` 96 32 0.0006
a 97 33 0.0005
b 98 34 0.0004
c 99 35 0.0003
d 100 36 0.0003
e 101 37 0.0002
f 102 38 0.0002
g 103 39 0.0001
h 104 40 0.0001
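
For anyone who wants to double-check the table, it can be regenerated with a few lines of Python. This is a sketch under the same assumptions as the table itself: offset 64, scores from -5 to 40, and P(error) = 1 / (1 + 10^(Q/10)), which is the Solexa formula solved for the error probability.

```python
def error_probability(solexa_q):
    """Error probability implied by a Solexa quality score."""
    return 1 / (1 + 10 ** (solexa_q / 10))

# One row per score: character, ASCII code, score, rounded error probability.
rows = [
    (chr(q + 64), q + 64, q, round(error_probability(q), 4))
    for q in range(-5, 41)
]

print("Char ASCII Char-64 P(error)")
for char, code, q, p_err in rows:
    print(f"{char} {code} {q} {p_err:.4f}")
```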
Old 10-10-2008, 11:19 AM   #18
vruotti
Member
 
Location: US

Join Date: Feb 2008
Posts: 13
Default More on Quality

Hello,
We are looking a little closer at the quality of one of our runs. Interestingly, we see a pattern in most of our runs right at the 30th cycle. The information from the graph below comes from the s_N_export.txt files. Please ignore the graph from lane 4. This was a failed lane. The others however, including our control (lane 8) show this pattern.

This was an IPAR run with the upgraded GAII and was one of our best runs. Other runs also show this pattern at the 30th cycle. Does anyone know the reason why the qualities drop so much after the 30th cycle? Have you seen this before in any of your runs?

Thanks in advance.
Victor

Last edited by vruotti; 10-10-2008 at 01:52 PM.
Old 11-20-2008, 12:29 PM   #19
TylerBackman
Member
 
Location: Riverside, CA

Join Date: Oct 2008
Posts: 13
Default

Quote:
Originally Posted by vruotti View Post
Does anyone know the reason why the qualities drop so much after the 30th cycle?
This is most likely because the qualities you are looking at are alignment normalized, and a large number of your sequences failed to align to the reference genome (due to a ligated adapter, etc.)

Take a look at the un-normalized scores (s_<lane>_qraw.txt) instead, I think you'll find that the curve is more continuous between cycles.
Old 02-18-2009, 07:29 AM   #20
baohua100
Senior Member
 
Location: Canada

Join Date: Jun 2008
Posts: 103
Default

fastq file:
@I326_2_FC306FCAAXX:8:1:50:985
ATGTCCGAAGGGCAGTCTCAAGTGGTAAAATGGAT
+I326_2_FC306FCAAXX:8:1:50:985
hhhWhhhchhhhhahShh\PO]LgXZXPNLUTZNO


MAQ alignment output:

I326_2_FC306FCAAXX:8:1:50:985 1 1 + 0 0 99 99 99 0 0 1 0 35 ATGTCCGAAGGGCAGTCTCAAGTGGTAAAATGGAT ```W```````````S``\PO]L`XZXPNLUTZNO

what's the meaning of ```W```````````S``\PO]L`XZXPNLUTZNO ?

It is not the same as the quality string in the fastq file: hhhWhhhchhhhhahShh\PO]LgXZXPNLUTZNO
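
A quick way to see exactly what changed between the two strings is to compare them position by position. The Python sketch below only describes the difference; it does not claim to know what MAQ did internally. The MAQ string is written in pieces so the runs of backticks are easy to count.

```python
# Quality string from the fastq record above.
fastq_qual = r"hhhWhhhchhhhhahShh\PO]LgXZXPNLUTZNO"

# Quality string from the MAQ output above, assembled from its backtick runs.
maq_qual = "`" * 3 + "W" + "`" * 11 + "S" + "`" * 2 + r"\PO]L" + "`" + "XZXPNLUTZNO"

# Distinct (fastq char, MAQ char) pairs where the two strings disagree.
changed = sorted({(a, b) for a, b in zip(fastq_qual, maq_qual) if a != b})

# Characters that came through untouched.
unchanged = {a for a, b in zip(fastq_qual, maq_qual) if a == b}

print(changed)
print(max(unchanged), ord(max(unchanged)))
print(ord("`") - 33, ord("`") - 64)
```

Only 'a', 'c', 'g' and 'h' changed, all of them to '`' (ASCII 96), and every character at or below ']' is identical. That pattern would fit the qualities being capped at ASCII 96, which is 63 with an offset of 33 or 32 with an offset of 64, but the thread does not confirm which reading applies, so treat that as a guess.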
All times are GMT -8.