
 06-17-2008, 06:48 AM #1 baohua100 Senior Member   Location: Canada Join Date: Jun 2008 Posts: 103
Questions about Solexa quality scores

reads.fq file:

@4:1:518:715
GATACCATAAAAGCTGGATCCTTCTTCAAGCATAA
+4:1:518:715
hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdhhP

1. How do I convert a character (like 'e' or 'h') to a quality score?
2. What does this score mean, and what formula is used to compute it?
 06-17-2008, 07:26 AM #2 Farhat Member   Location: Pune, India Join Date: Apr 2008 Posts: 21
For a fastq file, if the quality character is $q, the corresponding Phred quality can be calculated with the following Perl code:

$Q = ord($q) - 33;
 06-17-2008, 08:04 AM #3 SoupDragon Junior Member   Location: UK Join Date: Jun 2008 Posts: 1
This is correct if you are using quality scores encoded in "fastq" format. I believe the Illumina pipeline uses a different ASCII offset (64), according to their pipeline documentation. A value of zero = ASCII 64 ('@'). The ASCII value for a qv is therefore qv+64. So "h" = 104 - 64 = 40.
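
The two offsets discussed above can be compared side by side. The following is a minimal Python sketch, not part of any pipeline; the function name `decode_qualities` is made up for illustration, and the quality line is the one from the first post.

```python
def decode_qualities(quality_string, offset):
    """Convert each quality character to an integer by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in quality_string]


# Quality line from the read in post #1.
quality_line = "hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdhhP"

solexa_scores = decode_qualities(quality_line, 64)  # offset described for the Illumina pipeline
sanger_scores = decode_qualities(quality_line, 33)  # offset used by the Perl one-liner above

print(solexa_scores[:3])  # 'h' is ASCII 104, so 104 - 64 = 40
print(sanger_scores[:3])  # 104 - 33 = 71, which is why offset 33 looks wrong for this file
```

If the decoded values fall far above 40, that is a hint the file probably uses the 64 offset rather than 33.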
06-17-2008, 08:10 AM   #5
Farhat
Member

Location: Pune, India

Join Date: Apr 2008
Posts: 21

Quote:
 Originally Posted by SoupDragon This is correct if you are using quality scores encoded in "fastq" format. I believe the Illumina pipeline uses a different ASCII offset (64), according to their pipeline documentation. A value of zero = ASCII 64 ('@'). The ASCII value for a qv is therefore qv+64. So "h" = 104 - 64 = 40.
You are right. With an offset of 33, 'h' would give 104 - 33 = 71, which is way beyond 40.

 06-17-2008, 06:02 PM #6 baohua100 Senior Member   Location: Canada Join Date: Jun 2008 Posts: 103
Thanks. What is the range of this score (0 to 40?), and what does the score mean?
 06-17-2008, 06:07 PM #7 sparks Senior Member   Location: Kuala Lumpur, Malaysia Join Date: Mar 2008 Posts: 126
Solexa Quality Score

The range is from -5 to 40. If P is the probability of the called base, then the Solexa quality is 10 log10(P/(1-P)). A quality of -5 corresponds to P=0.25.
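
As a quick check of the formula above, here is a small Python sketch (the function name `solexa_quality` is only illustrative). It suggests that -5 is the rounded value for P=0.25, since the exact result is about -4.77.

```python
import math


def solexa_quality(p_correct):
    """Solexa quality 10*log10(P/(1-P)) for a base called with probability P (0 < P < 1)."""
    return 10 * math.log10(p_correct / (1 - p_correct))


for p in (0.25, 0.5, 0.9999):
    print(f"P = {p}: Q = {solexa_quality(p):.2f}")
```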
06-18-2008, 08:36 AM   #8
Farhat
Member

Location: Pune, India

Join Date: Apr 2008
Posts: 21

Quote:
 Originally Posted by sparks The range is from -5 to 40. If P is the probability of the called base, then the Solexa quality is 10 log10(P/(1-P)). A quality of -5 corresponds to P=0.25.
In my datasets the range has been from -40 to 40.

 06-18-2008, 06:37 PM #9 sparks Senior Member   Location: Kuala Lumpur, Malaysia Join Date: Mar 2008 Posts: 126
Quality Score Range

Farhat's right for the Solexa prb file format from the base caller, but for fastq format files, which the OP asked about, the range should be -5 to 40.
06-19-2008, 11:15 AM   #10
Farhat
Member

Location: Pune, India

Join Date: Apr 2008
Posts: 21

Quote:
 Originally Posted by sparks Farhat's right for the Solexa prb file format from the base caller, but for fastq format files, which the OP asked about, the range should be -5 to 40.
Yes, that's right, because for solexa PRB file the probability of A,C,G or T is given separately, and can be really low, whereas for fastq the lowest probability is 0.25 implying equal probability for any nucleotide.
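
The reasoning above can be illustrated numerically. This is a hedged sketch that works on per-base probabilities directly rather than parsing a real prb file; the function name `solexa_scores` and the example probabilities are made up for illustration. Because the four probabilities sum to 1, the largest is at least 0.25, so the score of the called base cannot drop below about -4.77, while the scores of the other three bases can be far lower.

```python
import math


def solexa_scores(probs):
    """Solexa score for each of the four per-base probabilities (A, C, G, T)."""
    return [10 * math.log10(p / (1 - p)) for p in probs]


# Worst case for the called base: all four bases equally likely.
uniform = solexa_scores([0.25, 0.25, 0.25, 0.25])

# A confident call: the three non-called bases get very low scores.
confident = solexa_scores([0.9997, 0.0001, 0.0001, 0.0001])

print(f"lowest possible best-base score: {max(uniform):.2f}")
print(f"non-called base score in a confident call: {min(confident):.2f}")
```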

 07-19-2008, 12:29 AM #11 baohua100 Senior Member   Location: Canada Join Date: Jun 2008 Posts: 103
$sQ = -10 * log10($e / (1 - $e))
when $sQ = 40, $e = 0.0001
when $sQ = 0, $e = 0.5 (0.5 > 0.25)
when $sQ = -4, $e = 0.72
What's the probability of error?
 07-23-2008, 01:25 AM #12 sparks Senior Member   Location: Kuala Lumpur, Malaysia Join Date: Mar 2008 Posts: 126
If you are talking about fastq format and have a quality of -4, then the probability of the called base is 0.28 and the probability that it is any one of the other 3 bases is 0.72. If you see a -4 in a prb format file, then the probability of that base is 0.28 and the other bases will each have their own prb/qual value.
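
The numbers in this exchange follow from inverting Q = 10 log10(P/(1-P)). A minimal Python sketch, with an illustrative function name `base_probability`:

```python
def base_probability(solexa_q):
    """Invert Q = 10*log10(P/(1-P)) to recover P, the probability that the called base is right."""
    odds = 10 ** (solexa_q / 10)  # P / (1 - P)
    return odds / (1 + odds)


for q in (-4, 0, 40):
    p = base_probability(q)
    print(f"Q = {q}: P(base) = {p:.4f}, P(error) = {1 - p:.4f}")
```

For Q = -4 this gives P(base) of roughly 0.28 and P(error) of roughly 0.72, matching the values quoted above.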
 09-22-2008, 03:05 PM #13 mikertesz Junior Member   Location: Israel Join Date: Sep 2008 Posts: 1
"quala" files

A Solexa run generated a "quala" file in the following format:

>sequence_0
40 40 40 19 7 40 40 40 40 40 31 40 40 40 40 40 40 40 40 40 40 11 40 40 40 36 40 12 40 21 39 1 4 40 40 15 40 40 4 40 40 10 40 40 40 40 40 2 4 10 1
>sequence_1
40 40 8 13 12 40 40 40 40 17 27 40 25 17 4 40 40 40 21 40 40 37 40 40 37 4 40 33 40 25 40 3 20 40 40 20 40 40 4 40 8 7 40 40 15 4 10 1 5 20 1
etc...

Does anybody know what those numbers mean? Are they simply the Solexa quality scores per base-pair? The range seems to be 1-40. Why isn't it -5 to 40 as in fastq?
 10-07-2008, 09:57 AM #14 vruotti Member   Location: US Join Date: Feb 2008 Posts: 13
Fastq file outside of GERALD

Hi,

Does anyone know an easy way, or an existing program, to convert all the .prb files from one particular lane into one fastq file? Similar to the s_1_sequence.txt file, but with no filters applied? We have tried hacking around the Perl scripts within GERALD, but it looks like you need an intermediate seqpre.tmp file, which I think gets deleted after GERALD completes.

We know this is possible by just running GERALD with the fastq parameter. However, we would like to generate a fastq file that is not affected by GERALD's filters. That way we can set up our own quality filters.

Any ideas? Do I go ahead and write one?

Thanks,
Victor
 10-07-2008, 10:11 AM #15 swbarnes2 Senior Member   Location: San Diego Join Date: May 2008 Posts: 912
I made my own very simple script, but there is a script of James Bonfield's here: http://seqanswers.com/forums/showthread.php?t=282

The only problem that I see is this line:

foreach (glob("$fn/*seq.txt")) {

which is going to get every single .seq in the directory, not just the ones from a single lane. So you'll have to fix that.
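
One way to restrict the file list to a single lane is to tighten the glob pattern. The sketch below is in Python rather than Perl and is not taken from the linked script; it assumes the tile files are named s_<lane>_<tile>_seq.txt, which may not match every pipeline version, and the function name `lane_seq_files` is hypothetical.

```python
import glob
import os


def lane_seq_files(run_dir, lane):
    """Return the sorted *_seq.txt paths for one lane only."""
    # Assumption: tile files are named s_<lane>_<tile>_seq.txt.
    # Adjust the pattern if your files are named differently.
    pattern = os.path.join(run_dir, f"s_{lane}_*_seq.txt")
    return sorted(glob.glob(pattern))


# Example: list the lane 1 files in the current directory (may be empty).
print(lane_seq_files(".", 1))
```

The same idea should carry over to the Perl glob by putting the lane number into the pattern instead of matching every *seq.txt file.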
 10-07-2008, 10:38 AM #16 kmcarr Senior Member   Location: USA, Midwest Join Date: May 2008 Posts: 1,178 Victor, Run GERALD including the following line in the GERALD configuration file: QF_PARAMS '(1==1)' This is a conditional which is true 100% of the time; in other words, GERALD passes every read. (This technique comes from the Pipeline User Guide) Last edited by kmcarr; 10-07-2008 at 10:44 AM. Reason: To correct line and add attribution.
 10-10-2008, 09:05 AM #17 ShaunMahony Member   Location: University Park, PA Join Date: Apr 2008 Posts: 27
As said above (and also in the Solexa documentation), Solexa quality scores in their fastq-like format are given by 10*log_10(P/(1-P)). I thought it might be useful for some people if I posted a lookup table based on this. Note that I'm giving the probability that a base is erroneous, rounded to four decimal places. Please post a reply if you think this table is an incorrect translation:

Char  ASCII  Char-64  P(error)
;     59     -5       0.7597
<     60     -4       0.7153
=     61     -3       0.6661
>     62     -2       0.6131
?     63     -1       0.5573
@     64     0        0.5000
A     65     1        0.4427
B     66     2        0.3869
C     67     3        0.3339
D     68     4        0.2847
E     69     5        0.2403
F     70     6        0.2008
G     71     7        0.1663
H     72     8        0.1368
I     73     9        0.1118
J     74     10       0.0909
K     75     11       0.0736
L     76     12       0.0594
M     77     13       0.0477
N     78     14       0.0383
O     79     15       0.0307
P     80     16       0.0245
Q     81     17       0.0196
R     82     18       0.0156
S     83     19       0.0124
T     84     20       0.0099
U     85     21       0.0079
V     86     22       0.0063
W     87     23       0.0050
X     88     24       0.0040
Y     89     25       0.0032
Z     90     26       0.0025
[     91     27       0.0020
\     92     28       0.0016
]     93     29       0.0013
^     94     30       0.0010
_     95     31       0.0008
`     96     32       0.0006
a     97     33       0.0005
b     98     34       0.0004
c     99     35       0.0003
d     100    36       0.0003
e     101    37       0.0002
f     102    38       0.0002
g     103    39       0.0001
h     104    40       0.0001
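
For anyone who wants to check the table, a few rows can be regenerated with a short Python sketch. The function name `error_probability` is illustrative, and the offset of 64 is the one discussed earlier in this thread. Since P(error) = 1 - P and P/(1-P) = 10^(Q/10), it follows that P(error) = 1/(1 + 10^(Q/10)).

```python
def error_probability(quality_char, offset=64):
    """P(error) for one quality character in the Solexa fastq-like encoding."""
    q = ord(quality_char) - offset
    return 1 / (1 + 10 ** (q / 10))


# Print a few rows in the same column order as the table: Char, ASCII, Char-64, P(error).
for ch in ";@ETh":
    print(ch, ord(ch), ord(ch) - 64, f"{error_probability(ch):.4f}")
```

The rows printed for ';', '@', 'E', 'T' and 'h' should agree with the corresponding rows of the table.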
 10-10-2008, 12:19 PM #18 vruotti Member   Location: US Join Date: Feb 2008 Posts: 13
More on Quality

Hello,

We are looking a little closer at the quality of one of our runs. Interestingly, we see a pattern in most of our runs right at the 30th cycle. The information in the graph below comes from the s_N_export.txt files. Please ignore the graph for lane 4; this was a failed lane. The others, however, including our control (lane 8), show this pattern. This was an IPAR run with the upgraded GAII and was one of our best runs. Other runs also show this pattern at the 30th cycle.

Does anyone know the reason why the qualities drop so much after the 30th cycle? Have you seen this before in any of your runs?

Thanks in advance.
Victor

Last edited by vruotti; 10-10-2008 at 02:52 PM.
11-20-2008, 01:29 PM   #19
TylerBackman
Member

Location: Riverside, CA

Join Date: Oct 2008
Posts: 13

Quote:
 Originally Posted by vruotti Does anyone know the reason why the qualities drop so much after the 30th cycle?
This is most likely because the qualities you are looking at are alignment normalized, and a large number of your sequences failed to align to the reference genome (due to a ligated adapter, etc.)

Take a look at the un-normalized scores (s_<lane>_qraw.txt) instead, I think you'll find that the curve is more continuous between cycles.

 02-18-2009, 08:29 AM #20 baohua100 Senior Member   Location: Canada Join Date: Jun 2008 Posts: 103
fastq file:

@I326_2_FC306FCAAXX:8:1:50:985
ATGTCCGAAGGGCAGTCTCAAGTGGTAAAATGGAT
+I326_2_FC306FCAAXX:8:1:50:985
hhhWhhhchhhhhahShh\PO]LgXZXPNLUTZNO

MAQ alignment output:

I326_2_FC306FCAAXX:8:1:50:985 1 1 + 0 0 99 99 99 0 0 1 0 35 ATGTCCGAAGGGCAGTCTCAAGTGGTAAAATGGAT ```W```````````S``\PO]L`XZXPNLUTZNO

What is the meaning of ```W```````````S``\PO]L`XZXPNLUTZNO? It is not the same as hhhWhhhchhhhhahShh\PO]LgXZXPNLUTZNO.


All times are GMT -8.