SEQanswers

Old 04-22-2008, 05:42 AM   #1
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52
Question Quality score threshold?

Hi everyone,

This is my first post, so please be gentle.

I'm working on some Solexa data for a collaborator and have noticed that the quality (as determined by matches to the genome) of the sequence reads drops off very quickly beyond ~25nt.

Now that I have the actual Solexa read quality scores, what kind of cut-offs do people use for throwing out 'junk' reads? Some informal discussions have suggested scores of 30 and above.
Any thoughts?
Thanks,

Chris.
Old 04-23-2008, 10:14 AM   #2
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324

I am interested in this also. Have you compared quality scores for the misaligned bases at different positions in the reads?

I guess which sequences to use depends on the application: if you are only counting aligned positions (ChIP-seq, transcriptome, etc.), it doesn't matter if the last part is crap as long as the alignment is correct.
Old 04-23-2008, 11:03 AM   #3
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

We often run multiple Eland alignments, using read lengths of 32, 29, 26 and 23 (or some variation of the above), and then identify the longest read where we get a unique match. This somewhat ignores your question.

As Chipper pointed out, this is really a strategy that's applicable for ChIP-seq or possibly transcriptome data. I wouldn't apply it to all analyses.

There's a formula for converting the quality score to a probability, though, which you could use to figure out what probability you're comfortable with.

I believe it is P = 1 / (1 + 10^(-Q/10))

You might want to confirm that, however, before you do anything with it. (I only obtained the equation second hand.)
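
A minimal Python sketch of that conversion, assuming Q is a Solexa-scaled (log-odds) score. Under that assumption the formula above gives the probability that the base call is correct, and the mirrored form gives the error probability. The function names are illustrative and not taken from any pipeline.

```python
def solexa_to_p_correct(q):
    """Probability that a base call is correct, for a Solexa-scaled score q."""
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


def solexa_to_p_error(q):
    """Complementary probability that the base call is wrong."""
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


# Print a few scores across the range discussed in this thread.
for q in (-5, 0, 10, 20, 30, 40):
    print(q, round(solexa_to_p_correct(q), 4), round(solexa_to_p_error(q), 4))
```

A score of 0 corresponds to even odds (P = 0.5), which is one quick way to check that a set of scores really is on this scale.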
__________________
The more you know, the more you know you don't know. —Aristotle
Old 04-24-2008, 12:31 AM   #4
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52

Thanks for the replies.

Chipper: no, I haven't looked at the scores for misaligned bases, as in our analysis so far we only looked at two mismatches. It appears that the poor quality is concentrated at the 3' end, over up to 6nt, so our analysis doesn't find matches with these.

apfejes: is the formula you mean the same as the one for converting Solexa scores to Phred scores, as shown here: http://maq.sourceforge.net/fastq.shtml ?
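
For reference, a small sketch of the Solexa-to-Phred conversion, assuming the usual relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1), which is worth checking against the linked page before relying on it. The function names are illustrative.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to a Phred-scaled score (assumed relation)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_p_error(q_phred):
    """Error probability implied by a Phred-scaled score."""
    return 10.0 ** (-q_phred / 10.0)


# Show the converted score and implied error probability for a few inputs.
for q in (-5, 0, 10, 20, 40):
    phred = solexa_to_phred(q)
    print(q, round(phred, 2), round(phred_to_p_error(phred), 4))
```

The two scales differ noticeably only at low scores; at the high end they are nearly identical.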

I think the best course of action will be to test various cut-offs and see what I get. I'll post back here if I get anything useful.
Cheers.
Old 04-24-2008, 08:06 AM   #5
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236

chris,

The formula on that page is related, but not identical. (Obviously, since they're both performing very similar transformations.) However, I was referring to the older-format prb files, which contain values between -40 and 40, whereas the version you've indicated is used in the new Eland pipeline. (I can't recall version numbers for them offhand.)

If your probabilities are displayed in a format consistent with what's on that page, however, then it's most likely the correct format to use. If you are using the old-style prb files, where each base is represented by four numbers, then the version I've written above is more likely to be correct.

Cheers,
Anthony
Old 04-25-2008, 12:14 AM   #6
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52

Hi Anthony,

The scores I have range from -5 to 40 which I believe is the current Solexa Genome Analyser quality score range, so I guess I'll stick with the 'new' formula.
Thanks,

Chris.
Old 04-28-2008, 07:52 AM   #7
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52

Right. A quick looksee of the raw Solexa quality scores at a variety of cut-offs gives:

Code:
Q Cut-off    Frequency
0            584079
5            641244
10           406655
15           179174
20            63783
25            20300
30             6389
35             3454
The frequency counts are for the number of sequence reads whose quality scores are *all* above the cut-off. Each sequence is only counted once and binned at the highest cut-off which it satisfies.
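
The binning described above can be sketched roughly as follows. This is an illustration under assumptions, not the script used for the table: "above the cut-off" is treated as "greater than or equal to", reads whose worst score is below the lowest cut-off are left uncounted, and the example reads are made up.

```python
from collections import Counter

CUTOFFS = (0, 5, 10, 15, 20, 25, 30, 35)


def highest_cutoff(scores, cutoffs=CUTOFFS):
    """Return the highest cut-off met by every score in the read, or None."""
    worst = min(scores)
    passed = [c for c in cutoffs if worst >= c]
    return max(passed) if passed else None


def bin_reads(reads, cutoffs=CUTOFFS):
    """Count each read once, in the bin of the highest cut-off it satisfies."""
    counts = Counter()
    for scores in reads:
        c = highest_cutoff(scores, cutoffs)
        if c is not None:
            counts[c] += 1
    return counts


# Hypothetical reads, each a list of per-base Solexa scores.
example_reads = [
    [40, 40, 26, 21],
    [12, 40, 40, 40],
    [40, 3, 40, 40],
    [-5, 40, 40, 40],
]
counts = bin_reads(example_reads)
for c in CUTOFFS:
    print(c, counts[c])
```

Because the bin is set by the single worst base, one poor call at the 3' end is enough to push an otherwise good read into a low bin.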

I'm a bit worried, as the majority of the data has a quality score of <10. A Solexa score of 10 is equivalent to a Phred score of 10.4, or roughly 90% accuracy.

Does anyone else get this kind of quality or is this really a bad run?
Old 04-28-2008, 09:01 AM   #8
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Chris,
Is this data from all 8 lanes? Did you convert the prb values back from Solexa to Q values? By *all*, do you mean the entire 36bp read?
Let me know and I can get similar quality scores for the data.
Old 04-29-2008, 12:43 AM   #9
chris
Member
 
Location: Dundee, Scotland

Join Date: Apr 2008
Posts: 52

I'm not exactly sure. This data is kind of second hand and from a file called 's_3_sequence.txt'. There are 2.2M reads in the form:
Code:
HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 33 40 40 40 31 29 40 11 38 7 35 22
And these are all 33bp reads.
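
A line in that form can be split into its parts with a few lines of Python. This is a sketch that assumes the last two colon-separated fields are always the sequence and the space-separated scores; the reading of the leading fields as instrument, lane, tile and coordinates is a guess and is not used by the code.

```python
def parse_read(line):
    """Split one record into (read_id, sequence, list of integer scores)."""
    # Split from the right so colons inside the read identifier are kept intact.
    read_id, seq, quals = line.strip().rsplit(":", 2)
    scores = [int(q) for q in quals.split()]
    if len(scores) != len(seq):
        raise ValueError("expected one quality score per base")
    return read_id, seq, scores


example = (
    "HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:"
    "23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 "
    "33 40 40 40 31 29 40 11 38 7 35 22"
)
read_id, seq, scores = parse_read(example)
print(read_id, len(seq), min(scores))
```

The minimum score per read is the value that matters for an "all bases above the cut-off" filter.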
Thanks for your help bioinfosm