SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Ideas on collecting quality scores per base in an illumina fastq file brachysclereid Bioinformatics 11 12-05-2011 02:00 PM
Illumina quality scores dlepp Illumina/Solexa 6 03-01-2011 12:09 AM
Illumina 1.3 v 1.8 quality scores Graham Etherington Bioinformatics 1 10-18-2010 08:00 AM
Sanger FASTQ Quality Scores upper Bioinformatics 2 05-03-2010 08:20 PM
fastq quality scores bioxyz Bioinformatics 2 11-25-2009 04:28 PM

Reply
 
Thread Tools
Old 04-14-2010, 10:33 AM   #1
Bio.X2Y
Member
 
Location: Europe

Join Date: Apr 2010
Posts: 46
Default Illumina FASTQ Quality Scores - Missing Value

Hi there,

I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3). The data is for a series of paired-end RNA-Seq runs, containing >100,000,000 reads in total.

I'm just trying to get a feel for the information at the moment, and one of the first things I've noticed is that the quality scores are not what I expect.

As I understand it (from the Wikipedia article on FastQ created by Torst), version 1.3+ of the GA pipeline encodes Phred quality scores from 0-62 using ASCII 64-126.

Our files use the characters 66-98, which implies that all our bases have Phred qualities in the range 2 to 34 (inclusive). Also, none of the bases have a quality character of 67, implying no base has a Phred quality of 3 (even though 100,000s of bases have qualities 2 and 4-34).

I'd appreciate it if someone could help answer the following:

(1) does it seem reasonable that our qualities are being capped at 34? I notice a previous post has comments from maubp/kmcarr pointing out that the maximum scores might be capped at 34 by a particular version of Bustard (http://seqanswers.com/forums/showthread.php?t=4679) - perhaps this is what's happening?

(2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?

(3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?

I'm very new to this area, so I'm not sure what additional information would be helpful here. I should be able to get additional details on request,

Thanks for your time!
Bio.X2Y is offline   Reply With Quote
Old 04-14-2010, 01:31 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 990
Default

Quote:
Originally Posted by Bio.X2Y View Post
Hi there,

I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3). The data is for a series of paired-end RNA-Seq runs, containing >100,000,000 reads in total.

I'm just trying to get a feel for the information at the moment, and one of the first things I've noticed is that the quality scores are not what I expect.

As I understand it (from the Wikipedia article on FastQ created by Torst), version 1.3+ of the GA pipeline encodes Phred quality scores from 0-62 using ASCII 64-126.

Our files use the characters 66-98, which implies that all our bases have Phred qualities in the range 2 to 34 (inclusive). Also, none of the bases have a quality character of 67, implying no base has a Phred quality of 3 (even though 100,000s of bases have qualities 2 and 4-34).

I'd appreciate it if someone could help answer the following:

(1) does it seem reasonable that our qualities are being capped at 34? I notice a previous post has comments from maubp/kmcarr pointing out that the maximum scores might be capped at 34 by a particular version of Bustard (http://seqanswers.com/forums/showthread.php?t=4679) - perhaps this is what's happening?
I guess I'll jump in here too. I don't know in which version of the pipeline Illumina started capping the Q score at 34. I mentioned v 1.6 because in that version of RTA they added a Q score plot to their status displays making the upper bound quite obvious. Since they have been using the same Q score scheme since v 1.3 I wouldn't be surprised if it was bounded there.

Quote:
(2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?
I also see a lower limit of 2 in my data. Again, I'm guessing it is some bound imposed by their algorithm.

Quote:
(3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?
I have seen this as well. I assumed it was some artifact of their algorithm which causes a discontinuity. In other words, you're not alone so you probably shouldn't worry about it too much.

Please note that I have no real information to base any of these statements on. It's purely speculation.
kmcarr is offline   Reply With Quote
Old 04-24-2010, 09:21 PM   #3
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Default

Quote:
Originally Posted by Bio.X2Y View Post
I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3).
(2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?
(3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?
Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

"The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."
Torst is offline   Reply With Quote
Old 04-29-2010, 05:26 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

That's a very useful bit of information - thanks Torst!

Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+ but on the wikipedia page said 1.5+
maubp is offline   Reply With Quote
Old 04-29-2010, 05:19 PM   #5
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Default

Quote:
Originally Posted by maubp View Post
That's a very useful bit of information - thanks Torst! Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+ but on the wikipedia page said 1.5+
I'm the one who actually wrote the original Wiki page for FastQ and I added the BBBB stuff a couple of days ago. I am actually not sure now when it was added - I wrote 1.3+ here, but I now think it was either 1.4 or 1.5, so I wrote 1.5+ in Wikipedia. Another person came and adjusted the page after me and didn't change the 1.5, so hopefully they agreed with my guesstimate.

As "B" is very poor quality anyway, if you ever see an exact run of BBBBB at the 3' end of a read, it is very likely to be the read quality indicator, so should be trimmed. In older versions of the pipeline, you would be unlikely to see such clear runs of BBBBB anyway - it would be a mix of A,B,C and so on.
Torst is offline   Reply With Quote
Old 04-30-2010, 12:50 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

Quote:
Originally Posted by Torst View Post
I'm the one who actually wrote the original Wiki page for FastQ and I added the BBBB stuff a couple of days ago. I am actually not sure now when it was added - I wrote 1.3+ here, but I now think it was either 1.4 or 1.5, so I wrote 1.5+ in Wikipedia.
My question was perhaps unclear - I checked the history and saw you'd put 1.5+ on the wiki. But as you point out, either way people will probably trim off reads at Q2 level anyway.

Here's a quick Biopython snippet to do this trimming:
http://news.open-bio.org/news/2010/0...q2-trim-fastq/
maubp is offline   Reply With Quote
Old 05-03-2010, 01:23 AM   #7
Hena
Member
 
Location: Finland

Join Date: Nov 2009
Posts: 19
Default

Quote:
Originally Posted by Torst View Post
Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

"The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."
What does that "never be an isolated Q2 basecall" really mean? I have this quality score

^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

Also how the region is marked? It seems that biopython example removes everything from B onwards. I have several reads which have B ending reads. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could (assuming that there is segment in a middle of read) split the read into two though. And how should I handle the read with quality score above?

Last edited by Hena; 05-03-2010 at 01:30 AM.
Hena is offline   Reply With Quote
Old 05-03-2010, 10:06 AM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

Quote:
Originally Posted by Hena View Post
What does that "never be an isolated Q2 basecall" really mean? I have this quality score

^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab
That looks like an isolated B character.
Quote:
Originally Posted by Hena View Post

Also how the region is marked? It seems that biopython example removes everything from B onwards.
well yes, that's what I wrote the script to, and what I claimed it would do.

What version of the Illumnia pipeline did your data come from?
maubp is offline   Reply With Quote
Old 05-03-2010, 06:25 PM   #9
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Default

Quote:
Originally Posted by Hena View Post
What does that "never be an isolated Q2 basecall" really mean? I have this quality score

^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

Also how the region is marked? It seems that biopython example removes everything from B onwards. I have several reads which have B ending reads. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could (assuming that there is segment in a middle of read) split the read into two though. And how should I handle the read with quality score above?
The script provided is not doing the correct thing. You can have an isolated B, but ONLY as the LAST base in the read. Any RUN of Bs starting at the END of the read and working backwards are the bad ones.
Torst is offline   Reply With Quote
Old 05-04-2010, 01:27 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

Quote:
Originally Posted by Torst View Post
The script provided is not doing the correct thing. You can have an isolated B, but ONLY as the LAST base in the read. Any RUN of Bs starting at the END of the read and working backwards are the bad ones.
So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me - maybe as I suggested this data is from an older Illumina pipeline before the new use of B as a flag? Afterall the slide did say there would not be any isolated B characters (granted the final base likely be possible).

Last edited by maubp; 05-04-2010 at 01:38 AM. Reason: Clarity
maubp is offline   Reply With Quote
Old 05-04-2010, 01:36 AM   #11
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Default

Quote:
Originally Posted by maubp View Post
So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me.
Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0,1,2 are no longer used. B=2 is only used at the END of reads in a contiguous fashion backwards. I suspect the disallowed B as a quality value now, so you can tell the difference between older pipleline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".
Torst is offline   Reply With Quote
Old 05-04-2010, 03:02 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

Quote:
Originally Posted by Torst View Post
Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0,1,2 are no longer used. B=2 is only used at the END of reads in a contiguous fashion backwards. I suspect the disallowed B as a quality value now, so you can tell the difference between older pipleline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".
That makes sense.

Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus using "B trimming" as implemented in my script on it isn't a good idea. Some sort of window based average quality threshold filtering would be more sensible.

However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?

(P.S. Apologies if my recent posts were too terse, typing on a mobile device is a pain - I'm back at a full keyboard now)

Last edited by maubp; 05-04-2010 at 03:04 AM.
maubp is offline   Reply With Quote
Old 05-04-2010, 08:33 PM   #13
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Default

maubp,

Quote:
Originally Posted by maubp View Post
That makes sense. Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus using "B trimming" as implemented in my script on it isn't a good idea. Some sort of window based average quality threshold filtering would be more sensible.
Yes, for reads from older pipelines that don't use the RSQC, your script is not useful.

Quote:
However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?
Correct - for data from recent pipelines that do use the RSQC, your script will do the right thing. But if you modify it to only work if the Bs are anchored to the end of the read, it will still do the same thing for new reads (as no isolated Bs) and won't do too much damage if used on older reads (the Bs, which are awful quality 2 anyway, would have to be at the end). I always prefer to program defensively.
Torst is offline   Reply With Quote
Old 05-05-2010, 12:49 AM   #14
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

Quote:
Originally Posted by Torst View Post
Correct - for data from recent pipelines that do use the RSQC, your script will do the right thing. But if you modify it to only work if the Bs are anchored to the end of the read, it will still do the same thing for new reads (as no isolated Bs) and won't do too much damage if used on older reads (the Bs, which are awful quality 2 anyway, would have to be at the end). I always prefer to program defensively.
I quite agree about defensive programming, and have worked out a neat way to do this alternative (trim trailing B characters only). I'll update the blog post.

Thanks for your advice Torst.

Peter
maubp is offline   Reply With Quote
Old 05-05-2010, 01:12 AM   #15
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 138
Default

Do they still emit varying quality values for N bases?

That always confused me. Most were 4 I think, but we'd occasionally see N with quality all the way up to 10. I can only assume they change bases to N at some stage, but don't do anything with the Q value. It seemed broken at the time anyway, but maybe it's a bit saner now.
jkbonfield is offline   Reply With Quote
Old 05-05-2010, 06:55 PM   #16
Torst
Senior Member
 
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 274
Default

Quote:
Originally Posted by jkbonfield View Post
Do they still emit varying quality values for N bases? That always confused me. Most were 4 I think, but we'd occasionally see N with quality all the way up to 10. I can only assume they change bases to N at some stage, but don't do anything with the Q value. It seemed broken at the time anyway, but maybe it's a bit saner now.
I agree with your experience. I never understood what an "N" of quality "10" actually meant; Q10 means probability of error is 10% so does that mean there is a 90% chance it is not an "N" ? :-)

I just wrote a quick Perl script to check how N is being qualitied on a recent Pipeline 1.6 for the first 2M reads of a random fastq file from the run (QVALUE => FREQUENCY):

'6' => 7,
'11' => 22,
'7' => 57,
'9' => 80,
'12' => 18,
'2' => 281517,
'15' => 1,
'14' => 5,
'8' => 62,
'4' => 51799,
'13' => 3,
'10' => 23,
'5' => 72

As you can see, most are Q02, which is "B" and is part of the 'rejected section' of the read, so they can be ignored. Most of true Ns are Q4 ("D") as they were in your experience, however there are still smatterings of Ns with qualities all the way up to Q15 !

*sigh*
Torst is offline   Reply With Quote
Old 09-21-2010, 01:25 PM   #17
juan
Member
 
Location: New York City

Join Date: Aug 2009
Posts: 14
Default Quality score of -54(10)

Before mapping and before subtracting 64, I checked the distribution of quality scores for my reads (PIPELINE 1.6). I noticed what everyone mentioned here (quality scores starting at 66 - 64 = 2).

However, I also noticed thousands of quality scores of 10 - 64 = -54. I thought negative quality scores were "phased out" according to the Wiki? What are these? More importantly, do they say anything about run quality? One end of my paired-end run has more -54 quality bases in the second end for every lane, what does that mean?

Second question, do any of the current mapping programs (Bowtie, BWA, BFAST, SOAP, etc) automatically do end-clipping of "B" quality bases at ends of reads? I am guessing that the -54 scores are converted to zero.

Cheers,
Juan
juan is offline   Reply With Quote
Old 09-22-2010, 02:01 AM   #18
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,399
Default

Solexa's negative quality scores only went down to -5, so something else is going on.

Could you post a couple of reads with these funny quality scores? Wrap it in [ code ] and [ /code ] tags for display in the forum.
maubp is offline   Reply With Quote
Old 09-22-2010, 02:25 AM   #19
clive
Junior Member
 
Location: Oxford

Join Date: Sep 2010
Posts: 1
Unhappy

Quote:
Originally Posted by jkbonfield View Post
Do they still emit varying quality values for N bases?

That always confused me. Most were 4 I think, but we'd occasionally see N with quality all the way up to 10. I can only assume they change bases to N at some stage, but don't do anything with the Q value. It seemed broken at the time anyway, but maybe it's a bit saner now.

If I can understand what they have done here? -- they take low scoring bases and convert them to N (rather than calling the highest signal with a low score) ? -- when you align these reads are the N's counted as errors ?? or ignored ??
clive is offline   Reply With Quote
Old 09-22-2010, 06:31 AM   #20
juan
Member
 
Location: New York City

Join Date: Aug 2009
Posts: 14
Default

[QUOTE=maubp;25709]Solexa's negative quality scores only went down to -5, so something else is going on.

I figured it out. 10 is the ASCII code for newline. bug in code not bizarre quality score.
juan is offline   Reply With Quote
Reply

Tags
illumina, phred, quality, score

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:26 PM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.