#1 |
|
Member
Location: Europe Join Date: Apr 2010
Posts: 46
|
Hi there,
I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3). The data is for a series of paired-end RNA-Seq runs, containing >100,000,000 reads in total. I'm just trying to get a feel for the information at the moment, and one of the first things I've noticed is that the quality scores are not what I expect.

As I understand it (from the Wikipedia article on FASTQ created by Torst), version 1.3+ of the GA pipeline encodes Phred quality scores from 0-62 using ASCII 64-126. Our files use the characters 66-98, which implies that all our bases have Phred qualities in the range 2 to 34 (inclusive). Also, none of the bases have a quality character of 67, implying no base has a Phred quality of 3 (even though 100,000s of bases have qualities 2 and 4-34).

I'd appreciate it if someone could help answer the following:

(1) does it seem reasonable that our qualities are being capped at 34? I notice a previous post has comments from maubp/kmcarr pointing out that the maximum scores might be capped at 34 by a particular version of Bustard (http://seqanswers.com/forums/showthread.php?t=4679) - perhaps this is what's happening?

(2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?

(3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?

I'm very new to this area, so I'm not sure what additional information would be helpful here. I should be able to get additional details on request.

Thanks for your time!
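For anyone wanting to reproduce this kind of check, here is a minimal sketch of the arithmetic described above: subtract 64 from each quality character and tally the resulting Phred scores. The function names and the sample quality lines are made up for illustration, and the offset of 64 is the Illumina 1.3+ convention as stated in the post, not something detected from the file.

```python
from collections import Counter

ILLUMINA_OFFSET = 64  # assumed: Illumina 1.3+ encoding, ASCII 64 means Phred 0


def phred_scores(quality_line, offset=ILLUMINA_OFFSET):
    """Convert one FASTQ quality line into a list of integer Phred scores."""
    # Strip the line ending first so it is not decoded as a quality character.
    return [ord(ch) - offset for ch in quality_line.rstrip("\r\n")]


def score_histogram(quality_lines, offset=ILLUMINA_OFFSET):
    """Tally how often each Phred score occurs across many quality lines."""
    counts = Counter()
    for line in quality_lines:
        counts.update(phred_scores(line, offset))
    return counts


# Hypothetical quality lines, not taken from any real run.
hist = score_histogram(["BBDEhb\n", "abB"])
for score in sorted(hist):
    print(score, hist[score])
```

A score that never appears (such as 3 in the files described here) simply has a count of zero in the resulting `Counter`.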
|
|
|
|
|
#2 | |||
|
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 802
|
Quote:
Quote:
Quote:
Please note that I have no real information to base any of these statements on. It's purely speculation. |
|||
|
|
|
|
|
#3 | |
|
Senior Member
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA Join Date: Apr 2008
Posts: 253
|
Quote:
"The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads. A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."
|
|
|
|
|
|
#4 |
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
That's a very useful bit of information - thanks Torst!
Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+, but on the Wikipedia page you said 1.5+.
|
|
|
|
|
#5 | |
|
Senior Member
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA Join Date: Apr 2008
Posts: 253
|
Quote:
As "B" is very poor quality anyway, if you ever see an exact run of BBBBB at the 3' end of a read, it is very likely to be the read quality indicator, so should be trimmed. In older versions of the pipeline, you would be unlikely to see such clear runs of BBBBB anyway - it would be a mix of A,B,C and so on. |
|
|
|
|
|
|
#6 | |
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
Quote:
Here's a quick Biopython snippet to do this trimming: http://news.open-bio.org/news/2010/0...q2-trim-fastq/ |
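As a rough illustration of the trimming idea discussed above, here is a plain-Python sketch (it is not the linked Biopython script). It removes only the trailing run of "B" characters from the quality string, together with the matching bases, and leaves any "B" elsewhere in the read untouched. The function name `trim_trailing_b` and the example reads are hypothetical.

```python
def trim_trailing_b(sequence, quality, flag="B"):
    """Drop the trailing run of `flag` quality characters and the matching bases.

    Only the contiguous run at the 3' end is removed; a `flag` character
    earlier in the read is kept as an ordinary quality value.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    kept = len(quality.rstrip(flag))
    return sequence[:kept], quality[:kept]


# Made-up read for illustration only.
seq, qual = trim_trailing_b("ACGTACGT", "abbaBBBB")
print(seq, qual)
```

A read whose quality string is all "B" comes back empty, so a caller would probably want to discard reads that fall below some minimum length after trimming.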
|
|
|
|
|
|
#7 | |
|
Member
Location: Finland Join Date: Nov 2009
Posts: 18
|
Quote:
^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab
Also, how is the region marked? It seems that the Biopython example removes everything from B onwards. I have several reads that end in a run of B. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could split a read in two, though (assuming there is such a segment in the middle of a read). And how should I handle a read with a quality string like the one above?
Last edited by Hena; 05-03-2010 at 12:30 AM.
|
|
|
|
|
|
#8 | ||
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
Quote:
Quote:
What version of the Illumina pipeline did your data come from?
||
|
|
|
|
|
#9 | |
|
Senior Member
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA Join Date: Apr 2008
Posts: 253
|
Quote:
|
|
|
|
|
|
|
#10 |
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me - maybe, as I suggested, this data is from an older Illumina pipeline, before the new use of B as a flag? After all, the slide did say there would not be any isolated B characters (granted, a single B on the final base would presumably still be possible).
Last edited by maubp; 05-04-2010 at 12:38 AM. Reason: Clarity |
|
|
|
|
|
#11 |
|
Senior Member
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA Join Date: Apr 2008
Posts: 253
|
Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0, 1, 2 are no longer used. B=2 is only used at the END of reads, in a contiguous run working backwards from the last base. I suspect they disallowed B as a quality value now, so you can tell the difference between older pipeline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".
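If that is right, one quick sanity check on a file is to look for any "B" that sits outside the trailing run. The sketch below assumes exactly the rule described in this post (B only as a contiguous run at the 3' end); the function names and the quality strings are invented for illustration.

```python
def has_internal_flag(quality, flag="B"):
    """Return True if `flag` appears anywhere before the trailing run of `flag`."""
    return flag in quality.rstrip(flag)


def count_reads_with_internal_flag(quality_strings, flag="B"):
    """Count quality strings that contain `flag` outside the trailing run."""
    return sum(
        1 for q in quality_strings if has_internal_flag(q.rstrip("\r\n"), flag)
    )


# Made-up quality strings for illustration only.
examples = ["abbaBBBB", "abBbaabb", "abbaabba"]
internal = count_reads_with_internal_flag(examples)
print(internal, "of", len(examples), "reads have a B outside the trailing run")
```

A non-zero count would suggest the data predates the indicator (or that B is being used as an ordinary Q2), in which case trimming on B alone is probably not appropriate.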
|
|
|
|
|
|
#12 | |
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
Quote:
Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus that using "B trimming" as implemented in my script on it isn't a good idea? Some sort of window-based average quality threshold filtering would be more sensible.

However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?

(P.S. Apologies if my recent posts were too terse, typing on a mobile device is a pain - I'm back at a full keyboard now)
Last edited by maubp; 05-04-2010 at 02:04 AM.
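To make the window-based alternative concrete, here is one possible sketch. It slides a fixed-size window along the read and cuts at the start of the first window whose mean Phred score drops below a threshold. The window size of 5, the threshold of 20, the offset of 64 and the function name are all assumptions chosen for illustration, not values recommended anywhere in this thread.

```python
def window_trim_length(quality, window=5, threshold=20, offset=64):
    """Return how many leading bases to keep under a sliding-window mean filter.

    The read is cut at the start of the first window whose mean Phred score
    is below `threshold`. Reads shorter than one window are judged as a whole.
    """
    scores = [ord(ch) - offset for ch in quality]
    if len(scores) < window:
        if scores and sum(scores) / len(scores) >= threshold:
            return len(scores)
        return 0
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < threshold:
            return start
    return len(scores)


# Made-up quality string: eight Q40 bases followed by five Q2 bases.
keep = window_trim_length("hhhhhhhhBBBBB")
print(keep)
```

Because the cut is placed at the start of the failing window, a few good bases just before the bad region can be lost; cutting at the first low-scoring base inside that window would be a reasonable variation.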
|
|
|
|
|
|
#13 | ||
|
Senior Member
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA Join Date: Apr 2008
Posts: 253
|
maubp,
Quote:
Quote:
|
||
|
|
|
|
|
#14 | |
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
Quote:
Thanks for your advice Torst. Peter |
|
|
|
|
|
|
#15 |
|
Senior Member
Location: Cambridge, UK Join Date: Jul 2008
Posts: 126
|
Do they still emit varying quality values for N bases?
That always confused me. Most were 4 I think, but we'd occasionally see N with quality all the way up to 10. I can only assume they change bases to N at some stage, but don't do anything with the Q value. It seemed broken at the time anyway, but maybe it's a bit saner now. |
|
|
|
|
|
#16 | |
|
Senior Member
Location: Victorian Bioinformatics Consortium, Melbourne, AUSTRALIA Join Date: Apr 2008
Posts: 253
|
Quote:
I just wrote a quick Perl script to check how N is being scored on a recent Pipeline 1.6 run, using the first 2M reads of a random FASTQ file from the run (QVALUE => FREQUENCY):

'2' => 281517,
'4' => 51799,
'5' => 72,
'6' => 7,
'7' => 57,
'8' => 62,
'9' => 80,
'10' => 23,
'11' => 22,
'12' => 18,
'13' => 3,
'14' => 5,
'15' => 1

As you can see, most are Q02, which is "B" and is part of the 'rejected section' of the read, so they can be ignored. Most of the true Ns are Q4 ("D"), as they were in your experience; however, there are still smatterings of Ns with qualities all the way up to Q15! *sigh*
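The original Perl script was not posted, so the following is only a Python sketch of the same kind of tally: for every N base, record the Phred score at the same position. It assumes unwrapped four-line FASTQ records and an offset of 64; the function names and the two example reads are hypothetical.

```python
from collections import Counter


def fastq_records(lines):
    """Yield (sequence, quality) pairs, assuming unwrapped four-line records."""
    block = []
    for line in lines:
        block.append(line.rstrip("\r\n"))
        if len(block) == 4:
            yield block[1], block[3]
            block = []


def n_quality_counts(records, offset=64):
    """Count the Phred scores assigned to N bases across all records."""
    counts = Counter()
    for sequence, quality in records:
        for base, qchar in zip(sequence, quality):
            if base == "N":
                counts[ord(qchar) - offset] += 1
    return counts


# Made-up records for illustration only.
lines = ["@r1\n", "ACNN\n", "+\n", "hhDB\n", "@r2\n", "NA\n", "+\n", "Jh\n"]
counts = n_quality_counts(fastq_records(lines))
for q in sorted(counts):
    print(q, counts[q])
```

With a real file, passing the open file handle as `lines` should work the same way, since the handle iterates line by line.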
|
|
|
|
|
|
#17 |
|
Member
Location: New York City Join Date: Aug 2009
Posts: 14
|
Before mapping and before subtracting 64, I checked the distribution of quality scores for my reads (PIPELINE 1.6). I noticed what everyone mentioned here (quality scores starting at 66 - 64 = 2).
However, I also noticed thousands of quality scores of 10 - 64 = -54. I thought negative quality scores were "phased out" according to the Wiki? What are these? More importantly, do they say anything about run quality? In every lane, the second end of my paired-end run has more -54 quality bases than the first; what does that mean?

Second question: do any of the current mapping programs (Bowtie, BWA, BFAST, SOAP, etc.) automatically do end-clipping of "B" quality bases at the ends of reads? I am guessing that the -54 scores are converted to zero.

Cheers,
Juan
|
|
|
|
|
#18 |
|
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,171
|
Solexa's negative quality scores only went down to -5, so something else is going on.
Could you post a couple of reads with these funny quality scores? Wrap it in [ code ] and [ /code ] tags for display in the forum. |
|
|
|
|
|
#19 | |
|
Junior Member
Location: Oxford Join Date: Sep 2010
Posts: 1
|
Quote:
Can I check that I understand what they have done here? Do they take low-scoring bases and convert them to N (rather than calling the highest signal with a low score)? And when you align these reads, are the Ns counted as errors, or ignored?
|
|
|
|
|
|
#20 |
|
Member
Location: New York City Join Date: Aug 2009
Posts: 14
|
Quote:
Originally Posted by maubp
Solexa's negative quality scores only went down to -5, so something else is going on.

I figured it out: 10 is the ASCII code for newline. It was a bug in the code, not a bizarre quality score.
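For anyone who hits the same thing, the effect is easy to reproduce. The sketch below (with hypothetical function names and a made-up quality line) decodes a line once with its trailing newline still attached and once with it stripped; the newline character, ASCII 10, is what turns into 10 - 64 = -54.

```python
def scores_with_newline_bug(line, offset=64):
    """Decode every character, including the line ending (the bug)."""
    return [ord(ch) - offset for ch in line]


def scores_fixed(line, offset=64):
    """Strip the line ending before decoding the quality characters."""
    return [ord(ch) - offset for ch in line.rstrip("\r\n")]


line = "Bb\n"  # made-up quality line as read from a file
print(scores_with_newline_bug(line))  # the last value comes from the newline
print(scores_fixed(line))
```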
|
|
|
| Tags |
| illumina, phred, quality, score |