Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Illumina FASTQ Quality Scores - Missing Value

    Hi there,

    I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3). The data is for a series of paired-end RNA-Seq runs, containing >100,000,000 reads in total.

    I'm just trying to get a feel for the information at the moment, and one of the first things I've noticed is that the quality scores are not what I expect.

    As I understand it (from the Wikipedia article on FastQ created by Torst), version 1.3+ of the GA pipeline encodes Phred quality scores from 0-62 using ASCII 64-126.

    Our files use the characters 66-98, which implies that all our bases have Phred qualities in the range 2 to 34 (inclusive). Also, none of the bases have a quality character of 67, implying no base has a Phred quality of 3 (even though 100,000s of bases have qualities 2 and 4-34).

    I'd appreciate it if someone could help answer the following:

    (1) does it seem reasonable that our qualities are being capped at 34? I notice a previous post has comments from maubp/kmcarr pointing out that the maximum scores might be capped at 34 by a particular version of Bustard (http://seqanswers.com/forums/showthread.php?t=4679) - perhaps this is what's happening?

    (2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?

    (3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?

    I'm very new to this area, so I'm not sure what additional information would be helpful here. I should be able to get additional details on request,

    Thanks for your time!

  • #2
    Originally posted by Bio.X2Y View Post
    Hi there,

    I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3). The data is for a series of paired-end RNA-Seq runs, containing >100,000,000 reads in total.

    I'm just trying to get a feel for the information at the moment, and one of the first things I've noticed is that the quality scores are not what I expect.

    As I understand it (from the Wikipedia article on FastQ created by Torst), version 1.3+ of the GA pipeline encodes Phred quality scores from 0-62 using ASCII 64-126.

    Our files use the characters 66-98, which implies that all our bases have Phred qualities in the range 2 to 34 (inclusive). Also, none of the bases have a quality character of 67, implying no base has a Phred quality of 3 (even though 100,000s of bases have qualities 2 and 4-34).

    I'd appreciate it if someone could help answer the following:

    (1) does it seem reasonable that our qualities are being capped at 34? I notice a previous post has comments from maubp/kmcarr pointing out that the maximum scores might be capped at 34 by a particular version of Bustard (http://seqanswers.com/forums/showthread.php?t=4679) - perhaps this is what's happening?
    I guess I'll jump in here too. I don't know in which version of the pipeline Illumina started capping the Q score at 34. I mentioned v 1.6 because in that version of RTA they added a Q score plot to their status displays making the upper bound quite obvious. Since they have been using the same Q score scheme since v 1.3 I wouldn't be surprised if it was bounded there.

    (2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?
    I also see a lower limit of 2 in my data. Again, I'm guessing it is some bound imposed by their algorithm.

    (3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?
    I have seen this as well. I assumed it was some artifact of their algorithm which causes a discontinuity. In other words, you're not alone so you probably shouldn't worry about it too much.

    Please note that I have no real information to base any of these statements on. It's purely speculation.

    Comment


    • #3
      Originally posted by Bio.X2Y View Post
      I'm having a look at some FASTQ files generated from the Illumina GA pipeline (I think version 1.3).
      (2) is it normal to have a non-zero lower bound for observed quality scores (in our case, 2)?
      (3) is there an obvious reason why none of our bases has a quality of 3, even though every other quality in the range 2 to 34 is highly represented?
      Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

      "The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."

      Comment


      • #4
        That's a very useful bit of information - thanks Torst!

        Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+ but on the wikipedia page said 1.5+

        Comment


        • #5
          Originally posted by maubp View Post
          That's a very useful bit of information - thanks Torst! Are you sure about which version of the pipeline this was introduced in? Here you wrote 1.3+ but on the wikipedia page said 1.5+
          I'm the one who actually wrote the original Wiki page for FastQ and I added the BBBB stuff a couple of days ago. I am actually not sure now when it was added - I wrote 1.3+ here, but I now think it was either 1.4 or 1.5, so I wrote 1.5+ in Wikipedia. Another person came and adjusted the page after me and didn't change the 1.5, so hopefully they agreed with my guesstimate.

          As "B" is very poor quality anyway, if you ever see an exact run of BBBBB at the 3' end of a read, it is very likely to be the read quality indicator, so should be trimmed. In older versions of the pipeline, you would be unlikely to see such clear runs of BBBBB anyway - it would be a mix of A,B,C and so on.

          Comment


          • #6
            Originally posted by Torst View Post
            I'm the one who actually wrote the original Wiki page for FastQ and I added the BBBB stuff a couple of days ago. I am actually not sure now when it was added - I wrote 1.3+ here, but I now think it was either 1.4 or 1.5, so I wrote 1.5+ in Wikipedia.
            My question was perhaps unclear - I checked the history and saw you'd put 1.5+ on the wiki. But as you point out, either way people will probably trim off reads at Q2 level anyway.

            Here's a quick Biopython snippet to do this trimming:

            Comment


            • #7
              Originally posted by Torst View Post
              Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

              "The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."
              What does that "never be an isolated Q2 basecall" really mean? I have this quality score

              ^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

              Also how the region is marked? It seems that biopython example removes everything from B onwards. I have several reads which have B ending reads. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could (assuming that there is segment in a middle of read) split the read into two though. And how should I handle the read with quality score above?
              Last edited by Hena; 05-03-2010, 12:30 AM.

              Comment


              • #8
                Originally posted by Hena View Post
                What does that "never be an isolated Q2 basecall" really mean? I have this quality score

                ^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab
                That looks like an isolated B character.
                Originally posted by Hena View Post

                Also how the region is marked? It seems that biopython example removes everything from B onwards.
                well yes, that's what I wrote the script to, and what I claimed it would do.

                What version of the Illumnia pipeline did your data come from?

                Comment


                • #9
                  Originally posted by Hena View Post
                  What does that "never be an isolated Q2 basecall" really mean? I have this quality score

                  ^``B`bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb`b`bcaaa`aab_a`caacca_]aaa`aaNY]]_]ab

                  Also how the region is marked? It seems that biopython example removes everything from B onwards. I have several reads which have B ending reads. I would assume that the regions marked with continuous B are the bad ones and should be removed. This could (assuming that there is segment in a middle of read) split the read into two though. And how should I handle the read with quality score above?
                  The script provided is not doing the correct thing. You can have an isolated B, but ONLY as the LAST base in the read. Any RUN of Bs starting at the END of the read and working backwards are the bad ones.

                  Comment


                  • #10
                    Originally posted by Torst View Post
                    The script provided is not doing the correct thing. You can have an isolated B, but ONLY as the LAST base in the read. Any RUN of Bs starting at the END of the read and working backwards are the bad ones.
                    So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me - maybe as I suggested this data is from an older Illumina pipeline before the new use of B as a flag? Afterall the slide did say there would not be any isolated B characters (granted the final base likely be possible).
                    Last edited by maubp; 05-04-2010, 12:38 AM. Reason: Clarity

                    Comment


                    • #11
                      Originally posted by maubp View Post
                      So are any B characters in the middle simply PHRED quality 2 then? That was not clear to me.
                      Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0,1,2 are no longer used. B=2 is only used at the END of reads in a contiguous fashion backwards. I suspect the disallowed B as a quality value now, so you can tell the difference between older pipleline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".

                      Comment


                      • #12
                        Originally posted by Torst View Post
                        Actually, the pipeline guarantees there will NOT be any isolated Bs in the quality string anymore. Quality values 0,1,2 are no longer used. B=2 is only used at the END of reads in a contiguous fashion backwards. I suspect the disallowed B as a quality value now, so you can tell the difference between older pipleline quality strings (which use B for Q2) and recent pipelines which use it for the "read quality indicator".
                        That makes sense.

                        Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus using "B trimming" as implemented in my script on it isn't a good idea. Some sort of window based average quality threshold filtering would be more sensible.

                        However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?

                        (P.S. Apologies if my recent posts were too terse, typing on a mobile device is a pain - I'm back at a full keyboard now)
                        Last edited by maubp; 05-04-2010, 02:04 AM.

                        Comment


                        • #13
                          maubp,

                          Originally posted by maubp View Post
                          That makes sense. Do you agree that would mean Hena's data predates the new "Read Segment Quality Control Indicator" with the special meaning for character "B", and thus using "B trimming" as implemented in my script on it isn't a good idea. Some sort of window based average quality threshold filtering would be more sensible.
                          Yes, for reads from older pipelines that don't use the RSQC, your script is not useful.

                          However, if the FASTQ files from the latest Illumina pipeline really do only use B at the end of a read, my script is fine as is. Yes?
                          Correct - for data from recent pipelines that do use the RSQC, your script will do the right thing. But if you modify it to only work if the Bs are anchored to the end of the read, it will still do the same thing for new reads (as no isolated Bs) and won't do too much damage if used on older reads (the Bs, which are awful quality 2 anyway, would have to be at the end). I always prefer to program defensively.

                          Comment


                          • #14
                            Originally posted by Torst View Post
                            Correct - for data from recent pipelines that do use the RSQC, your script will do the right thing. But if you modify it to only work if the Bs are anchored to the end of the read, it will still do the same thing for new reads (as no isolated Bs) and won't do too much damage if used on older reads (the Bs, which are awful quality 2 anyway, would have to be at the end). I always prefer to program defensively.
                            I quite agree about defensive programming, and have worked out a neat way to do this alternative (trim trailing B characters only). I'll update the blog post.

                            Thanks for your advice Torst.

                            Peter

                            Comment


                            • #15
                              Do they still emit varying quality values for N bases?

                              That always confused me. Most were 4 I think, but we'd occasionally see N with quality all the way up to 10. I can only assume they change bases to N at some stage, but don't do anything with the Q value. It seemed broken at the time anyway, but maybe it's a bit saner now.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              33 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X