  • chris
    Member
    • Apr 2008
    • 52

    Quality score threshold?

    Hi everyone,

    This is my first post, so please be gentle.

    I'm working on some Solexa data for a collaborator and have noticed that the quality of the sequence reads (as determined by matches to the genome) drops off very quickly beyond ~25 nt.

    Now that I have the actual Solexa read quality scores, what kind of cut-offs do people use for throwing out 'junk' reads? Some informal discussions have suggested scores of 30 and above.
    Any thoughts?
    Thanks,

    Chris.
  • Chipper
    Senior Member
    • Mar 2008
    • 323

    #2
    I am interested in this too. Have you compared quality scores for the misaligned bases at different positions in the reads?

    I guess which sequences to use depends on the application. If you are only counting aligned positions (ChIP-seq, transcriptome etc.), it doesn't matter if the last part is crap as long as the alignment is correct.

    Comment

    • apfejes
      Senior Member
      • Feb 2008
      • 236

      #3
      We often run multiple Eland alignments, using read lengths of 32, 29, 26 and 23 (or some variation of these), and then identify the longest read where we get a unique match. This somewhat ignores your question.

      As chipper pointed out, this is really a strategy that's applicable for chip-seq or possibly transcriptome data. I wouldn't apply it to all analyses.

      There's a formula for converting the quality score to a probability, though, which you could use to figure out what probability you're comfortable with.

      I believe it is P = 1 / (1 + 10^(-Q/10))

      You might want to confirm that, however, before you do anything with it. (I only obtained the equation second hand.)
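A quick way to check the formula above is to tabulate it. This is only a minimal sketch: it assumes Q is a Solexa-style (log-odds) score and that P is the probability that the base call is correct, and the function name is made up for illustration.

```python
def solexa_to_prob_correct(q):
    """P = 1 / (1 + 10^(-Q/10)), as quoted in the post.

    Assumption: q is a Solexa-style log-odds score and the result is
    the probability that the base call is correct.
    """
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


# Tabulate a few scores to see which probability a given cut-off implies.
for q in (-5, 0, 10, 20, 30, 40):
    print(q, round(solexa_to_prob_correct(q), 4))
```

Under that reading, Q = 0 gives P = 0.5 and Q = 10 gives P = 10/11, so a cut-off can be chosen from whatever probability is acceptable.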
      The more you know, the more you know you don't know. —Aristotle

      Comment

      • chris
        Member
        • Apr 2008
        • 52

        #4
        Thanks for the replies.

        Chipper: no, I haven't looked at the scores for misaligned bases, as our analysis so far has only looked at two mismatches. It appears that the poor quality is concentrated in up to 6 nt at the 3' end, so our analysis doesn't find matches with these.

        apfejes: is the formula you mean the same as the one for converting Solexa scores to Phred scores shown here: http://maq.sourceforge.net/fastq.shtml ?

        I think the best course of action will be to test various cut-offs and see what I get. I'll post back here if I get anything useful.
        Cheers.

        Comment

        • apfejes
          Senior Member
          • Feb 2008
          • 236

          #5
          chris,

          The formula on that page is related, but not identical. (Obviously, since they're both performing very similar transformations.) However, I was referring to the older-format prb files, which contain values between -40 and 40, whereas the version you've indicated is used in the new Eland pipeline. (I can't recall version numbers for them offhand.)

          If your probabilities are displayed in a format consistent with what's on that page, however, then it's most likely the correct format to use. If you are using the old-style prb files, where each base is represented by four numbers, then the version I've written above is more likely to be correct.

          Cheers,
          Anthony
          The more you know, the more you know you don't know. —Aristotle

          Comment

          • chris
            Member
            • Apr 2008
            • 52

            #6
            Hi Anthony,

            The scores I have range from -5 to 40 which I believe is the current Solexa Genome Analyser quality score range, so I guess I'll stick with the 'new' formula.
            Thanks,

            Chris.

            Comment

            • chris
              Member
              • Apr 2008
              • 52

              #7
              Right. A quick looksee of the raw Solexa quality scores at a variety of cut-offs gives:

              Code:
              Q Cut-off    Frequency
              0            584079
              5            641244
              10           406655
              15           179174
              20            63783
              25            20300
              30             6389
              35             3454
              The frequency counts are for the number of sequence reads whose quality scores are *all* above the cut-off. Each sequence is only counted once and binned at the highest cut-off which it satisfies.
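The binning described above can be sketched roughly as follows. This is not the script that produced the table; the function names and the toy reads are invented for illustration, and it assumes "above the cut-off" means "at or above" (change `>=` to `>` for a strict comparison).

```python
from collections import Counter

CUTOFFS = (0, 5, 10, 15, 20, 25, 30, 35)


def highest_cutoff(scores, cutoffs=CUTOFFS):
    """Return the highest cut-off that *all* scores in a read satisfy.

    Assumption: a read satisfies a cut-off when its worst score is >= it.
    Returns None when the read fails even the lowest cut-off.
    """
    worst = min(scores)
    passed = [c for c in cutoffs if worst >= c]
    return max(passed) if passed else None


def bin_reads(reads, cutoffs=CUTOFFS):
    """Count each read once, in the bin of the highest cut-off it satisfies."""
    counts = Counter()
    for scores in reads:
        counts[highest_cutoff(scores, cutoffs)] += 1
    return counts


# Toy reads (made-up scores, not real data).
toy_reads = [[40, 40, 38], [12, 30, 40], [7, 40, 40], [-3, 40, 40]]
for cutoff, n in sorted(bin_reads(toy_reads).items(), key=lambda kv: (kv[0] is None, kv[0])):
    print(cutoff, n)
```

Because each read is binned by its single worst base, one poor call near the 3' end is enough to push an otherwise good read into a low bin.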

              I'm a bit worried, as the majority of the data has a quality score of <10. A Solexa score of 10 is equivalent to a Phred score of about 10.4, or roughly 90% accuracy.
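The arithmetic behind that equivalence can be checked with a short sketch. It assumes the Solexa-to-Phred conversion Q_phred = 10 * log10(10^(Q_solexa/10) + 1), which, as far as I know, is the one given on the MAQ page linked earlier in the thread; the function names are illustrative only.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa (log-odds) score to a Phred score.

    Assumption: Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_accuracy(q_phred):
    """Phred definition: error probability = 10^(-Q/10); accuracy = 1 - error."""
    return 1.0 - 10.0 ** (-q_phred / 10.0)


q = solexa_to_phred(10)
print(round(q, 1), round(phred_to_accuracy(q), 3))  # 10.4 0.909
```

The two scales differ most at low scores and converge at high ones, so the choice of formula matters mainly for the poor-quality reads in question here.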

              Does anyone else get this kind of quality or is this really a bad run?

              Comment

              • bioinfosm
                Senior Member
                • Jan 2008
                • 483

                #8
                Chris,
                Is this data from all 8 lanes? Did you convert the prb values back from Solexa to Q values? And by *all*, do you mean the entire 36 bp read?
                Let me know and I can get similar quality scores for the data.
                --
                bioinfosm

                Comment

                • chris
                  Member
                  • Apr 2008
                  • 52

                  #9
                  I'm not exactly sure. This data is kind of second hand and from a file called 's_3_sequence.txt'. There are 2.2M reads in the form:
                  Code:
                  HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 33 40 40 40 31 29 40 11 38 7 35 22
                  And these are all 33bp reads.
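For anyone wanting to work with lines in this layout, here is a minimal parsing sketch. It assumes each line is colon-separated, with the sequence and the space-separated scores as the last two fields (inferred only from the single example above); `parse_read` is a made-up helper name.

```python
def parse_read(line):
    """Split one line into (read_id, sequence, scores).

    Assumption: the last two colon-separated fields are the sequence and
    the space-separated integer quality scores; everything before them
    is treated as the read identifier.
    """
    read_id, seq, quals = line.strip().rsplit(":", 2)
    scores = [int(q) for q in quals.split()]
    if len(scores) != len(seq):
        raise ValueError("expected one quality score per base")
    return read_id, seq, scores


line = (
    "HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:"
    "23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 "
    "33 40 40 40 31 29 40 11 38 7 35 22"
)
read_id, seq, scores = parse_read(line)
print(read_id, len(seq), min(scores))
```

For the example read, the lowest score is 7, at the third base from the 3' end, so that one base decides which cut-off bin the read falls into.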
                  Thanks for your help bioinfosm

                  Comment
