  • Quality score threshold?

    Hi everyone,

    This is my first post, so please be gentle.

    I'm working on some Solexa data for a collaborator and have noticed that the quality of the sequence reads (as determined by matches to the genome) drops off very quickly beyond ~25 nt.

    Now that I have the actual Solexa read quality scores, what kind of cut-offs do people use for throwing out 'junk' reads? Some informal discussions have suggested scores of 30 and above.
    Any thoughts?
    Thanks,

    Chris.

  • #2
    I am interested in this also. Have you compared quality scores for the misaligned bases at different positions in the reads?

    I guess which sequences to use depends on the application. If you are only counting aligned positions (ChIP-seq, transcriptome, etc.), it doesn't matter if the last part is crap as long as the alignment is correct.

    • #3
      We often run multiple Eland alignments, using read lengths of 32, 29, 26 and 23 (or some variation of these), and then identify the longest read length at which we get a unique match. This somewhat ignores your question.

      As chipper pointed out, this is really a strategy that's applicable to ChIP-seq or possibly transcriptome data. I wouldn't apply it to all analyses.
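The "longest length with a unique match" selection described above can be sketched as follows. This is only an illustration: it does not run Eland, and it assumes you have already collected, for each read, the number of genome matches reported at each trimmed length. The function name and the example hit counts are hypothetical.

```python
def longest_unique_length(hits_by_length):
    """Return the longest trimmed length with exactly one match, or None.

    hits_by_length maps a trimmed read length (e.g. 32, 29, 26, 23) to the
    number of genome matches reported for the read at that length.
    """
    for length in sorted(hits_by_length, reverse=True):
        if hits_by_length[length] == 1:
            return length
    return None


# Hypothetical hit counts for one read, one entry per alignment run.
example_hits = {32: 0, 29: 0, 26: 1, 23: 3}
print(longest_unique_length(example_hits))  # prints 26
```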

      There's a formula for converting the quality score to a probability, though, which you could use to figure out what probability you're comfortable with.

      I believe it is P = 1 / (1 + 10^(-Q/10))

      You might want to confirm that, however, before you do anything with it. (I only obtained the equation second hand.)
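As a quick way to get a feel for the formula above, here is a minimal Python sketch. It assumes Q is a log-odds style Solexa score and reads P as the probability that the base call is correct; the function name is made up for illustration.

```python
def solexa_to_probability(q):
    """P = 1 / (1 + 10^(-Q/10)), as quoted in the post above."""
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


# Print P for a handful of scores to see where a cut-off might sit.
for q in (-5, 0, 10, 20, 30, 40):
    print(f"Q={q:>3}  P={solexa_to_probability(q):.4f}")
```

Under this formula a score of 0 gives P = 0.5 and a score of 10 gives P = 10/11, so a threshold can be chosen by deciding what per-base probability is acceptable.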
      The more you know, the more you know you don't know. —Aristotle

      • #4
        Thanks for the replies.

        Chipper: No, I haven't looked at the scores for misaligned bases, because our analysis so far has only looked at two mismatches. It appears that the poor quality is concentrated at the 3' end, over up to 6 nt, so our analysis doesn't find matches for these reads.

        apfejes: Is the formula you mean the same as the one for converting Solexa scores to Phred scores, as shown here: http://maq.sourceforge.net/fastq.shtml ?

        I think the best course of action will be to test various cut-offs and see what I get. I'll post back here if I get anything useful.
        Cheers.

        • #5
          chris,

          The formula on that page is related, but not identical. (Obviously, since they're both performing very similar transformations.) However, I was referring to the older-format prb files, which contain values between -40 and 40, whereas the version you've indicated is used in the new Eland pipeline. (I can't recall version numbers for them offhand.)

          If your probabilities are displayed in a format consistent with what's on that page, however, then it's most likely the correct format to use. If you are using the old-style prb files, where each base is represented by four numbers, then the version I've written above is more likely to be correct.
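For the old-style case, where each base is represented by four numbers, a minimal sketch might look like the following. It assumes the four values are given in the order A, C, G, T (an assumption, so check it against your own files) and applies the formula from post #3 to each value. The function names are hypothetical.

```python
def prb_to_probabilities(scores):
    """Convert four per-base scores (-40..40) to probabilities
    using P = 1 / (1 + 10^(-Q/10))."""
    return [1.0 / (1.0 + 10.0 ** (-q / 10.0)) for q in scores]


def call_base(scores, bases="ACGT"):
    """Pick the most probable base; the A, C, G, T order is an assumption."""
    probs = prb_to_probabilities(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return bases[best], probs[best]


base, prob = call_base((-40, 40, -40, -40))
print(base, round(prob, 4))  # prints: C 0.9999
```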

          Cheers,
          Anthony

          • #6
            Hi Anthony,

            The scores I have range from -5 to 40 which I believe is the current Solexa Genome Analyser quality score range, so I guess I'll stick with the 'new' formula.
            Thanks,

            Chris.

            • #7
              Right. A quick looksee of the raw Solexa quality scores at a variety of cut-offs gives:

              Code:
              Q Cut-off    Frequency
              0            584079
              5            641244
              10           406655
              15           179174
              20            63783
              25            20300
              30             6389
              35             3454
              The frequency counts are for the number of sequence reads whose quality scores are *all* above the cut-off. Each sequence is only counted once and binned at the highest cut-off which it satisfies.
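The binning just described can be sketched roughly like this. It is not the script used to produce the table; it assumes each read is already a list of integer quality scores and treats "above the cut-off" as "at or above" (an assumption). Reads whose lowest score is below every cut-off are left uncounted.

```python
from collections import Counter

CUTOFFS = (0, 5, 10, 15, 20, 25, 30, 35)


def highest_cutoff(scores, cutoffs=CUTOFFS):
    """Return the highest cut-off that every score in the read meets, or None."""
    lowest = min(scores)
    passed = [c for c in cutoffs if lowest >= c]
    return max(passed) if passed else None


def bin_reads(reads, cutoffs=CUTOFFS):
    """Count each read once, at the highest cut-off it satisfies."""
    counts = Counter()
    for scores in reads:
        cutoff = highest_cutoff(scores, cutoffs)
        if cutoff is not None:
            counts[cutoff] += 1
    return counts
```

Because a read is binned by its single lowest score, one poor base near the 3' end is enough to drop an otherwise good read into a low bin.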

              I'm a bit worried, as the majority of the data has a quality score of <10. A Solexa score of 10 is equivalent to a Phred score of 10.4, or roughly 91% accuracy.
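The arithmetic behind that Phred figure can be checked with a short sketch, using the Solexa-to-Phred relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1) and the usual Phred definition of error probability, 10^(-Q/10). The function names are made up for illustration.

```python
import math


def solexa_to_phred(q_solexa):
    """Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_accuracy(q_phred):
    """Accuracy = 1 - error probability, with error = 10^(-Q_phred / 10)."""
    return 1.0 - 10.0 ** (-q_phred / 10.0)


q = solexa_to_phred(10)
print(round(q, 1), round(phred_to_accuracy(q), 3))  # prints: 10.4 0.909
```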

              Does anyone else get this kind of quality or is this really a bad run?

              • #8
                Chris,
                Is this the data from all 8 lanes? Did you convert the prb values back from Solexa to Q values? By *all*, do you mean the entire 36bp read?
                Let me know and I can get similar quality scores for the data.
                --
                bioinfosm

                • #9
                  I'm not exactly sure. This data is kind of second hand and from a file called 's_3_sequence.txt'. There are 2.2M reads in the form:
                  Code:
                  HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 33 40 40 40 31 29 40 11 38 7 35 22
                  And these are all 33bp reads.
                  Thanks for your help, bioinfosm.
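For anyone wanting to work with records in the form shown above, here is a minimal parsing sketch. It assumes every line is colon-separated, with the read sequence second to last and the space-separated integer scores last; that layout is inferred from the single example record, and the function name is hypothetical.

```python
def parse_read(line):
    """Split one record into (name, sequence, scores)."""
    fields = line.strip().split(":")
    name = ":".join(fields[:-2])      # machine, lane, tile and coordinates
    sequence = fields[-2]
    scores = [int(s) for s in fields[-1].split()]
    if len(scores) != len(sequence):
        raise ValueError("expected one quality score per base")
    return name, sequence, scores


record = (
    "HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:"
    "23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 "
    "33 40 40 40 31 29 40 11 38 7 35 22"
)
name, sequence, scores = parse_read(record)
print(name, len(sequence), min(scores))  # prints: HWI-EAS111_2:3:17:156:119 33 7
```

The lowest score in this example read is 7, so under the binning in post #7 it would count towards the cut-off of 5 despite most of its bases scoring 40.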
