  • Quality score threshold?

    Hi everyone,

    This is my first post, so please be gentle

    I'm working on some Solexa data for a collaborator and have noticed that the quality of the sequence reads (as determined by matches to the genome) drops off very quickly beyond ~25nt.

    Now that I have the actual Solexa read quality scores, what kind of cut-offs do people use for throwing out 'junk' reads? Some informal discussions have suggested scores of 30 and above.
    Any thoughts?
    Thanks,

    Chris.

  • #2
    I am interested in this too. Have you compared quality scores for the misaligned bases at different positions in the reads?

    I guess which sequences to use depends on the application. If you are only counting aligned positions (ChIP-seq, transcriptome, etc.), it doesn't matter if the last part is crap as long as the alignment is correct.



    • #3
      We often run multiple Eland alignments, using read lengths of 32, 29, 26 and 23 (or some variation of the above), and then identify the longest read length at which we get a unique match. This somewhat ignores your question.

      As chipper pointed out, this is really a strategy that's applicable for chip-seq or possibly transcriptome data. I wouldn't apply it to all analyses.
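The selection step described above could be sketched roughly as follows. This is only an illustration: `longest_unique_length` and the `hits_by_length` mapping are hypothetical names, the hit counts are made up, and nothing here parses real Eland output.

```python
def longest_unique_length(hits_by_length):
    """Return the longest trimmed read length with exactly one match.

    hits_by_length is a hypothetical mapping of read length -> number of
    genome matches found at that length. Returns None if no length
    gives a unique match.
    """
    unique = [length for length, hits in hits_by_length.items() if hits == 1]
    return max(unique) if unique else None


# Made-up hit counts for one read aligned at four lengths.
example = {32: 0, 29: 0, 26: 1, 23: 3}
print(longest_unique_length(example))  # 26
```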

      There's a formula for converting the quality score to a probability, though, which you could use to figure out what probability you're comfortable with.

      I believe it is P = 1 / (1 + 10^(-Q/10))

      You might want to confirm that, however, before you do anything with it. (I only obtained the equation second hand.)
      The more you know, the more you know you don't know. —Aristotle
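A minimal sketch of the formula quoted in post #3, assuming Q is a Solexa-scaled score defined as Q = -10 log10(e / (1 - e)), where e is the error probability. Under that assumption P comes out as the probability that the base call is correct. The function name is made up for illustration, and, as the post says, the formula should be confirmed against your own file format first.

```python
def solexa_to_p_correct(q):
    """P = 1 / (1 + 10^(-Q/10)), as quoted in the thread.

    Assumes q is a Solexa-scaled score; under that assumption the
    result is the probability that the base call is correct.
    """
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


for q in (-5, 0, 10, 20, 30, 40):
    print(q, round(solexa_to_p_correct(q), 4))
```

With this definition a score of 0 corresponds to P = 0.5, so negative scores mean the call is more likely wrong than right.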



      • #4
        Thanks for the replies.

        Chipper: no, I haven't looked at the scores for misaligned bases, as in our analysis so far we only looked at two mismatches. It appears that the poor quality is concentrated at the 3' end, in up to the last 6nt, so our analysis doesn't find matches with these.

        apfejes: is the formula you mean the same as the one for converting Solexa scores to Phred scores, as shown here: http://maq.sourceforge.net/fastq.shtml ?

        I think the best course of action will be to test various cut-offs and see what I get. I'll post back here if I get anything useful.
        Cheers.



        • #5
          chris,

          The formula on that page is related, but not identical. (Obviously, since they're both performing very similar transformations.) However, I was referring to the older format prb files, which contain values between -40 and 40, whereas the version you've indicated is used in the new eland pipeline. (I can't recall version numbers for them off hand.)

          If your probabilities are displayed in a format consistent with what's on that page, however, then it's most likely the correct format to use. If you are using the old-style prb files, where each base is represented by four numbers, then the version I've written above is more likely to be correct.

          Cheers,
          Anthony
          The more you know, the more you know you don't know. —Aristotle



          • #6
            Hi Anthony,

            The scores I have range from -5 to 40 which I believe is the current Solexa Genome Analyser quality score range, so I guess I'll stick with the 'new' formula.
            Thanks,

            Chris.



            • #7
              Right. A quick looksee of the raw Solexa quality scores at a variety of cut-offs gives:

              Code:
              Q Cut-off    Frequency
              0            584079
              5            641244
              10           406655
              15           179174
              20            63783
              25            20300
              30             6389
              35             3454
              The frequency counts are for the number of sequence reads whose quality scores are *all* above the cut-off. Each sequence is only counted once and binned at the highest cut-off which it satisfies.
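The binning rule described above could be sketched like this. It assumes "above the cut-off" means greater than or equal to it, and that reads whose lowest score falls below the smallest cut-off are left uncounted; both are assumptions, and the function names and sample scores are made up.

```python
from collections import Counter

CUTOFFS = (0, 5, 10, 15, 20, 25, 30, 35)


def highest_cutoff(quals, cutoffs=CUTOFFS):
    """Highest cut-off that every score in the read meets (>=, assumed)."""
    lowest = min(quals)
    passed = [c for c in cutoffs if lowest >= c]
    return max(passed) if passed else None


def bin_reads(reads, cutoffs=CUTOFFS):
    """Count each read once, at the highest cut-off it satisfies."""
    counts = Counter()
    for quals in reads:
        c = highest_cutoff(quals, cutoffs)
        if c is not None:  # reads below the lowest cut-off are skipped
            counts[c] += 1
    return counts


# Made-up per-base scores for four short reads.
reads = [[40, 40], [12, 30], [7, 40], [-5, 1]]
for cutoff, n in sorted(bin_reads(reads).items()):
    print(cutoff, n)
```

A read is placed by its single worst base, so one poor call at the 3' end is enough to push an otherwise good read into a low bin.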

              I'm a bit worried, as the majority of the reads contain at least one base with a quality score of <10. A Solexa score of 10 is equivalent to a Phred score of 10.4, or roughly 90% accuracy.
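The arithmetic behind that figure can be checked with a short sketch, assuming the Solexa-to-Phred conversion Q_phred = 10 log10(10^(Q_solexa/10) + 1) given on the MAQ page linked earlier in the thread. The function names are made up for illustration.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to a Phred score (assumed formula)."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_accuracy(q_phred):
    """Accuracy = 1 - error probability, where error = 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q_phred / 10.0)


q = solexa_to_phred(10)
print(round(q, 1), round(phred_to_accuracy(q), 3))  # 10.4 0.909
```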

              Does anyone else get this kind of quality or is this really a bad run?



              • #8
                Chris,
                Is this data from all 8 lanes? Did you convert the prb values from Solexa back to Q values? And by *all*, do you mean the entire 36bp read?
                Let me know and I can get similar quality scores for the data.
                --
                bioinfosm



                • #9
                  I'm not exactly sure. This data is kind of second hand and from a file called 's_3_sequence.txt'. There are 2.2M reads in the form:
                  Code:
                  HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 33 40 40 40 31 29 40 11 38 7 35 22
                  And these are all 33bp reads.
                  Thanks for your help bioinfosm
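A minimal sketch for reading one line in the format shown in post #9, assuming the last two colon-separated fields are the sequence and the space-separated per-base scores, and that everything before them is the read identifier. The function name is made up, and the meaning of the identifier's sub-fields is not interpreted here.

```python
def parse_read(line):
    """Split one line into (read_id, sequence, list of integer scores)."""
    read_id, seq, quals = line.strip().rsplit(":", 2)
    scores = [int(s) for s in quals.split()]
    if len(scores) != len(seq):
        raise ValueError("expected one score per base")
    return read_id, seq, scores


line = ("HWI-EAS111_2:3:17:156:119:AGTGAGGTAGTAGATTGTATAGTTTCGTATGCC:"
        "23 40 40 40 40 40 40 40 26 40 40 40 40 21 40 40 40 40 40 40 40 "
        "33 40 40 40 31 29 40 11 38 7 35 22")
read_id, seq, scores = parse_read(line)
print(read_id, len(seq), min(scores))
```

For the example read, the lowest score is 7, so it would only satisfy a cut-off of 5 under an "all bases at or above the cut-off" rule.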
