
  • bowtie-0.12.7 paired-end colorspace quality bigger than 4.0

    Sometimes, in colorspace mode, bowtie reports quality values bigger than 4.0.
    Since, I think, this corresponds to probabilities bigger than 1,
    something has gone wrong.

    The bowtie (0.12.9) manual says "the decoded nucleotide quality is either the sum of the overlapping color qualities....."
    Should I simply divide the reported value by 2 ?

    An example:
    bowtie --seed 102509 -C Mycoplasma_allC
    -1 SRR096735.21239778_1.filt.fastq
    -2 SRR096735.21239778_2.filt.fastq

    where SRR096735.21239778_1.filt.fastq contains:
    @SRR096735.21239778 VAB_0312_20101203_1_SP_ANG_TG_NA20517_3_1sA_01003380619_497_712_492/1
    T11222222222222222222222222222222222222222222222222
    +
    !=;@@=?<@>=?@>?@@A?@?@@??36>=@??A@>??@=???=<??>@<<>

    SRR096735.21239778_2.filt.fastq contains:
    @SRR096735.21239778 VAB_0312_20101203_1_SP_ANG_TG_NA20517_3_1sA_01003380619_497_712_492/2
    T00122222222222222222222222222222222
    +
    !BB@ABBB?BABB6@?BB:9@BB9;?AB;9+>A<?;


    The output is:
    SRR096735.21239778 VAB_0312_20101203_1_SP_ANG_TG_NA20517_3_1sA_01003380619_497_712_492/1 + gi|339320528|ref|NC_015725.1| 187309 CTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT Z_\[Z[]Z[^]\^_`_^^^_^]QHSZ\^]_`]\]^\[]][XZ]\][WY 1
    SRR096735.21239778 VAB_0312_20101203_1_SP_ANG_TG_NA20517_3_1sA_01003380619_497_712_492/2 + gi|339320528|ref|NC_015725.1| 187326 ACTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT a`bcc``bbcWU^`c[RXacZSY_b\SCH^\ZY 1 0:T>A

    Notice several quality values are Z (ascii 90, quality 5.7)

    I am using 64-bit bowtie 0.12.7 under Linux.

    All comments and any help very welcome
    Bill

  • #2
    Hi,

    I'm afraid I don't know whether you should divide the quality values by 2 or not.

    A base quality value of ascii 90 should convert to phred 90 - 33 = 57.


    A quality value of 4 is actually quite low.

    Base quality values (phred quality) are defined as minus 10 times the log
    of the probability that the base call is an error.
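A minimal sketch of that definition, assuming the standard phred+33 FASTQ encoding (the function names are illustrative and not taken from any library). It converts a quality character to a phred score and a phred score to an error probability:

```python
import math

PHRED33_OFFSET = 33  # assumption: standard phred+33 ("Sanger") encoding


def phred_from_char(ch: str, offset: int = PHRED33_OFFSET) -> int:
    """Convert one FASTQ quality character to its phred score."""
    return ord(ch) - offset


def error_probability(q: float) -> float:
    """Probability that the base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_from_probability(p: float) -> float:
    """Inverse mapping: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for ch in "Z6":
        q = phred_from_char(ch)
        print(ch, q, error_probability(q))
```

Because Q is minus ten times a base-10 logarithm, every non-negative score maps to a probability between 0 and 1, so a large score never implies a probability above 1.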

    see:



    and p. 52 of the SOLiD manual here:




    Hope this helps,
    Maria



    • #3
      Dear Maria,
      Apologies for the delay in replying. I am less familiar with colorspace coding.
      Thank you for the pointers.

      I got myself confused with negative log probabilities. If I have understood correctly,
      a quality of 4 is 4.0 (which I think means a probability of 1.0e-4). This appears to be
      the best (smallest) value reported by some Solexa scanners. So ascii 90 (Z) quite legally
      means a probability of 2e-06? 10**(-5.7)
      Thanks again
      Bill



      • #4
        bowtie-0.12.7 paired-end colorspace quality bigger than 4.0

        Hi Bill,

        I think you are still a little confused.

        Low quality values are bad qualities, a higher probability that the base call is an error.

        High quality values are good, a low probability that the base call is an error.

        A quality value of 30 (-10 * log10(1/1000)) is good, because it represents a probability of 1/1000 that the base call is an error.

        Solexa/Illumina qualities are even more confusing, because they have used at least 3 different quality scales over the years, before finally embracing the phred+33 encoding that everyone else uses.

        In the pre-v1.3 Solexa output you could get negative quality values, as low as -5, which would have been represented as ascii 59 (-5 + 64). The low end of that scale still represented very poor quality base calls.
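A short sketch of how the old Solexa scale relates to phred, assuming the usual definition of a Solexa score as -10 * log10(p / (1 - p)) with an ASCII offset of 64. The function names are made up for illustration:

```python
import math

SOLEXA_OFFSET = 64  # assumption: pre-v1.3 Solexa/Illumina ASCII offset


def solexa_from_char(ch: str) -> int:
    """Decode one Solexa+64 quality character; the result can be negative."""
    return ord(ch) - SOLEXA_OFFSET


def solexa_error_probability(q_solexa: float) -> float:
    """Invert Q_solexa = -10 * log10(p / (1 - p)) to recover p."""
    return 1 / (1 + 10 ** (q_solexa / 10))


def solexa_to_phred(q_solexa: float) -> float:
    """Re-express a Solexa score on the phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


if __name__ == "__main__":
    q = solexa_from_char(";")  # ascii 59
    print(q, solexa_error_probability(q), solexa_to_phred(q))
```

The two scales differ noticeably only at the low end. For high scores they are nearly identical, which is part of why mixed-up encodings are easy to miss.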

        See this old blog post about the confusion of having 3 different quality scales:

        My notes on the sequence/quality data from Illumina genome analyser. Phred Phred is a program that takes the trace files produced by traditional DNA sequencing, calls the bases and assigns a qualit…


        hope this helps,
        Maria



        • #5
          Dear Maria,
          Thank you for the pointer to Caroline's blog entry.
          If I have understood it correctly, my fastq files must be "standard fastq",
          since some of the quality values are '6' (ascii 54).
          This seems fine.
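That reasoning can be written as a rough heuristic. It assumes that characters below ascii 59 cannot occur in the old offset-64 encodings, so their presence points to phred+33. The function name and the threshold handling are illustrative, and the check cannot rule out phred+33 when every value happens to be high:

```python
from typing import Iterable, Optional


def guess_offset(quality_lines: Iterable[str]) -> Optional[int]:
    """Return 33 if the qualities can only be phred+33, else None (ambiguous)."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Offset-64 scales start at ascii 59 (Solexa -5) or 64 (phred 0),
    # so anything lower must come from a phred+33 file.
    if lowest < 59:
        return 33
    return None


if __name__ == "__main__":
    # Quality lines of the two reads quoted at the top of the thread.
    read1 = "!=;@@=?<@>=?@>?@@A?@?@@??36>=@??A@>??@=???=<??>@<<>"
    read2 = "!BB@ABBB?BABB6@?BB:9@BB9;?AB;9+>A<?;"
    print(guess_offset([read1, read2]))
```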

          So what of the output from Bowtie? It is given two DNA sequences
          (one of 51 colorspace characters, the other of 36) and claims to have found
          an overlapping match of 48bp and 33bp, with quality values that include
          Z (ascii 90). Given the high values (ascii 97 and above) bowtie outputs,
          it really does look like it is adding together (as the bowtie manual says)
          the already high values in its input files.

          Thank you very much

          Bill



          • #6
            bowtie-0.12.7 paired-end colorspace

            I hope I haven't made things even more confusing by digressing into the various Illumina quality scales, because when I looked back at your original post, I noticed that you have SOLiD data, and they seem to use the standard quality encodings.

            So ascii 54 should translate to quality 54 -33 = 21.




            As for the Bowtie output, I agree with you: it does look like it has added the quality scores.
            However, I haven't worked with paired-end SOLiD data, and I'm not entirely sure what the bowtie manual means. It could just be referring to the process of deriving per-base qualities from the SOLiD di-base encoding, where an observed colour represents two adjacent bases. The SOLiD qualities might just be very high because each base gets called twice: once as a pair with the base to the 5' side of it, and a second time as a pair with the base to the 3' side of it. Sorry I can't be of more help here.
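One way to check the "added together" idea against the data in this thread is to sum the phred scores of each pair of adjacent colours in read /1 and compare the result with the quality string bowtie reported. This is only a sketch under assumptions inferred from the example: phred+33 encoding, the primer placeholder and first colour are dropped, and each decoded base takes the sum of its two overlapping colour qualities. It ignores whatever bowtie does at mismatched colours, so read /2 (which reports 0:T>A) is not attempted. The function name is illustrative.

```python
OFFSET = 33  # assumption: phred+33 on both input and output

# Quality line of read /1 and the qualities bowtie printed for it (from the post above).
COLOR_QUALS = "!=;@@=?<@>=?@>?@@A?@?@@??36>=@??A@>??@=???=<??>@<<>"
BOWTIE_QUALS = r"Z_\[Z[]Z[^]\^_`_^^^_^]QHSZ\^]_`]\]^\[]][XZ]\][WY"


def summed_adjacent_qualities(color_quals: str, skip: int = 2, offset: int = OFFSET) -> str:
    """Sum the scores of neighbouring colours and re-encode them as characters.

    skip=2 drops the primer placeholder and the first colour (an assumption
    chosen because it lines up with the output in this example).
    """
    scores = [ord(ch) - offset for ch in color_quals[skip:]]
    return "".join(chr(a + b + offset) for a, b in zip(scores, scores[1:]))


if __name__ == "__main__":
    predicted = summed_adjacent_qualities(COLOR_QUALS)
    print(predicted)
    print(predicted == BOWTIE_QUALS)
```

For read /1 the summed string matches the reported one character for character, which supports the reading that these are sums of two ordinary phred scores and not probabilities above 1. Dividing by 2 would give back roughly the per-colour quality.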
            Last edited by mastal; 03-29-2013, 04:43 AM. Reason: spelling of Bowtie
