  • Illumina quality scores

    I wonder if someone with more intimate knowledge of the Solexa pipeline could shed some light on the different varieties of quality scores it produces and how they relate to one another. Just to be clear, I'm not referring to the difference between Solexa and Phred scores or the conversion to ASCII. From my limited knowledge, there appear to be at least two types of Q-scores produced by the pipeline: intensity-based (found in .prb files from Bustard) and alignment-based (found in fastq files from Gerald). There also seems to be some kind of quality calibration going on (using a "precalculated calibration table"?).
    To give some context, I am working with paired-end reads from a bacterial genome using the v1.3 pipeline. I am finding the fastq quality scores are much lower than those from the .prb files (almost entirely Q22 compared to Q40). I'm wondering which scores better represent the quality and why Q22 would be so over-represented in the fastq.

    Thanks!

    BTW, here is a snippet of my fastq file in case my interpretation is wrong:

    @Paired_run:7:1:305:1931/1
    GAAATAGATGAAGATTTAATTATTGCTCCTAAAT
    +Paired_run:7:1:305:1931/1
    VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU
    @Paired_run:7:1:315:1920/1
    GACTAAACTGTAGCAATGGTTTAAATGATGATCT
    +Paired_run:7:1:315:1920/1
    VVVVVVVVVVVVVVVVVVVVVVVVVVVVVUUUUU
    @Paired_run:7:1:341:1932/1
    GCTAATGATGTTCTTGATAATTTAAACAAAATTG
    +Paired_run:7:1:341:1932/1
    VVVVVVVVVVVVVVVUVVVVVVVVVVVVVVUUUS
    @Paired_run:7:1:302:1939/1
    GAAATAGATGAAGATTTAATTATTGCTCCTAAAT
    +Paired_run:7:1:302:1939/1
    VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU
    @Paired_run:7:1:212:1540/1
    GTTAGAATTAATCAAATTGTATGGATGTGTGTAG
    +Paired_run:7:1:212:1540/1
    VUVVVVVVVVVVVVVVVVVUVVUUVVRVSVRUUS
    @Paired_run:7:1:173:757/1
    GTAGACGTATCAGGAGTTTCTAAAGGTAAGGGAT
    +Paired_run:7:1:173:757/1
    VVVVVVVVVVVUVVVVVVVVVVVVVUVVVVUUUU
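
To make the "almost entirely Q22" reading concrete, here is a minimal sketch that decodes one quality line from the snippet above. It assumes the fastq uses the offset-64 encoding discussed later in this thread; the function name `decode_quality` is made up for illustration.

```python
from collections import Counter

OFFSET = 64  # assumption: pipeline v1.3 fastq stores scores as chr(score + 64)


def decode_quality(qual, offset=OFFSET):
    """Return the numeric score for each character of a quality string."""
    return [ord(ch) - offset for ch in qual]


# Quality line of the first record in the snippet above.
qual = "VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU"
scores = decode_quality(qual)

# Tally how often each score occurs in this read.
for score, count in sorted(Counter(scores).items()):
    print(f"Q{score}: {count} bases")
```

Under that assumption 'V' decodes to 22 and 'U' to 21, which matches the Q22 plateau described in the post.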

  • #2
    I didn't know Gerald could produce fastq files directly. We use a Perl script to extract information from the *_ub_custom_qseq.txt files produced by Gerald and convert it to fastq format (discarding the non-PF reads in the process). The ASCII scores in the qseq files are offset by 64.

    Can you post the Gerald config file you used to create the fastq?

    SillyPoint
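
The Perl script itself is not shown in this thread, so the following is only a rough Python sketch of that kind of conversion. It assumes the usual 11-column, tab-separated qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) with '.' marking an uncalled base; the function name and the read-name format are invented for the example.

```python
def qseq_to_fastq(line):
    """Convert one qseq line to a fastq record.

    Returns None for reads that did not pass the filter (non-PF reads).
    The quality string is copied unchanged, so it stays offset-64.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read, seq, qual, pf = fields
    if pf != "1":
        return None  # discard non-PF reads
    # Hypothetical read name; any unique identifier would do.
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}/{read}"
    seq = seq.replace(".", "N")  # qseq writes uncalled bases as '.'
    return f"@{name}\n{seq}\n+\n{qual}\n"


def convert(lines):
    """Yield fastq records for all PF reads in an iterable of qseq lines."""
    for line in lines:
        record = qseq_to_fastq(line)
        if record is not None:
            yield record
```

A file could be processed with something like `out.writelines(convert(open(path)))`, where `path` and `out` are whatever input file and output handle are in use.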



    • #3
      Gerald can generate fasta, fastq, or scarf (default) files.

      For fastq files, put the line:

      12345678:SEQUENCE_FORMAT --fastq

      in your Gerald config file.

      Christine
      Christine Brennan
      UM DNA Sequencing Core
      Ann Arbor, MI 48109

      [email protected]



      • #4
        I looked for the meaning of Illumina quality scores and couldn't find any direct translation, so here it is (in case it is of any use to someone else).

        Illumina quality score dictionary:

        ASCII character / numeric score (ASCII code minus 64) / probability that the base call is wrong
        @ 0 1
        A 1 0.7943282347
        B 2 0.6309573445
        C 3 0.5011872336
        D 4 0.3981071706
        E 5 0.316227766
        F 6 0.2511886432
        G 7 0.1995262315
        H 8 0.1584893192
        I 9 0.1258925412
        J 10 0.1
        K 11 0.0794328235
        L 12 0.0630957344
        M 13 0.0501187234
        N 14 0.0398107171
        O 15 0.0316227766
        P 16 0.0251188643
        Q 17 0.0199526231
        R 18 0.0158489319
        S 19 0.0125892541
        T 20 0.01
        U 21 0.0079432823
        V 22 0.0063095734
        W 23 0.0050118723
        X 24 0.0039810717
        Y 25 0.0031622777
        Z 26 0.0025118864
        [ 27 0.0019952623
        \ 28 0.0015848932
        ] 29 0.0012589254
        ^ 30 0.001
        _ 31 0.0007943282
        ` 32 0.0006309573
        a 33 0.0005011872
        b 34 0.0003981072
        c 35 0.0003162278
        d 36 0.0002511886
        e 37 0.0001995262
        f 38 0.0001584893
        g 39 0.0001258925
        h 40 0.0001
        i 41 7.94328234724282E-005
        j 42 6.30957344480193E-005
        k 43 5.01187233627272E-005
        l 44 3.98107170553497E-005
        m 45 3.16227766016837E-005
        n 46 2.51188643150957E-005
        o 47 1.99526231496888E-005
        p 48 1.58489319246111E-005
        q 49 1.25892541179417E-005
        r 50 0.00001
        s 51 7.94328234724281E-006
        t 52 6.30957344480192E-006
        u 53 5.01187233627272E-006
        v 54 3.98107170553497E-006
        w 55 3.16227766016838E-006
        x 56 2.51188643150958E-006
        y 57 1.99526231496888E-006
        z 58 1.58489319246111E-006
        { 59 1.25892541179417E-006
        | 60 0.000001
        } 61 7.9432823472428E-007
        ~ 62 6.30957344480193E-007
        Last edited by Sylphide; 02-28-2011, 12:57 AM.
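
For anyone who would rather compute the dictionary than copy it, here is a short sketch that regenerates it. It assumes the offset of 64 and the Phred relation p = 10^(-Q/10), which is what the values in the table follow; the function names are made up for this example.

```python
def error_probability(q):
    """Phred relation: probability that a base with score q is wrong."""
    return 10 ** (-q / 10)


def quality_table(max_q=62, offset=64):
    """Return (ASCII character, numeric score, error probability) rows."""
    return [(chr(q + offset), q, error_probability(q)) for q in range(max_q + 1)]


# Print the table in the same three-column layout as above.
for ch, q, p in quality_table():
    print(f"{ch} {q} {p:.10g}")
```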



        • #5
          for converting SCARF format to fastq

          Originally posted by Sylphide
          I looked for the meaning of illumina quality scores and couldn't find any direct translation so here it is (in case it is of any use to someone else)

          Illumina quality score dictionary :

          text illumina_score
          @ 0
          A 1
          B 2
          .
          .
          .
          hello Sylphide,
          Just to confirm: can I use this conversion table to convert the quality scores in SCARF ASCII format to SCARF numeric, so that I can then use 'fq_all2std.pl' (from the Maq site) to generate standard fastq format? The script assumes the quality scores in the .scarf file are in numeric form, whereas my files have the scores in ASCII form.
          I'm a beginner in sequencing data analysis. Kindly help out.
          thanks
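
A rough sketch of the conversion asked about here, under two assumptions that are not confirmed anywhere in this thread: that a SCARF line has seven colon-separated fields (machine, lane, tile, x, y, sequence, quality), and that the numeric form is the same line with the scores written as space-separated integers. The function name is hypothetical.

```python
def scarf_ascii_to_numeric(line, offset=64):
    """Rewrite a SCARF line so its quality field holds numeric scores.

    Assumes seven colon-separated fields with an ASCII quality string
    last; splitting at most six times keeps that last field intact.
    """
    fields = line.rstrip("\n").split(":", 6)
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    qual = fields[6]
    # Same lookup as the table: numeric score = ASCII code minus offset.
    scores = [str(ord(ch) - offset) for ch in qual]
    fields[6] = " ".join(scores)
    return ":".join(fields)


print(scarf_ascii_to_numeric("M1:7:1:305:1931:GATC:VVU@"))
```

It would be worth checking the output on a few reads against what the downstream script expects before converting a whole file.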



          • #6
            hello
            I'm also a beginner but I'll try to help.
            You can use the conversion table I wrote to convert ASCII to numeric if you want to program it yourself. There must be some tool to make the conversion automatically but I couldn't find any.

            ps : I added the probability for a base to be wrong in my previous message.



            • #7
              hello Sylphide,
              This cleared up my confusion. Basically, what I understood is that standard (Sanger) fastq quality in ASCII is encoded with an offset of 33, whereas Illumina 1.3+ quality has an offset of 64. Now I can parse the .scarf file if I have to.
              There are many tools to convert between qualities, but I know of only one that is free and accepts .scarf input. That's "fq_all2std.pl" from the Maq site.
              Thanks anyway! Your post got me started hunting around for information about quality encoding :-)
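
As an illustration of the two offsets, here is a minimal sketch that re-encodes an offset-64 quality string as offset-33. It assumes the scores are already Phred-scaled (as in Illumina 1.3+), so only the offset changes; older Solexa-scaled scores would also need a numeric conversion that is not shown here. The function name is made up.

```python
def phred64_to_phred33(qual):
    """Re-encode a quality string from offset 64 to offset 33."""
    out = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            # A character below '@' cannot be a Phred score at offset 64.
            raise ValueError(f"character {ch!r} is not valid at offset 64")
        out.append(chr(score + 33))
    return "".join(out)


print(phred64_to_phred33("VVVVUUUU"))
```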
