Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dingxiaofan1
    Member
    • Jul 2010
    • 17

    Default quality encoding system of SAMTools&GATK

    Hi. Everyone. Pls do me a great favor. If anyone knows what is the default encoding system for fastq (Sanger/Illumina etc) uesd in the calculation of SNP scores in SAMtools mpileup & GATK (UnifiedGenotyper) , pls let me know.
    Thank you.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    As samtools don't work directly on FASTQ files, your question doesn't really make sense.

    The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

    Comment

    • dingxiaofan1
      Member
      • Jul 2010
      • 17

      #3
      Thank u. Yeah. I mean after the assemly. I have found some non-ref homo SNP positions with very strange nt distribution after using SAMtools mpileup & GATK to do the SNP call. Such like :
      the ref is A
      the reads at that positions could be 50 A and 10 T, but the caller call it alle:T and genotype 1/1. I have view the quality at that position under samtools tview. Indeed, most of the A is in blue or green color. So I think maybe I should not transfer the quality from Illumina to sanger (the default encoding system is Illumina > 1.3).


      Originally posted by maubp View Post
      As samtools don't work directly on FASTQ files, your question doesn't really make sense.

      The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

      Comment

      • Bruins
        Member
        • Feb 2010
        • 78

        #4
        just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?

        Comment

        • dingxiaofan1
          Member
          • Jul 2010
          • 17

          #5
          Originally posted by Bruins View Post
          just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?
          In my opinion, all of them may be presented in a quite similar way, like Phred score. And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.

          Comment

          • Bruins
            Member
            • Feb 2010
            • 78

            #6
            Originally posted by dingxiaofan1 View Post
            In my opinion, all of them may be presented in a quite similar way, like Phred score.
            They are.
            Originally posted by dingxiaofan1 View Post
            And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.
            True.

            I haven't used samtools tview so I don't know what blue or green means.

            I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

            The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

            Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

            Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

            The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

            So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

            Does that make sense?

            Comment

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Originally posted by Bruins View Post
              This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.
              Yes, but it is not a default - the SAM spec is explicit that read quality scores are PHRED scores stored in ASCII with an offset of 33 (just like Sanger FASTQ).

              Comment

              • Bruins
                Member
                • Feb 2010
                • 78

                #8
                you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P

                Comment

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  Originally posted by Bruins View Post
                  you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P
                  Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.

                  Comment

                  • Bruins
                    Member
                    • Feb 2010
                    • 78

                    #10
                    Originally posted by maubp View Post
                    Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.
                    Originally posted by Bruins View Post
                    So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.
                    dingxiaofan1, is this what you mean?

                    Comment

                    • dingxiaofan1
                      Member
                      • Jul 2010
                      • 17

                      #11
                      Thank u so much. And very sorry for my late reply. Yes, that is what I mean. But I had checked fastq wiki and my file carefully. quality fields in Illumina fastq format is not the same as Sanger fastq.
                      Sanger can encode a Phred quality score from 0 to 93 using ASCII 33 to 126, while Illumina 1.3+ encode from 0-62 using 64-126.
                      Generally, the quality per base for Illumina is much higher than that of Sanger with the same ASCII character. So transfer or not means a lot if SAMtools and GATK using the Phred 33 as default.


                      Originally posted by Bruins View Post
                      dingxiaofan1, is this what you mean?
                      Originally posted by Bruins View Post
                      They are.

                      True.

                      I haven't used samtools tview so I don't know what blue or green means.

                      I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

                      The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

                      Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

                      Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

                      The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

                      So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

                      Does that make sense?

                      Comment

                      • dingxiaofan1
                        Member
                        • Jul 2010
                        • 17

                        #12
                        Maybe SAMTools and GATK are using Phred 33 as the default system. But I cannot find any options we could choose from the SNP callers. Also I cannot change the threshold score for base calculation (base Phred >=30 pass ? ). I think it is a bit too strict. And by allowing us using different threshold scores, we may find the best threshold suitable for our own data with less errors.

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by SEQadmin2


                          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                          Here are nine questions we think about, in roughly the order they matter, before...
                          06-18-2026, 07:11 AM
                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, 06-17-2026, 06:09 AM
                        0 responses
                        41 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-09-2026, 11:58 AM
                        0 responses
                        102 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-05-2026, 10:09 AM
                        0 responses
                        123 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-04-2026, 08:59 AM
                        0 responses
                        114 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...