Unconfigured Ad

**maubp** · 03-01-2011, 01:51 AM

As samtools don't work directly on FASTQ files, your question doesn't really make sense.

The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

**dingxiaofan1** · 03-01-2011, 01:59 AM

Thank u. Yeah. I mean after the assemly. I have found some non-ref homo SNP positions with very strange nt distribution after using SAMtools mpileup & GATK to do the SNP call. Such like :
the ref is A
the reads at that positions could be 50 A and 10 T, but the caller call it alle:T and genotype 1/1. I have view the quality at that position under samtools tview. Indeed, most of the A is in blue or green color. So I think maybe I should not transfer the quality from Illumina to sanger (the default encoding system is Illumina > 1.3).

Originally posted by maubp View Post

As samtools don't work directly on FASTQ files, your question doesn't really make sense.

The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.

**Bruins** · 03-02-2011, 04:56 AM

just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?

**dingxiaofan1** · 03-02-2011, 10:52 AM

Originally posted by Bruins View Post

just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?

In my opinion, all of them may be presented in a quite similar way, like Phred score. And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.

**Bruins** · 03-03-2011, 02:49 AM

Originally posted by dingxiaofan1 View Post

In my opinion, all of them may be presented in a quite similar way, like Phred score.

They are.

Originally posted by dingxiaofan1 View Post

And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.

True.

I haven't used samtools tview so I don't know what blue or green means.

I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

Does that make sense?

**maubp** · 03-03-2011, 03:18 AM

Originally posted by Bruins View Post

This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

Yes, but it is not a default - the SAM spec is explicit that read quality scores are PHRED scores stored in ASCII with an offset of 33 (just like Sanger FASTQ).

**Bruins** · 03-03-2011, 03:51 AM

you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P

**maubp** · 03-03-2011, 04:07 AM

Originally posted by Bruins View Post

you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P

Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.

**Bruins** · 03-03-2011, 05:13 AM

Originally posted by maubp View Post

Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.

Originally posted by Bruins View Post

So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

dingxiaofan1, is this what you mean?

**dingxiaofan1** · 03-04-2011, 12:03 AM

Thank u so much. And very sorry for my late reply. Yes, that is what I mean. But I had checked fastq wiki and my file carefully. quality fields in Illumina fastq format is not the same as Sanger fastq.
Sanger can encode a Phred quality score from 0 to 93 using ASCII 33 to 126, while Illumina 1.3+ encode from 0-62 using 64-126.
Generally, the quality per base for Illumina is much higher than that of Sanger with the same ASCII character. So transfer or not means a lot if SAMtools and GATK using the Phred 33 as default.

Originally posted by Bruins View Post

dingxiaofan1, is this what you mean?

Originally posted by Bruins View Post

They are.

True.

I haven't used samtools tview so I don't know what blue or green means.

I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

Does that make sense?

**dingxiaofan1** · 03-04-2011, 12:27 AM

Maybe SAMTools and GATK are using Phred 33 as the default system. But I cannot find any options we could choose from the SNP callers. Also I cannot change the threshold score for base calculation (base Phred >=30 pass ? ). I think it is a bit too strict. And by allowing us using different threshold scores, we may find the best threshold suitable for our own data with less errors.

Topics	Statistics	Last Post
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 41 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 102 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 123 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 114 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM

Unconfigured Ad

Default quality encoding system of SAMTools&GATK

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News