SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
FASTQC guessing wrong quality encoding PFS Bioinformatics 14 05-21-2014 07:41 AM
Quality score after Illumina run - should it be coverted before samtools and gatk? gfmgfm Bioinformatics 9 08-31-2013 10:01 PM
NO indel output from Samtools by using nearly default settings wwmm933 Bioinformatics 1 12-05-2012 03:18 AM
GATK base quality recalibration suppose to keep old and new quality scores? Heisman Bioinformatics 2 10-21-2011 07:40 AM
bowtie command line for Illumina Hiseq 2000 with Illumina 1.5+ quality encoding files rworthi Illumina/Solexa 4 09-28-2011 11:25 AM

Reply
 
Thread Tools
Old 02-28-2011, 09:36 PM   #1
dingxiaofan1
Member
 
Location: Hong kong

Join Date: Jul 2010
Posts: 17
Default Default quality encoding system of SAMTools&GATK

Hi. Everyone. Pls do me a great favor. If anyone knows what is the default encoding system for fastq (Sanger/Illumina etc) uesd in the calculation of SNP scores in SAMtools mpileup & GATK (UnifiedGenotyper) , pls let me know.
Thank you.
dingxiaofan1 is offline   Reply With Quote
Old 03-01-2011, 12:51 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,542
Default

As samtools don't work directly on FASTQ files, your question doesn't really make sense.

The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.
maubp is offline   Reply With Quote
Old 03-01-2011, 12:59 AM   #3
dingxiaofan1
Member
 
Location: Hong kong

Join Date: Jul 2010
Posts: 17
Default

Thank u. Yeah. I mean after the assemly. I have found some non-ref homo SNP positions with very strange nt distribution after using SAMtools mpileup & GATK to do the SNP call. Such like :
the ref is A
the reads at that positions could be 50 A and 10 T, but the caller call it alle:T and genotype 1/1. I have view the quality at that position under samtools tview. Indeed, most of the A is in blue or green color. So I think maybe I should not transfer the quality from Illumina to sanger (the default encoding system is Illumina > 1.3).


Quote:
Originally Posted by maubp View Post
As samtools don't work directly on FASTQ files, your question doesn't really make sense.

The read mapper or assembler used to make the SAM/BAM files (e.g. BWA, MAQ) may support various FASTQ encodings but they should all output the same encoding in the SAM/BAM output. The SAM plain text file format uses a Sanger FASTQ like encoding of quality scores, while BAM files store the qualities in binary format.
dingxiaofan1 is offline   Reply With Quote
Old 03-02-2011, 03:56 AM   #4
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?
Bruins is offline   Reply With Quote
Old 03-02-2011, 09:52 AM   #5
dingxiaofan1
Member
 
Location: Hong kong

Join Date: Jul 2010
Posts: 17
Default

Quote:
Originally Posted by Bruins View Post
just a thought, but are you perhaps confusing base call scores, mapping scores and SNP call scores, all 3 of which are represented as Phred scores?
In my opinion, all of them may be presented in a quite similar way, like Phred score. And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.
dingxiaofan1 is offline   Reply With Quote
Old 03-03-2011, 01:49 AM   #6
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

Quote:
Originally Posted by dingxiaofan1 View Post
In my opinion, all of them may be presented in a quite similar way, like Phred score.
They are.
Quote:
Originally Posted by dingxiaofan1 View Post
And the SNP call scores should be calculated based on the other two, so the quality score do have a great influence on the final snp call.
True.

I haven't used samtools tview so I don't know what blue or green means.

I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

Does that make sense?
Bruins is offline   Reply With Quote
Old 03-03-2011, 02:18 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,542
Default

Quote:
Originally Posted by Bruins View Post
This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.
Yes, but it is not a default - the SAM spec is explicit that read quality scores are PHRED scores stored in ASCII with an offset of 33 (just like Sanger FASTQ).
maubp is offline   Reply With Quote
Old 03-03-2011, 02:51 AM   #8
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P
Bruins is offline   Reply With Quote
Old 03-03-2011, 03:07 AM   #9
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,542
Default

Quote:
Originally Posted by Bruins View Post
you mean that it's not a default because it is always the case? That's what I expected, I gues I just used the word default wrong :P
Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.
maubp is offline   Reply With Quote
Old 03-03-2011, 04:13 AM   #10
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

Quote:
Originally Posted by maubp View Post
Yes - it should always be the case. However, if you use Solexa/Illumina FASTQ files with a mapper that expects Sanger FASTQ, you'll get inflated quality scores. Such a SAM file would be invalid since it would effectively be using the Solexa/Illumina encoding (and all derived scores like mapping quality would also be inflated). That would be bad.
Quote:
Originally Posted by Bruins View Post
So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.
dingxiaofan1, is this what you mean?
Bruins is offline   Reply With Quote
Old 03-03-2011, 11:03 PM   #11
dingxiaofan1
Member
 
Location: Hong kong

Join Date: Jul 2010
Posts: 17
Default

Thank u so much. And very sorry for my late reply. Yes, that is what I mean. But I had checked fastq wiki and my file carefully. quality fields in Illumina fastq format is not the same as Sanger fastq.
Sanger can encode a Phred quality score from 0 to 93 using ASCII 33 to 126, while Illumina 1.3+ encode from 0-62 using 64-126.
Generally, the quality per base for Illumina is much higher than that of Sanger with the same ASCII character. So transfer or not means a lot if SAMtools and GATK using the Phred 33 as default.


Quote:
Originally Posted by Bruins View Post
dingxiaofan1, is this what you mean?
Quote:
Originally Posted by Bruins View Post
They are.

True.

I haven't used samtools tview so I don't know what blue or green means.

I'm not entirely sure I understand your question, so pardon me if I'm telling you stuff you already know.

The Phred scores for sanger fastqc and Illumina pipeline > 1.3 are calculated using the same formula, Illumina < 1.3 with a different formula (goolge for the wiki about this, it's out there). Then the scores are represented as an ASCII character. This is done in 2 different ways: adding 64 to get an ASCII character (I believe Illumina does this), or adding 33 to get a character (the sanger fastqc default).

Aligners like Bowtie and BWA use these per base quality scores. They may be able to deduce whether it's Phred64 or Phred33, I'm not sure. I do know BWA has an option to tell it which scheme was used.

Then the aligners calculate a mapping score, a log odds score to see the likelihood that the mapping was wrong. I think the same formula is used as the base calling one. This mapping score is represented in the SAM file as an ASCII character, just like the base quality in the fastq file. I think in SAM, the default is Phred33, but you should be able to check this in the sam specs or whatever. The samtools and GATK SNP callers know this, you don't have to think about this.

The same goes for the SNP call scores: likelihood the call is wrong. To calculate this score, mapping scores are used among other things. Your example shows a coverage of say 60 nt at the SNP position: 50 A's and 10 T's. The SNP caller checks all 60 and it may be that most of the 50 A's have a very low mapping quality or fail other checks the caller does. So in the end this SNP is supported by say 13 nt: 10 T's and 3 A's, the other 47 nt not being used in the call.

So to conclude: it is important that the aligner knows the Phred scheme used in the fastq files. After that, you don't have to worry about the scheme.

Does that make sense?
dingxiaofan1 is offline   Reply With Quote
Old 03-03-2011, 11:27 PM   #12
dingxiaofan1
Member
 
Location: Hong kong

Join Date: Jul 2010
Posts: 17
Default

Maybe SAMTools and GATK are using Phred 33 as the default system. But I cannot find any options we could choose from the SNP callers. Also I cannot change the threshold score for base calculation (base Phred >=30 pass ? ). I think it is a bit too strict. And by allowing us using different threshold scores, we may find the best threshold suitable for our own data with less errors.
dingxiaofan1 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:37 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO