
**GenoMax** · 04-12-2013, 03:11 AM

Originally posted by ymwur

I am getting Illumina sequencing reads from BGI.
They told me that the version is RTA 1.13.48 and the quality encoding is ASCII 66 to 105.
(a) According to what I have read on Wikipedia and elsewhere, there are generally two types of quality encoding for Illumina: (1) ASCII 64 to 126, Illumina 1.3+; (2) ASCII 33 to 126, Illumina 1.8+.
So I am confused about which type of encoding my data use: ASCII 64 or 33?

Thanks,

Chih

See the "Encoding" section from the Wikipedia Article on FASTQ format: http://en.wikipedia.org/wiki/FASTQ_format. I am quoting two lines from that section below.

Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning. The values 0 and 1 are no longer used and the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator.

Presumably the data you are receiving are in the Illumina 1.5+ format and thus would start with Q-scores at ASCII 66.

Did BGI run this data in the past (more than a year ago), and you are only getting it now?

**kmcarr** · 04-12-2013, 06:35 AM

Originally posted by ymwur

I am getting Illumina sequencing reads from BGI.
They told me that the version is RTA 1.13.48 and the quality encoding is ASCII 66 to 105.
(a) According to what I have read on Wikipedia and elsewhere, there are generally two types of quality encoding for Illumina: (1) ASCII 64 to 126, Illumina 1.3+; (2) ASCII 33 to 126, Illumina 1.8+.
So I am confused about which type of encoding my data use: ASCII 64 or 33?

The various version numbers above refer to different parts of the whole software stack. RTA (Real Time Analysis) is the on-instrument software responsible for taking the data from the raw images to Illumina's BCL files. BCL is a proprietary Illumina binary format containing base calls and quality scores. The current, latest version of RTA is 1.13.48. The version of RTA has no bearing on which ASCII offset is used for quality scores. The ASCII character encoding of quality scores is strictly a feature of the FASTQ file format, and RTA does not produce FASTQ files.

BCL files are most commonly converted to FASTQ files by Illumina's CASAVA pipeline. The current version of CASAVA is 1.8.2. It is this version number that the FASTQ Wikipedia article refers to as the "Illumina" version. (For completeness' sake: Illumina used to call this toolset by names other than CASAVA while keeping the same version number progression, so for simplicity they are all referred to as "Illumina".) If, as BGI says, your FASTQ files have quality scores encoded by ASCII characters in the range 66-105, then it would appear that they are still using a v1.5 toolset. The information in the Wikipedia article is a little misleading with regard to v1.5. The quality values run from 2 (ASCII 66 == 'B'), which indicates a base whose quality score could not be accurately determined, up to 41 (ASCII 105 == 'i'). So the true ASCII range for v1.5 is 66-105. (Why BGI continues to use this older version is a separate question.)
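The arithmetic behind that range is easy to check. Here is a minimal Python sketch; the helper name `phred_from_char` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def phred_from_char(ch: str, offset: int) -> int:
    """Return the Phred score encoded by a single FASTQ quality character."""
    return ord(ch) - offset

# With the Illumina 1.5 offset of 64, 'B' decodes to Q2 and 'i' to Q41.
print(phred_from_char("B", 64), phred_from_char("i", 64))

# Decoding the same characters with offset 33 gives implausibly high
# scores, which is one sign that the wrong offset was assumed.
print(phred_from_char("B", 33), phred_from_char("i", 33))
```

The first line prints `2 41` and the second prints `33 72`.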

But regardless of what BGI (or any other service provider) tells you the encoding is, you can (and probably should) check the FASTQ files yourself to determine the offset. Look at a few lines from the file to see what range of characters is present in the quality strings and compare it to the ranges shown in the Wikipedia article. If you see any digits in the quality strings, that is a clear indication of Phred-33 scores. By contrast, if you see any of the lowercase letters in the range a-i, that indicates Phred-64.
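That check can be scripted. The following is a rough Python sketch, not a definitive detector: the function names and the two thresholds are my own assumptions (no Phred-64 or Solexa-64 character falls below ASCII 59, and Illumina Phred-33 data of this era tops out at Q41, i.e. 'J' = ASCII 74). It also assumes plain four-line FASTQ records.

```python
from itertools import islice

def quality_range(lines, max_reads=10000):
    """Return (lowest, highest) ASCII code seen in the quality lines.

    Assumes 4-line FASTQ records, so every 4th line (starting with the
    4th) is a quality string. Only the first max_reads records are read.
    """
    lo, hi = 255, 0
    for qual in islice(lines, 3, max_reads * 4, 4):
        codes = [ord(c) for c in qual.rstrip("\r\n")]
        if codes:
            lo = min(lo, min(codes))
            hi = max(hi, max(codes))
    return lo, hi

def guess_offset(lo, hi):
    """Guess the Phred offset from the observed ASCII range (heuristic)."""
    if lo < 59:   # e.g. digits or '#': impossible with an offset of 64
        return 33
    if hi > 74:   # e.g. 'a'-'i': above Q41 if the offset were 33
        return 64
    return None   # ambiguous; look at more reads

# Tiny in-memory example with a single Illumina 1.5 style record.
demo = ["@read1", "ACGT", "+", "BBhi"]
print(guess_offset(*quality_range(demo)))
```

The demo prints `64`. For a real file you would pass an open file handle instead of the list, for example `quality_range(open("reads.fq"))`, where `reads.fq` is a placeholder name.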

(b) I want to use BWA and SAMtools to analyze my data. I will run BWA first and then apply SAMtools to the output BAM file.
BWA's default setting is for ASCII 33, but the "-I" option allows data with ASCII 64.
I am wondering which option I should use?

If your FASTQ files are indeed in the v1.5 format then yes, you should use the -I option with BWA.
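If you drive BWA from a script, the choice of flag can follow from the offset you determined. A small Python sketch under stated assumptions: the function name `bwa_aln_command` and the file names are hypothetical, and only the `-I` flag of `bwa aln` discussed in this thread is used.

```python
def bwa_aln_command(reference_prefix, fastq_path, phred_offset):
    """Build the argument list for 'bwa aln', adding -I for Phred-64 input."""
    cmd = ["bwa", "aln"]
    if phred_offset == 64:
        cmd.append("-I")  # tell bwa aln the qualities are ASCII-64 encoded
    cmd += [reference_prefix, fastq_path]
    return cmd

# Placeholder file names, for illustration only.
print(" ".join(bwa_aln_command("ref.fa", "reads.fq", 64)))
```

This prints `bwa aln -I ref.fa reads.fq`. The list could be handed to `subprocess.run`, with standard output redirected to a `.sai` file, since `bwa aln` writes its result to standard output by default.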

SAMtools' default is also ASCII 33 (and "-6" in mpileup allows data with ASCII 64). I have read that BWA converts the qualities in its SAM/BAM output to ASCII 33. So I don't need to worry about the encoding type for the SAMtools analysis as long as the input comes from BWA. Is that correct?

The SAM/BAM file format standard is to represent qualities with Phred-33 offsets so you are correct. Once you have a BAM file produced by BWA you should expect the quality values in it to be Phred-33.
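To make the re-encoding concrete: going from Phred-64 to Phred-33 amounts to shifting every quality character down by 31 (64 - 33). The Python sketch below illustrates that shift; the function name is my own, and this is not BWA's actual code.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred-64 quality string as Phred-33."""
    if any(ord(c) < 64 for c in qual):
        raise ValueError("character below ASCII 64: input is not Phred-64")
    return "".join(chr(ord(c) - 31) for c in qual)

# 'B' (Q2), 'h' (Q40) and 'i' (Q41) become '#', 'I' and 'J'.
print(phred64_to_phred33("Bhi"))
```

This prints `#IJ`, which is how the same scores would look in a Phred-33 FASTQ file or in the QUAL column of a SAM record.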

**ymwur** · 04-12-2013, 06:43 AM

GenoMax, thanks for your info.
You seem to be right. I just ran FastQC on my reads and it suggests that they are in Illumina 1.5 encoding.
I got the reads from BGI late last year (2012) and this year. I was expecting them to be in the newest Illumina format.

My concern is that I am using BWA to map these reads. BWA's default setting is for Illumina 1.8+ (ASCII 33), but it also accepts Illumina 1.3+ (ASCII 64) if I use the "-I" option of its "aln" command. Should I treat my reads as Illumina 1.3+ (ASCII 64) for BWA, since there is no option for Illumina 1.5?

Can someone give me advice?
Thanks a lot,

Chih

**GenoMax** · 04-12-2013, 06:49 AM

As stated by kmcarr in the post above you should use -I with BWA.

**ymwur** · 04-12-2013, 06:53 AM

kmcarr and GenoMax,

Thank you for your clear explanations.
Now I know what to do.

Chih

Illumina RTA 1.13.48 and ASCII 66 to 105

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News