
  • Illumina RTA 1.13.48 and ASCII 66 to 105

    I am getting Illumina sequencing reads from BGI.
    They told me that the version is RTA 1.13.48 and the quality code is ASCII 66 to 105.
    (a) According to what I have read on Wikipedia and elsewhere, there should generally be two types of quality encoding for Illumina: (1) ASCII 64 to 126, Illumina 1.3+; (2) ASCII 33 to 126, Illumina 1.8+.
    So I am confused about which type of encoding my data uses: ASCII 64 or 33?

    (b) I want to use BWA and SAMtools to analyze my data. I will run BWA first and then apply SAMtools to the output BAM file.
    BWA's default setting is for ASCII 33, but the "-I" option allows data with ASCII 64.
    Which option should I use?
    SAMtools' default is also ASCII 33 (and "-6" in mpileup allows data with ASCII 64). I have read that BWA converts the qualities in its SAM/BAM output to ASCII 33. So I don't need to worry about the encoding type for the SAMtools analysis as long as the input comes from BWA. Is that correct?

    Thanks,

    Chih

  • #2
    Originally posted by ymwur
    I am getting Illumina sequencing reads from BGI.
    They told me that the version is RTA 1.13.48 and the quality code is ASCII 66 to 105.
    (a) According to what I have read on Wikipedia and elsewhere, there should generally be two types of quality encoding for Illumina: (1) ASCII 64 to 126, Illumina 1.3+; (2) ASCII 33 to 126, Illumina 1.8+.
    So I am confused about which type of encoding my data uses: ASCII 64 or 33?

    Thanks,

    Chih
    See the "Encoding" section from the Wikipedia Article on FASTQ format: http://en.wikipedia.org/wiki/FASTQ_format. I am quoting two lines from that section below.

    Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning. The values 0 and 1 are no longer used and the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator.
    Presumably the data you are receiving is in the Illumina 1.5+ format, so its quality strings would start at ASCII 66.

    Did BGI run these samples in the past (more than a year ago), and you are only getting the data now?



    • #3
      Originally posted by ymwur
      I am getting Illumina sequencing reads from BGI.
      They told me that the version is RTA 1.13.48 and the quality code is ASCII 66 to 105.
      (a) According to what I have read on Wikipedia and elsewhere, there should generally be two types of quality encoding for Illumina: (1) ASCII 64 to 126, Illumina 1.3+; (2) ASCII 33 to 126, Illumina 1.8+.
      So I am confused about which type of encoding my data uses: ASCII 64 or 33?
      The various version numbers above refer to different parts of the whole software stack. RTA (Real Time Analysis) is the on-instrument software responsible for taking the data from the raw images to Illumina's BCL format files. BCL files are proprietary Illumina binary files containing base calls and quality scores. The current, latest version of RTA is 1.13.48. The version of RTA has no bearing on which ASCII offset is used for quality scores. The ASCII character encoding of quality scores is strictly a feature of the FASTQ file format, and RTA does not produce FASTQ files.

      BCL files are most commonly converted to FASTQ files by Illumina's CASAVA pipeline. The current version of CASAVA is 1.8.2. It is this version number which the FASTQ Wikipedia article refers to as the "Illumina" version. (For completeness' sake: Illumina used to call this toolset by names other than CASAVA while maintaining the same version number progression, so for simplicity they are all referred to as "Illumina".) If, as BGI says, your FASTQ files will have quality scores encoded by ASCII characters in the range 66-105, then it would appear that they are still using a v1.5 toolset. The information in the Wikipedia article is a little misleading with regard to v1.5. The quality values may run from '2' (ASCII 66 == 'B'), meant to indicate a base whose quality score could not be accurately determined, up to '41' (ASCII 105 == 'i'). So the true ASCII range for v1.5 is 66-105. (Why BGI continues to use this older version is a separate question.)
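
      As a small illustration of the arithmetic above (a sketch, not part of any Illumina tool): with a +64 offset, a quality value Q is stored as the character whose ASCII code is Q + 64. The helper names below are made up for this example.

```python
def decode_quality(ch, offset=64):
    """Return the Phred quality value encoded by one FASTQ quality character."""
    return ord(ch) - offset


def encode_quality(q, offset=64):
    """Return the FASTQ quality character for a Phred quality value."""
    return chr(q + offset)


# The two ends of the v1.5 range mentioned in this post.
low = decode_quality("B")    # ASCII 66 minus 64
high = decode_quality("i")   # ASCII 105 minus 64
```

      Here `low` works out to 2 and `high` to 41, matching the 66-105 range BGI quoted.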

      But regardless of what BGI (or any other service provider) tells you the encoding is, you can (and probably should) check the FASTQ files yourself to determine the offset. Look at a few lines from the file to see what range of characters is present in the quality strings and compare it to the ranges shown in the Wikipedia article. If you see any digits in the quality strings, that is a clear indication of Phred-33 scores. By contrast, if you see any of the lowercase letters in the range a-i, that indicates Phred-64.
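
      A minimal Python sketch of that check, using only the two rules of thumb above (digits point to Phred-33, lowercase a-i point to Phred-64). The function names are hypothetical, and it assumes plain four-line FASTQ records. It returns None when the sampled characters do not settle the question.

```python
from itertools import islice
import string


def quality_strings(lines, max_reads=1000):
    """Yield the quality line (4th line) of each 4-line FASTQ record."""
    for i, line in enumerate(islice(lines, 4 * max_reads)):
        if i % 4 == 3:
            yield line.rstrip("\n")


def guess_phred_offset(qualities):
    """Guess the quality offset (33 or 64) from the characters observed."""
    seen = set("".join(qualities))
    has_digit = any(ch in string.digits for ch in seen)
    has_lower = any(ch in "abcdefghi" for ch in seen)
    if has_digit and not has_lower:
        return 33   # digits sit below ASCII 64, so they cannot be Phred-64
    if has_lower and not has_digit:
        return 64   # a-i are the high-quality characters of Phred-64
    return None     # ambiguous or contradictory sample
```

      For a real file you could call it as `with open("reads.fastq") as fh: print(guess_phred_offset(quality_strings(fh)))`, where `reads.fastq` is a placeholder file name.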

      (b) I want to use BWA and SAMtools to analyze my data. I will run BWA first and then apply SAMtools to the output BAM file.
      BWA's default setting is for ASCII 33, but the "-I" option allows data with ASCII 64.
      Which option should I use?
      If your FASTQ files are indeed in the v1.5 format then yes, you should use the -I option with BWA.

      SAMtools' default is also ASCII 33 (and "-6" in mpileup allows data with ASCII 64). I have read that BWA converts the qualities in its SAM/BAM output to ASCII 33. So I don't need to worry about the encoding type for the SAMtools analysis as long as the input comes from BWA. Is that correct?
      The SAM/BAM file format standard is to represent qualities with Phred-33 offsets so you are correct. Once you have a BAM file produced by BWA you should expect the quality values in it to be Phred-33.
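
      To make the offset difference concrete, here is a rough sketch of re-encoding a Phred-64 quality string as Phred-33. It only illustrates the arithmetic (subtract 31 from each character code). It is not BWA's actual code, and the function name is invented for this example.

```python
def phred64_to_phred33(qual):
    """Re-encode a Phred-64 quality string as Phred-33."""
    shift = 64 - 33
    out = []
    for ch in qual:
        code = ord(ch) - shift
        if code < 33:
            # A character this low cannot be a valid Phred-64 quality.
            raise ValueError(f"{ch!r} is not a valid Phred-64 quality character")
        out.append(chr(code))
    return "".join(out)


converted = phred64_to_phred33("Bhi")
```

      With the input "Bhi" (Q2, Q40, Q41 in Phred-64), `converted` is "#IJ", the same three quality values written with the 33 offset.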



      • #4
        GenoMax, thanks for your info.
        You are probably right. I just ran FastQC on my reads and it suggests the reads are in Illumina 1.5 format.
        I got the reads late last year (2012) and this year from BGI. I was expecting them to be in the newest Illumina format.

        My concern is that I am using BWA to map these reads. BWA's default setting is for Illumina 1.8+ (ASCII 33), but it also accepts Illumina 1.3+ (ASCII 64) if I use the "-I" option of its "aln" command. Should I identify my reads as Illumina 1.3+ (ASCII 64) for BWA, since there is no option for Illumina 1.5?

        Can someone give me advice?
        Thanks a lot,

        Chih



        • #5
          Originally posted by ymwur
          GenoMax, thanks for your info.
          You are probably right. I just ran FastQC on my reads and it suggests the reads are in Illumina 1.5 format.
          I got the reads late last year (2012) and this year from BGI. I was expecting them to be in the newest Illumina format.

          My concern is that I am using BWA to map these reads. BWA's default setting is for Illumina 1.8+ (ASCII 33), but it also accepts Illumina 1.3+ (ASCII 64) if I use the "-I" option of its "aln" command. Should I identify my reads as Illumina 1.3+ (ASCII 64) for BWA, since there is no option for Illumina 1.5?

          Can someone give me advice?
          Thanks a lot,

          Chih
          As stated by kmcarr in the post above, you should use -I with BWA.



          • #6
            kmcarr and GenoMax,

            Thank you for your clear explanations.
            Now I know what to do.

            Chih
