  • base quality encoding changed after "bwa samse" command

    hello,

    Please look at the base quality string in my sample2New.fq file:


    @EBRI093151_0051:4:55:2998:9540#0/1
    ACAACACAGTGGGTTGGAGTAGAGCATCTCCAAAGGCCCTTTCCAATCCAACATGAGTAACTCAAGCTCTGCACCAGCCACGAAAAGGCAAGGCTTTGGAT
    +
    FFFFFFFFFFDFFBFDEAEEEFFFFFFFFCFFEFFCEEEDDFFEEEFEADDFDFDEEDFFE@FCDDD>ACDFADD?CCECDB<?@047:9@?BB+B@@@]]



    After running the commands

    /opt/bwa-0.6.2/bwa index -a bwtsw -p ref reference.fa
    /opt/bwa-0.6.2/bwa aln -t 10 -f sample2New.sai -I ref sample2New.fq
    /opt/bwa-0.6.2/bwa samse -f sample2New.sam -r "@RG\tID:sample2\tPL:ILLUMINA\tPUu1\tLB:sample2\tSM:sample2" ref sample2New.sai sample2New.fq



    I see a changed base quality string in the sample2New.sam file:

    EBRI093151_0051:4:55:2998:9540#0 0 Chr10 377653 0 101M * 0 0 ACAACACAGTGGGTTGGAGTAGAGCATCTCCAAAGGCCCTTTCCAATCCAACATGAGTAACTCAAGCTCTGCACCAGCCACGAAAAGGCAAGGCTTTGGAT ''''''''''%''#'%&"&&&''''''''$''&''$&&&%%''&&&'&"%%'%'%&&%''&!'$%%%^_"$%'"%% $$&$%#^] !^Q^U^XESC^Z! ##^L#!!!>> RG:Z:sample2 XT:A:R NM:i:0 X0:i:3 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101 XA:Z:Chr10,+33,101M,0;Chr10,+242847,101M,0;


    and of course the command

    java -Xmx8g -jar /opt/picard-tools-1.85/SortSam.jar SO=coordinate INPUT=sample2New.sam OUTPUT=sample2New.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true


    fails with the error

    Exception in thread "main" java.lang.IllegalArgumentException: Invalid fastq character:

    Why is "bwa samse" changing the quality encoding?
    Do you have any idea what I'm doing wrong?

    thanks

  • #2
    Could you edit your post to use the [ code ] and [ /code ] tags? This is easily done via the advanced editor view where there is a button for this in the tool bar (not shown in the quick reply edit box).



    • #3
      I've not checked all the bases (due to the forum formatting), however, it would appear to be down to a FASTQ encoding problem. It appears bwa defaulted to assuming the obsolete Illumina specific ASCII encoding of PHRED+64, while your data was actually the original standard Sanger ASCII encoding of PHRED+33 (now adopted by Illumina). For background, see:


      In your FASTQ file, the first base has quality code 'F', ASCII character 70. Under the Sanger FASTQ scheme that means 70-33 = quality 37. However, if read in as the obsolete Illumina scheme it would be 70-64 = quality 6, which, when output again in SAM format (which uses the Sanger FASTQ scheme), becomes 6+33 = ASCII 39 = ' (single quote).
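The arithmetic above can be sketched in a few lines of Python. This only illustrates the offset mix-up and is not bwa's actual code; the function name `misread_as_phred64` is made up for this example.

```python
def misread_as_phred64(qual: str) -> str:
    """Decode each character as PHRED+64, then re-encode it as PHRED+33 (SAM)."""
    # Net effect per character: ASCII code - 64 + 33, i.e. a shift of -31.
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


# 'F' (ASCII 70) is shifted to ASCII 39, the single quote seen in the SAM file.
print(repr(misread_as_phred64("F")))

# Characters below '@' (ASCII 64) end up below ASCII 33, i.e. as control characters.
print(repr(misread_as_phred64("@047:9@")))
```

Any quality character below '@' lands below ASCII 33 after this shift, which would be consistent with the control characters (^Q, ^U, ESC, ...) visible in the SAM line and with Picard rejecting the record.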

      Solution - there is a command line option to tell bwa you have a Sanger style FASTQ file. Use it, otherwise you get a bad SAM/BAM file.



      • #4
        Thank you for your help Maubp,

        Your explanation helped me find the solution.

        The problem was in the "bwa aln" command:
        /opt/bwa-0.6.2/bwa aln -t 10 -f sample2New.sai -I ref sample2New.fq

        From the documentation we can see "-I The input is in the Illumina 1.3+ read format (quality equals ASCII-64). ". So everything is OK when I omit the -I option:

        /opt/bwa-0.6.2/bwa aln -t 10 -f sample2New.sai ref sample2New.fq
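Before deciding whether to pass -I, one could check which offset a FASTQ file most likely uses. The sketch below is a common heuristic rather than anything bwa does itself, and `guess_phred_offset` is a hypothetical helper name: no +64 scheme (Illumina 1.3+ or the older Solexa encoding) can contain a character below ';' (ASCII 59), so any lower character implies PHRED+33.

```python
def guess_phred_offset(quality_strings):
    """Guess the FASTQ quality offset (33 or 64) from the lowest character seen.

    Heuristic only: an all-high-quality PHRED+33 file can be mistaken for +64.
    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(c) for q in quality_strings for c in q)
    # Characters below ';' (ASCII 59) cannot occur in any +64 encoding.
    return 33 if lowest < 59 else 64


# Tail of the quality string from the first post: it contains '+', '0', '4', ...
sample = "CDB<?@047:9@?BB+B@@@]]"
print(guess_phred_offset([sample]))
```

For the read shown in the first post this heuristic points to PHRED+33, which agrees with dropping -I. In practice it makes sense to scan many reads, not just one.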


        Once again, thank you for your help.



        • #5
          Well done - and thank you for posting back with the details for anyone searching about this again in the future.

          (I couldn't remember the details about the switch, and wasn't at a machine where I could quickly check, but this way you'll probably remember the problem and solution.)
