Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • base quality encoding changed after tophat2 mapping

    fastq
    Code:
    @FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
    TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
    +
    ___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh
    after tophat2 mapping, the fastq base quality was changed into:
    Code:
    FCC22UBACXX:2:1115:5744:8409#ACACGCGG   256     EQ110773        344     3       49M     *       0       0       ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD       AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351   CP:i:3745       HI:i:0
    also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

    Any idea?
    Thank you.

  • #2
    Your argument that TopHat has changed the quality scores would be more convincing if you posted the same sequence before and after alignment with TopHat.
    You've posted two different sequences, with obviously different quality scores.
    Can you post the same sequence, before and after alignment with TopHat?

    TopHat2 does have optional arguments to handle quality scores in the Solexa scale, incidentally.

    --solexa-quals Use the Solexa scale for quality values in FASTQ files.
    --solexa1.3-quals As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later.

    Comment


    • #3
      Originally posted by pengchy View Post
      fastq
      Code:
      @FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
      TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
      +
      ___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh
      after tophat2 mapping, the fastq base quality was changed into:
      Code:
      FCC22UBACXX:2:1115:5744:8409#ACACGCGG   256     EQ110773        344     3       49M     *       0       0       ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD       AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351   CP:i:3745       HI:i:0
      also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

      Any idea?
      Thank you.
      Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.

      Comment


      • #4
        Very informative answer kmcarr.
        Much better than mine.

        Comment


        • #5
          Originally posted by kmcarr View Post
          Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.
          Thank kmcarr and blancha's reply.

          More question:
          When I use the fastq with the base quality like this:
          Code:
          @FCC22UBACXX:2:1101:1463:2233#ACACGCGG
          TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
          +
          @@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII
          the output bam will be like this:
          Code:
          FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
          FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0       AF024514.1      15      50      49M     *       0       0       TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT     $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%       AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30       YT:Z:UU NH:i:1
          Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?

          Comment


          • #6
            Originally posted by pengchy View Post
            Thank kmcarr and blancha's reply.

            More question:
            When I use the fastq with the base quality like this:
            Code:
            @FCC22UBACXX:2:1101:1463:2233#ACACGCGG
            TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
            +
            @@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII
            the output bam will be like this:
            Code:
            FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
            FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0       AF024514.1      15      50      49M     *       0       0       TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT     $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%       AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30       YT:Z:UU NH:i:1
            Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?
            Pengchy,

            Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

            Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

            What version of Tophat are you using?

            When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?

            Comment


            • #7
              Originally posted by kmcarr View Post
              Pengchy,

              Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

              Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

              What version of Tophat are you using?

              When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?
              Hi kmcarr, thank you for the reply.

              Actually, I have done my work in the steps:
              1. mapped fastq of the reads with quality solexa 1.3, like:
              Code:
              @FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
              ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
              +
              a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi
              the read was mapped by tophat2 with parameters --solexa1.3-quals

              2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:
              Code:
              @FCC22UBACXX:2:2112:11384:88500#ACACGCGG
              ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
              +
              B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ
              the base quality characters of bamToFastq output is the same to the "samtools view".

              3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:
              Code:
              /path/to/tophat2 --solexa-quals --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz
              The "samtools view " of the output bam file give the following base quality like
              Code:
              FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
              Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

              I don't know this phenomenon is caused by Tophat2 or "samtools view".

              I have correctly set the --solexa-quals for tophat2.

              We can have a check for the first character of these reads:
              the original: "a" == 97
              the first mapping result: "B"== 66
              the second mapping result: "%" == 37

              We can got that the first transformation has minus the ASCII by 31 and the second by 29.

              I guess the "samtools view" always transform the input bam ASCII by minus some value.

              Comment


              • #8
                Originally posted by pengchy View Post
                Hi kmcarr, thank you for the reply.

                Actually, I have done my work in the steps:
                1. mapped fastq of the reads with quality solexa 1.3, like:
                Code:
                @FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
                ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
                +
                a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi
                the read was mapped by tophat2 with parameters --solexa1.3-quals

                2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:
                Code:
                @FCC22UBACXX:2:2112:11384:88500#ACACGCGG
                ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
                +
                B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ
                the base quality characters of bamToFastq output is the same to the "samtools view".

                3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:
                Code:
                /path/to/tophat2 [COLOR="Red"]--solexa-quals[/COLOR] --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz
                The "samtools view " of the output bam file give the following base quality like
                Code:
                FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
                Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

                I don't know this phenomenon is caused by Tophat2 or "samtools view".

                I have correctly set the --solexa-quals for tophat2 (No, this is incorrect).

                We can have a check for the first character of these reads:
                the original: "a" == 97
                the first mapping result: "B"== 66
                the second mapping result: "%" == 37

                We can got that the first transformation has minus the ASCII by 31 and the second by 29.

                I guess the "samtools view" always transform the input bam ASCII by minus some value.
                So if I am following your logic you:

                1. Map your original reads to some reference with Tophat2.

                2. Extract the unmapped reads from unmapped.bam to a new FastQ

                3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

                I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.

                Comment


                • #9
                  Originally posted by kmcarr View Post
                  So if I am following your logic you:

                  1. Map your original reads to some reference with Tophat2.

                  2. Extract the unmapped reads from unmapped.bam to a new FastQ

                  3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

                  I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.
                  thank you kmcarr, as you found, I have wrongly set the quality parameter for tophat2.

                  Your suggestion is very useful and help me a lot.

                  Best,
                  Pengcheng

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Advancing Precision Medicine for Rare Diseases in Children
                    by seqadmin




                    Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                    12-16-2024, 07:57 AM
                  • seqadmin
                    Recent Advances in Sequencing Technologies
                    by seqadmin



                    Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                    Long-Read Sequencing
                    Long-read sequencing has seen remarkable advancements,...
                    12-02-2024, 01:49 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 12-17-2024, 10:28 AM
                  0 responses
                  33 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-13-2024, 08:24 AM
                  0 responses
                  49 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-12-2024, 07:41 AM
                  0 responses
                  34 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 12-11-2024, 07:45 AM
                  0 responses
                  46 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X