Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Qualities in pileup disagree with qualities in fastq

    I have a pileup from samtools (an mpileup too, but they agree with each other) that shows very odd quality scores. My fastq and .sam agree that the quality scores of the letters in the read range from B to h, which is normal.

    The fastq and the .sam agree that the quality of the reads falling across a certain locus are quite high. But the quality scores of the pileup are strange. They range from ! to h, and certain positions are uniformly low, sometimes extremely so.

    Here's a sample: The mixed SNP are 4107 looks awful based on the pileup, but going back to the original reads, the quality is great for C's and A's in that position. When I use an older version of samtools (0.1.7) on the same .bam, the pileup qualities are correct. When I use pileup in 0.1.12a, it looks like it does below, and the qualities are wrong.

    chr2 69574093 G G 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. QhhfchhhhhhhhhhahhghhhhhehNN
    chr2 69574094 A A 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. OhhgShhhhhhfhhc]fgghhhhfehNN
    chr2 69574095 G G 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. NhhgahhhhhhhhhhbhhghhhhhchOO
    chr2 69574096 A A 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. NhhfWhhhghhhhhh`hhgghhhhfhQQ
    chr2 69574097 G G 114 0 36 29 .$,...,.,..,.,,..,.,,.,,..,..^F. DhhgahhhhhhhehhadhghhhhhahRRD
    chr2 69574098 A A 111 0 36 28 ,...,.,..,.,,..,.,,.,,..,... hhdYhhhghhhehhahgghhhggchUUI
    chr2 69574099 T T 111 0 36 28 ,...,.,..,.,,..,.,,.,,..,... aaaaaaaaaaaaaaaaaaaaaaaaaaa_
    chr2 69574100 T T 114 0 36 29 ,...,.,..,.,,..,.,,.,,..,...^F, _aaaaaaaaaaaaaaaaaaaaaaaaaaaE
    chr2 69574101 G G 117 0 36 30 ,...,.,..,.,,..,.,,.,,..,...,^F, UhgchhhhhhhehhcchfhhhhhfhhhgUE
    chr2 69574102 A A 123 0 36 32 ,$...,.,..,.,,..,.,,.,,..,...,,^F.^:, >EMY\\O\\\O5OPPPO\O\\POP\OS\\U?@
    chr2 69574103 A A 120 0 36 31 ...,.,..,.,,..,.,,.,,..,...,,., BJZZZLZZZL2LMMMLZLZZMLMZLPZZ[AA
    chr2 69574104 A A 103 0 36 31 ...,.,..,.c,..,.,,.,,..,...,,., AFZZZJZZZK0KKKKKZKZZKKKZKNZZ[BB
    chr2 69574105 A A 120 0 36 32 .$..,.,..,.,,..,.,,.,,..,...,,.,^:. ?:\\\?\\\@5@@@@@\@\\@@@\@D\\]E?#
    chr2 69574106 C C 75 0 37 33 ..,.,..,.,,..,.,,.,,..,...,,.,.^:.^F, %^^a!aaa!J!!!!!a!aa!!!a!'aaaa!#!B
    chr2 69574107 C C 78 0 37 35 A$.,.a..,A,aAAaA,a.,aAA,AA.,,.aAA,^F.^:a !UUa!aaa!\!!!!!`!aa!!!a!%aaaa!!!EE!
    chr2 69574108 A A 78 0 37 34 .$,$.,..,.,,..,.,,.,,..,...,,.,..,., CCU!ehh!h!!!!!Z!hh!!!h!'ghhh!!$cU!
    chr2 69574109 G G 123 0 35 32 .,..,.,,..,.,,.,,..,...,,.,..,., R4bhhPhPPPPPgPhhQPQhPTghhhQGPhe?
    chr2 69574110 A A 123 0 35 32 .$,$..,.,,..,.,,.,,..,...,,.,..,., >4>\\P\PPPPP\P\\QPP\PT\\\\PPR\\A
    chr2 69574111 A A 117 0 35 30 .$.,.,,..,.,,.,,..,...,,.,..,., ;ZZVZVVVVVZVZZWVWZVXZZZZWUYZZD
    chr2 69574112 A A 114 0 35 29 .,.,,..,.,,.,,..,...,,.,..,., YYYZYYYYYZYZZYYYZYZZZZZYYYZZO
    chr2 69574113 A A 117 0 35 30 .,.,,..,.,,.,,..,...,,.,..,.,^F. TT\\\\\\\\\\\\\\\\\\\\\\\\\\SE
    chr2 69574114 C C 117 0 35 30 .,.,,..,.,,.,,..,...,,.,..,.,. RRhhhhgghghhhahZghhchhhfhhhhhR
    chr2 69574115 A A 117 0 35 30 .$,$.,,..,.,,.,,..,...,,.,..,.,. DDhhhhghgahhhgeVhhhghhhfhhghhU
    chr2 69574116 G G 114 0 35 29 .,,..,.,,.,,..,...,,.,..,.,.^F. hhbh`fhghhhhhOhfhghhgahhhhhhE
    chr2 69574117 A A 117 0 35 30 .,,..,.,,.,,..,...,,.,..,.,..^:, ah_cchhghhhdgThghghhhdhhhhhhU@
    chr2 69574118 T T 111 0 35 30 .,,..,.,,.,,.C,...,,.,..,.,.., Bh2PQQQgQhgQ+-hQQahhhVQQhhQhbB
    chr2 69574119 G G 114 0 35 31 .,,..,.,,.,,..,...,,.,..,.,..,^F. 0e/LMMMgMghM'-hMMghhhRMMfhMheAB
    chr2 69574120 C C 72 0 37 32 .,,..,.,,.,,..,...,,.,..,.,..,.^F. ,E+!!!!a!aa!$+a!!aaaa'!!aa!aa!BB
    chr2 69574121 C C 75 0 37 33 T$,$tTTtT,t.,tTT,TT.,,.tTT,.t..t..^F. !A!!!!!a!aa!!+a!!aaaa'!!aa!aa!BEE
    chr2 69574122 G G 117 0 35 32 ,$..,.,,.,,..,...,,.,..,.,..,...^:. !;<;;g;hh;!0h;;ahhh@;;gh;hh;BbRE
    chr2 69574123 C C 117 0 35 31 ..,.,,.,,..,...,,.,..,.,..,.... ;<;;g;hh;!2h;;dggh@;;hh;hh;BeUQ
    chr2 69574124 T T 120 0 34 32 ..,.,,.,,C.,...,,.,..,.,..,....^:. AQQQgQhhQ!7hQQdhhhVQQhhQhhPBgheE
    chr2 69574125 A A 126 0 34 33 .$.,.,,.,,..,...,,.,..,.,..,.....^:. @dhhfhghfNahhfdghhdhhhhhhhhBeghUE
    Last edited by swbarnes2; 01-10-2011, 03:24 PM. Reason: adding more information

  • #2
    I have a very similar problem and it seems to affect not only the pileup output but the VCF as well.

    I have a site which appears to be a SNP on visual inspection - all five reads at the location are T rather than the reference C. The base qualities at the site are very high in the fastq and SAM. The mapping qualities for the reads are also very good. The pileup however claims that the qualities of the bases at that site are all 0.

    The VCF fails to recognize the existence of an alternate base at this location and has all '0' values for the I16 and PL fields. The '0' values also occur at the following two bases. This is not the output of bcftool's variant detection, this is just the raw output of samtools mpileup -ugf.

    Here's what the VCF looks like (putative SNP is at 24,335,869):

    Code:
    Chr5    24335868        .       G       X       0       .       DP=6;I16=5,1,0,0,172,5060,0,0,222,8214,0,0,78,1796,0,0  PL      0,18,130
    Chr5    24335869        .       C       X       0       .       DP=5;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0        PL      0,0,0
    Chr5    24335870        .       T       X       0       .       DP=4;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0        PL      0,0,0
    Chr5    24335871        .       T       X       0       .       DP=4;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0        PL      0,0,0
    Chr5    24335872        .       C       X       0       .       DP=4;I16=3,1,0,0,220,12100,0,0,148,5476,0,0,76,1626,0,0 PL      0,12,129
    Can anyone explain why this is occurring?

    Comment


    • #3
      Yes, the vcf would be affected, since it's the same mpileup program. And that's how I caught this; I had a list of variants from the old pileup software, then I updated to the new software, and some old SNPs were not in the vcf. When I checked the new pileup, I saw that it thought that the quality of the SNPs was crap, so I tried to see why the old variant caller had allowed it, and then I realized that the old variant caller was correctly assigning the quality scores, while the new pileup and mpileup were making them far lower than the fastq. Maybe it's doing that for a reason? Related to mapping quality? But I can't see how or why.

      Comment


      • #4
        was illumina quality converted to standard sanger quality?

        Comment


        • #5
          I used the sam2vcf.pl script in the misc folder, and that seems to be working properly, giving it the results of the older version of pileup, where the letter qualities match the fastq, instead of the altered quality scores in the mpileup or pileup. I checked, and the good mixed SNPs that were visible in my old pileup variant file, and not in the new pileup or mpileup files (or the vcfs derived from those) are now visible. Unfortunately, that perl script makes vcf files in an older version; 3.3, instead of 4.0.

          But it's a decent workaround.

          But does anyone know why the letter quality scores in the new pileups and mpileups don't match the letter quality scores from the fastq?

          Comment


          • #6
            I've managed to get the correct quality values by specifying the -B option to samtools mpileup, which disables the "probabilistic realignment for the computation of BAQ". So I'm guessing that mpileup believes there is a misalignment occurring due to an indel. However, when I look at the alignment in IGV it's clearly a SNP, with no indel in sight.

            Comment


            • #7
              Originally posted by cram View Post
              I've managed to get the correct quality values by specifying the -B option to samtools mpileup, which disables the "probabilistic realignment for the computation of BAQ". So I'm guessing that mpileup believes there is a misalignment occurring due to an indel. However, when I look at the alignment in IGV it's clearly a SNP, with no indel in sight.
              Ah, I wondered if that setting was germane. I guess it is. And the same with my sequence. The reads look like they align fine to the reference as is, no indel there.

              Thanks.

              Comment


              • #8
                1. BAQ will change the base quality.

                2. even if no reads are mapped with gaps, there may still be an indel in fact. One has to be experienced enough to tell if there is an indel.

                3. With BAQ, you usually get a couple of percent higher false negative rate, but gain greatly in specificity.

                4. The SVN version has a hidden option -E, which saves sensitivity a little.

                5. pileup is deprecated. use mpileup+bcftools instead.

                Comment


                • #9
                  Originally posted by swbarnes2 View Post
                  I have a pileup from samtools (an mpileup too, but they agree with each other) that shows very odd quality scores. My fastq and .sam agree that the quality scores of the letters in the read range from B to h, which is normal.
                  No, that isn't normal.

                  A SAM file should be using Sanger FASTQ scores, but I suspect you have used an unconverted Illumina FASTQ since there B to h (ASCII 66 to 104) would mean PHRED quality 2 to 40 (ASCII offset 64).

                  If your read quality scores are from 2 to 40 (which is reasonable) the quality letters should be ASCI 35 to 73 (ASCII offset 33), or # to I.

                  See http://en.wikipedia.org/wiki/FASTQ_format and http://dx.doi.org/10.1093/nar/gkp1137
                  Last edited by maubp; 01-13-2011, 12:10 PM. Reason: typo

                  Comment


                  • #10
                    2. even if no reads are mapped with gaps, there may still be an indel in fact. One has to be experienced enough to tell if there is an indel.

                    3. With BAQ, you usually get a couple of percent higher false negative rate, but gain greatly in specificity.
                    Is low coverage expected to have a serious adverse affect on the BAQ algorithm accuracy? I have a very large number of sites that are being given these low quality values, and I really cannot see how most of them are suspected of being indels. However, my data is very low coverage, with an average of only about 5x.

                    Comment


                    • #11
                      Originally posted by maubp View Post
                      No, that isn't normal.

                      A SAM file should be using Sanger FASTQ scores, but I suspect you have used an unconverted Illumina FASTQ since there B to h (ASCII 66 to 104) would mean PHRED quality 2 to 40 (ASCII offset 64).

                      If your read quality scores are from 2 to 40 (which is reasonable) the quality letters should be ASCI 35 to 73 (ASCII offset 33), or # to I.
                      Ah, indeed I am. That might be confusing the software, so I'm converting now. I also plan to check some of the SNPs that mpileup refused to call when BAQ calcualtions were included, but mpileup with -B did call. Some of them make velvet contigs that agree with the SNP that was called with -B, which is modest evidence that the SNP is real, and should not be filtered away based on BAQ calculations.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      51 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X