SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
SHRiMP and SOLiD Qualities agc SOLiD 5 03-18-2013 01:59 PM
1k genomes .bam qualities keebs42 Bioinformatics 2 04-27-2010 08:18 AM
Converting ABI colorspace qualities into base qualities szilva Bioinformatics 7 04-01-2010 11:38 PM
TopHat mapping qualities dychiang Bioinformatics 0 10-20-2009 01:44 PM
Understanding Qualities using MAQ binnaz Illumina/Solexa 1 11-06-2008 01:41 AM

Reply
 
Thread Tools
Old 01-10-2011, 10:41 AM   #1
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default Qualities in pileup disagree with qualities in fastq

I have a pileup from samtools (an mpileup too, but they agree with each other) that shows very odd quality scores. My fastq and .sam agree that the quality scores of the letters in the read range from B to h, which is normal.

The fastq and the .sam agree that the quality of the reads falling across a certain locus are quite high. But the quality scores of the pileup are strange. They range from ! to h, and certain positions are uniformly low, sometimes extremely so.

Here's a sample: The mixed SNP are 4107 looks awful based on the pileup, but going back to the original reads, the quality is great for C's and A's in that position. When I use an older version of samtools (0.1.7) on the same .bam, the pileup qualities are correct. When I use pileup in 0.1.12a, it looks like it does below, and the qualities are wrong.

chr2 69574093 G G 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. QhhfchhhhhhhhhhahhghhhhhehNN
chr2 69574094 A A 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. OhhgShhhhhhfhhc]fgghhhhfehNN
chr2 69574095 G G 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. NhhgahhhhhhhhhhbhhghhhhhchOO
chr2 69574096 A A 111 0 36 28 .,...,.,..,.,,..,.,,.,,..,.. NhhfWhhhghhhhhh`hhgghhhhfhQQ
chr2 69574097 G G 114 0 36 29 .$,...,.,..,.,,..,.,,.,,..,..^F. DhhgahhhhhhhehhadhghhhhhahRRD
chr2 69574098 A A 111 0 36 28 ,...,.,..,.,,..,.,,.,,..,... hhdYhhhghhhehhahgghhhggchUUI
chr2 69574099 T T 111 0 36 28 ,...,.,..,.,,..,.,,.,,..,... aaaaaaaaaaaaaaaaaaaaaaaaaaa_
chr2 69574100 T T 114 0 36 29 ,...,.,..,.,,..,.,,.,,..,...^F, _aaaaaaaaaaaaaaaaaaaaaaaaaaaE
chr2 69574101 G G 117 0 36 30 ,...,.,..,.,,..,.,,.,,..,...,^F, UhgchhhhhhhehhcchfhhhhhfhhhgUE
chr2 69574102 A A 123 0 36 32 ,$...,.,..,.,,..,.,,.,,..,...,,^F.^:, >EMY\\O\\\O5OPPPO\O\\POP\OS\\U?@
chr2 69574103 A A 120 0 36 31 ...,.,..,.,,..,.,,.,,..,...,,., BJZZZLZZZL2LMMMLZLZZMLMZLPZZ[AA
chr2 69574104 A A 103 0 36 31 ...,.,..,.c,..,.,,.,,..,...,,., AFZZZJZZZK0KKKKKZKZZKKKZKNZZ[BB
chr2 69574105 A A 120 0 36 32 .$..,.,..,.,,..,.,,.,,..,...,,.,^:. ?:\\\?\\\@5@@@@@\@\\@@@\@D\\]E?#
chr2 69574106 C C 75 0 37 33 ..,.,..,.,,..,.,,.,,..,...,,.,.^:.^F, %^^a!aaa!J!!!!!a!aa!!!a!'aaaa!#!B
chr2 69574107 C C 78 0 37 35 A$.,.a..,A,aAAaA,a.,aAA,AA.,,.aAA,^F.^:a !UUa!aaa!\!!!!!`!aa!!!a!%aaaa!!!EE!
chr2 69574108 A A 78 0 37 34 .$,$.,..,.,,..,.,,.,,..,...,,.,..,., CCU!ehh!h!!!!!Z!hh!!!h!'ghhh!!$cU!
chr2 69574109 G G 123 0 35 32 .,..,.,,..,.,,.,,..,...,,.,..,., R4bhhPhPPPPPgPhhQPQhPTghhhQGPhe?
chr2 69574110 A A 123 0 35 32 .$,$..,.,,..,.,,.,,..,...,,.,..,., >4>\\P\PPPPP\P\\QPP\PT\\\\PPR\\A
chr2 69574111 A A 117 0 35 30 .$.,.,,..,.,,.,,..,...,,.,..,., ;ZZVZVVVVVZVZZWVWZVXZZZZWUYZZD
chr2 69574112 A A 114 0 35 29 .,.,,..,.,,.,,..,...,,.,..,., YYYZYYYYYZYZZYYYZYZZZZZYYYZZO
chr2 69574113 A A 117 0 35 30 .,.,,..,.,,.,,..,...,,.,..,.,^F. TT\\\\\\\\\\\\\\\\\\\\\\\\\\SE
chr2 69574114 C C 117 0 35 30 .,.,,..,.,,.,,..,...,,.,..,.,. RRhhhhgghghhhahZghhchhhfhhhhhR
chr2 69574115 A A 117 0 35 30 .$,$.,,..,.,,.,,..,...,,.,..,.,. DDhhhhghgahhhgeVhhhghhhfhhghhU
chr2 69574116 G G 114 0 35 29 .,,..,.,,.,,..,...,,.,..,.,.^F. hhbh`fhghhhhhOhfhghhgahhhhhhE
chr2 69574117 A A 117 0 35 30 .,,..,.,,.,,..,...,,.,..,.,..^:, ah_cchhghhhdgThghghhhdhhhhhhU@
chr2 69574118 T T 111 0 35 30 .,,..,.,,.,,.C,...,,.,..,.,.., Bh2PQQQgQhgQ+-hQQahhhVQQhhQhbB
chr2 69574119 G G 114 0 35 31 .,,..,.,,.,,..,...,,.,..,.,..,^F. 0e/LMMMgMghM'-hMMghhhRMMfhMheAB
chr2 69574120 C C 72 0 37 32 .,,..,.,,.,,..,...,,.,..,.,..,.^F. ,E+!!!!a!aa!$+a!!aaaa'!!aa!aa!BB
chr2 69574121 C C 75 0 37 33 T$,$tTTtT,t.,tTT,TT.,,.tTT,.t..t..^F. !A!!!!!a!aa!!+a!!aaaa'!!aa!aa!BEE
chr2 69574122 G G 117 0 35 32 ,$..,.,,.,,..,...,,.,..,.,..,...^:. !;<;;g;hh;!0h;;ahhh@;;gh;hh;BbRE
chr2 69574123 C C 117 0 35 31 ..,.,,.,,..,...,,.,..,.,..,.... ;<;;g;hh;!2h;;dggh@;;hh;hh;BeUQ
chr2 69574124 T T 120 0 34 32 ..,.,,.,,C.,...,,.,..,.,..,....^:. AQQQgQhhQ!7hQQdhhhVQQhhQhhPBgheE
chr2 69574125 A A 126 0 34 33 .$.,.,,.,,..,...,,.,..,.,..,.....^:. @dhhfhghfNahhfdghhdhhhhhhhhBeghUE

Last edited by swbarnes2; 01-10-2011 at 02:24 PM. Reason: adding more information
swbarnes2 is offline   Reply With Quote
Old 01-10-2011, 03:23 PM   #2
cram
Member
 
Location: Canada

Join Date: Nov 2008
Posts: 16
Default

I have a very similar problem and it seems to affect not only the pileup output but the VCF as well.

I have a site which appears to be a SNP on visual inspection - all five reads at the location are T rather than the reference C. The base qualities at the site are very high in the fastq and SAM. The mapping qualities for the reads are also very good. The pileup however claims that the qualities of the bases at that site are all 0.

The VCF fails to recognize the existence of an alternate base at this location and has all '0' values for the I16 and PL fields. The '0' values also occur at the following two bases. This is not the output of bcftool's variant detection, this is just the raw output of samtools mpileup -ugf.

Here's what the VCF looks like (putative SNP is at 24,335,869):

Code:
Chr5    24335868        .       G       X       0       .       DP=6;I16=5,1,0,0,172,5060,0,0,222,8214,0,0,78,1796,0,0  PL      0,18,130
Chr5    24335869        .       C       X       0       .       DP=5;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0        PL      0,0,0
Chr5    24335870        .       T       X       0       .       DP=4;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0        PL      0,0,0
Chr5    24335871        .       T       X       0       .       DP=4;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0        PL      0,0,0
Chr5    24335872        .       C       X       0       .       DP=4;I16=3,1,0,0,220,12100,0,0,148,5476,0,0,76,1626,0,0 PL      0,12,129
Can anyone explain why this is occurring?
cram is offline   Reply With Quote
Old 01-10-2011, 04:04 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Yes, the vcf would be affected, since it's the same mpileup program. And that's how I caught this; I had a list of variants from the old pileup software, then I updated to the new software, and some old SNPs were not in the vcf. When I checked the new pileup, I saw that it thought that the quality of the SNPs was crap, so I tried to see why the old variant caller had allowed it, and then I realized that the old variant caller was correctly assigning the quality scores, while the new pileup and mpileup were making them far lower than the fastq. Maybe it's doing that for a reason? Related to mapping quality? But I can't see how or why.
swbarnes2 is offline   Reply With Quote
Old 01-10-2011, 05:29 PM   #4
csoong
Member
 
Location: Connecticut

Join Date: Jun 2009
Posts: 74
Default

was illumina quality converted to standard sanger quality?
csoong is offline   Reply With Quote
Old 01-11-2011, 09:57 AM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

I used the sam2vcf.pl script in the misc folder, and that seems to be working properly, giving it the results of the older version of pileup, where the letter qualities match the fastq, instead of the altered quality scores in the mpileup or pileup. I checked, and the good mixed SNPs that were visible in my old pileup variant file, and not in the new pileup or mpileup files (or the vcfs derived from those) are now visible. Unfortunately, that perl script makes vcf files in an older version; 3.3, instead of 4.0.

But it's a decent workaround.

But does anyone know why the letter quality scores in the new pileups and mpileups don't match the letter quality scores from the fastq?
swbarnes2 is offline   Reply With Quote
Old 01-11-2011, 03:55 PM   #6
cram
Member
 
Location: Canada

Join Date: Nov 2008
Posts: 16
Default

I've managed to get the correct quality values by specifying the -B option to samtools mpileup, which disables the "probabilistic realignment for the computation of BAQ". So I'm guessing that mpileup believes there is a misalignment occurring due to an indel. However, when I look at the alignment in IGV it's clearly a SNP, with no indel in sight.
cram is offline   Reply With Quote
Old 01-12-2011, 09:47 AM   #7
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by cram View Post
I've managed to get the correct quality values by specifying the -B option to samtools mpileup, which disables the "probabilistic realignment for the computation of BAQ". So I'm guessing that mpileup believes there is a misalignment occurring due to an indel. However, when I look at the alignment in IGV it's clearly a SNP, with no indel in sight.
Ah, I wondered if that setting was germane. I guess it is. And the same with my sequence. The reads look like they align fine to the reference as is, no indel there.

Thanks.
swbarnes2 is offline   Reply With Quote
Old 01-12-2011, 01:17 PM   #8
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

1. BAQ will change the base quality.

2. even if no reads are mapped with gaps, there may still be an indel in fact. One has to be experienced enough to tell if there is an indel.

3. With BAQ, you usually get a couple of percent higher false negative rate, but gain greatly in specificity.

4. The SVN version has a hidden option -E, which saves sensitivity a little.

5. pileup is deprecated. use mpileup+bcftools instead.
lh3 is offline   Reply With Quote
Old 01-13-2011, 11:08 AM   #9
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by swbarnes2 View Post
I have a pileup from samtools (an mpileup too, but they agree with each other) that shows very odd quality scores. My fastq and .sam agree that the quality scores of the letters in the read range from B to h, which is normal.
No, that isn't normal.

A SAM file should be using Sanger FASTQ scores, but I suspect you have used an unconverted Illumina FASTQ since there B to h (ASCII 66 to 104) would mean PHRED quality 2 to 40 (ASCII offset 64).

If your read quality scores are from 2 to 40 (which is reasonable) the quality letters should be ASCI 35 to 73 (ASCII offset 33), or # to I.

See http://en.wikipedia.org/wiki/FASTQ_format and http://dx.doi.org/10.1093/nar/gkp1137

Last edited by maubp; 01-13-2011 at 11:10 AM. Reason: typo
maubp is offline   Reply With Quote
Old 01-13-2011, 11:56 AM   #10
cram
Member
 
Location: Canada

Join Date: Nov 2008
Posts: 16
Default

Quote:
2. even if no reads are mapped with gaps, there may still be an indel in fact. One has to be experienced enough to tell if there is an indel.

3. With BAQ, you usually get a couple of percent higher false negative rate, but gain greatly in specificity.
Is low coverage expected to have a serious adverse affect on the BAQ algorithm accuracy? I have a very large number of sites that are being given these low quality values, and I really cannot see how most of them are suspected of being indels. However, my data is very low coverage, with an average of only about 5x.
cram is offline   Reply With Quote
Old 01-13-2011, 03:12 PM   #11
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by maubp View Post
No, that isn't normal.

A SAM file should be using Sanger FASTQ scores, but I suspect you have used an unconverted Illumina FASTQ since there B to h (ASCII 66 to 104) would mean PHRED quality 2 to 40 (ASCII offset 64).

If your read quality scores are from 2 to 40 (which is reasonable) the quality letters should be ASCI 35 to 73 (ASCII offset 33), or # to I.
Ah, indeed I am. That might be confusing the software, so I'm converting now. I also plan to check some of the SNPs that mpileup refused to call when BAQ calcualtions were included, but mpileup with -B did call. Some of them make velvet contigs that agree with the SNP that was called with -B, which is modest evidence that the SNP is real, and should not be filtered away based on BAQ calculations.
swbarnes2 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:10 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO