I'm aligning color-space reads to the NCBI36 prebuilt index using bowtie. I'll get to my question at the end, but I ask that you bear with me while I set up the problem first. Here's one read I'll use as an example of something strange I'm seeing globally in my assembly:
from the .csfasta:
Code:
>611_1320_1660
T02223020230322302111212131103031121202322002111112
from the .qual:
Code:
>611_1320_1660_F3
31 31 29 31 28 29 33 22 32 28 33 32 28 30 32 30 29 23 29 20 20 22 20 24 26 29 33 30 23 4 16 30 25 29 6 25 14 29 17 22 27 32 25 22 6 30 28 27 20 11
After conversion to fastq using the BFAST script solid2fastq 0.6.3d, I have the following:
Code:
@611_1320_1660
T02223020230322302111212131103031121202322002111112
+
@@>@=>B7A=BA=?A?>8>55759;>B?8%1?:>':/>27<A:7'?=<5,
So far, so good. Now, using bowtie with the following parameters:
Code:
bowtie_0.12.2/bowtie -t -C -S --chunkmbs 256 --nomaqround --best -n 2 -e 90 -l 28 -p 8 h_sapiens_asm_c reads_91.fastq bowtie_run3.sam
this read aligns uniquely to chromosome 5. Here is the SAM line:
Code:
611_1320_1660 0 gi|51511721|ref|NC_000005.8|NC_000005 177633203 255 48M * 0 0 CTCGGAAGCCGAGCCTGTGACTGCACCGGCACTGAAGCTCCCTGTGTG ]\Z_XW]^b][__\UURI!!%SX_`V<5OXWD@HLOHR\ZP=E[XP@, XA:i:2 MD:Z:19C28 NM:i:1 CM:i:2
Wonderful! Now, getting to the point, looking a little more closely at the line-up of the base sequence, the base qualities, the color qualities, and the original color-space read sequence:
Code:
C T C G G A A G C C G A G C C T G T G A C T G C A C C G G C A C T G A A G C T C C C T G T G T G
] \ Z _ X W ] ^ b ] [ _ _ \ U U R I ! ! % S X _ ` V < 5 O X W D @ H L O H R \ Z P = E [ X P @ ,
@ @ > @ = > B 7 A = B A = ? A ? > 8 > 5 5 7 5 9 ; > B ? 8 % 1 ? : > ' : / > 2 7 < A : 7 ' ? = < 5 ,
T 0 2 2 2 3 0 2 0 2 3 0 3 2 2 3 0 2 1 1 1 2 1 2 1 3 1 1 0 3 0 3 1 1 2 1 2 0 2 3 2 2 0 0 2 1 1 1 1 1 2
I'm having difficulty seeing how these four rows should line up. It seems clear that I've misaligned the color sequence with the base sequence, but I can't seem to find the correct alignment. The reason I'm looking at this so carefully is the 4th 'A' and its base quality, '!' (0). If my alignment is correct, the color qualities flanking this base are 20 and 20, well within the norm for the rest of the read, but only the preceding G and this A have base qualities of 0.
Now I'll point out that every base in this read matches the reference, save this A. BLASTing the read confirms this mapping location. Actually, there are several lines of evidence indicating the individual is heterozygous at this site, with C being the reference allele. What is notable is that there are 37 other reads calling an 'A' at this site, and each of them similarly has a phred quality of 0 only at this alignment position, occasionally along with a position adjacent to it. Furthermore, this seems to be a data-set-wide phenomenon: every base that mismatches the reference shows up with 0 quality. The result, of course, is that I am calling no SNPs in my consensus sequence, since all of the non-reference nucleotides are supported only by 0-quality bases.
Any suggestions about what might be going on here? I don't think "all of the variable sites really are sequencing errors" is a satisfying answer. Why would the phred base quality not rise and fall with the color quality?
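One way to check how the rows line up is to decode the colors directly and look for the SAM sequence inside the result. Below is a minimal sketch in Python, assuming the standard SOLiD two-base encoding (A, C, G, T numbered 0 to 3, each color being the XOR of the two adjacent bases). The function and variable names are made up for illustration, and this is not a reproduction of how bowtie itself decodes reads; it only works from the read, qualities, and SAM sequence quoted in this post.

```python
# Data copied from the read in this post; all names here are hypothetical.
CS_READ = "T02223020230322302111212131103031121202322002111112"
COLOR_QUALS = [31, 31, 29, 31, 28, 29, 33, 22, 32, 28, 33, 32, 28, 30, 32,
               30, 29, 23, 29, 20, 20, 22, 20, 24, 26, 29, 33, 30, 23, 4,
               16, 30, 25, 29, 6, 25, 14, 29, 17, 22, 27, 32, 25, 22, 6,
               30, 28, 27, 20, 11]
SAM_SEQ = "CTCGGAAGCCGAGCCTGTGACTGCACCGGCACTGAAGCTCCCTGTGTG"

BASES = "ACGT"


def decode_colors(cs_read):
    """Decode primer base + colors into bases (primer itself is not returned).

    Assumes two-base encoding: next_base = previous_base XOR color.
    """
    prev = BASES.index(cs_read[0])
    decoded = []
    for color in cs_read[1:]:
        prev ^= int(color)
        decoded.append(BASES[prev])
    return "".join(decoded)


decoded = decode_colors(CS_READ)
# Position of the SAM sequence within the decoded read (-1 if absent).
offset = decoded.find(SAM_SEQ)

# The mismatching 'A' is SAM position 20, i.e. 0-based index 19.
sam_index = 19
# decoded[k] is the base produced by color k (0-based), so that base is
# flanked by color k on its left and color k + 1 on its right.
left = offset + sam_index
right = left + 1

print("decoded read:", decoded)
print("SAM sequence starts at decoded offset:", offset)
print("base:", SAM_SEQ[sam_index])
print("flanking colors:", CS_READ[1 + left], CS_READ[1 + right])
print("flanking color qualities:", COLOR_QUALS[left], COLOR_QUALS[right])
```

If the SAM sequence turns up at a nonzero offset, the 48 aligned bases are an interior slice of the 50 decoded bases, so each row of the line-up has to be shifted accordingly before the color qualities can be compared with the base qualities. As far as I understand the bowtie manual, this is expected, since bowtie trims one nucleotide from each end of a decoded colorspace alignment unless told otherwise.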
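To make the zero-quality observation concrete for this one read, here is a small sketch in Python that pulls the mismatch position out of the MD tag and lists the positions whose base quality is 0. It assumes the SAM QUAL field is Phred+33, as the SAM specification describes, and it only handles the simple MD syntax of match counts, mismatched reference bases, and `^` deletions. The helper names are hypothetical.

```python
import re

# Copied from the SAM line in this post (raw string because of backslashes).
SAM_QUAL = r"]\Z_XW]^b][__\UURI!!%SX_`V<5OXWD@HLOHR\ZP=E[XP@,"
MD_TAG = "19C28"


def phred33(qual_string):
    """Convert a Phred+33 ASCII quality string to integer qualities."""
    return [ord(ch) - 33 for ch in qual_string]


def md_mismatches(md):
    """Return (1-based read position, reference base) for each MD mismatch."""
    pos = 0
    found = []
    for count, deletion, base in re.findall(r"(\d+)|(\^[A-Z]+)|([A-Z])", md):
        if count:
            pos += int(count)      # run of matching bases
        elif base:
            pos += 1               # a mismatch consumes one read base
            found.append((pos, base))
        # deletions from the reference consume no read bases
    return found


base_quals = phred33(SAM_QUAL)
mismatches = md_mismatches(MD_TAG)
zero_positions = [i + 1 for i, q in enumerate(base_quals) if q == 0]

print("mismatches (read position, reference base):", mismatches)
print("positions with base quality 0:", zero_positions)
```

Running the same check over every aligned read would show whether zero-quality bases really do coincide with mismatch positions across the whole data set, rather than just at this site.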