Hello everybody,
I made a strange observation in the sequencing alignments of our cohort and was wondering whether you could help me. We sequenced several members of a family using Illumina whole exome sequencing, and I aligned the reads with bwa mem and novoalign (without trimming prior to alignment). Within one particular genomic region, which is protein-coding, exonic and unique (it is the only BLAT hit against the human genome, and its mappability is 1.0), the base quality is really bad only for reads on the reverse strand, not for reads on the forward strand, and this happens in every sample we sequenced. Bases outside this particular region are totally fine.
Here is a screenshot of the alignment (viewed in the UCSC Genome Browser):
The figure shows a window of about 100 bp.
Within this short genomic region, on the reverse strand, the base quality is consistently lower than 5 for ~95% of the reads, resulting in many sequencing errors (as shown in the figure). Only a small fraction (~5%) of the reverse-strand reads still have high quality (baseQ>30) across the same string of bases.
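In case anyone wants to reproduce the per-strand tally on their own data, here is a minimal sketch of how it could be done. It is not the exact procedure behind the numbers above: the function names (`strand_base_quals`, `summarize`) are made up for illustration, it assumes Phred+33 qualities in plain SAM text, and it counts individual bases rather than whole reads.

```python
import re

# CIGAR operations as defined in the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def strand_base_quals(sam_lines, chrom, start, end):
    """Collect base qualities per strand for bases aligned inside
    chrom:start-end (1-based, inclusive). Assumes Phred+33 encoding."""
    quals = {"+": [], "-": []}
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        cigar, qual = fields[5], fields[10]
        # Skip unmapped reads and records without CIGAR or qualities.
        if rname != chrom or flag & 0x4 or cigar == "*" or qual == "*":
            continue
        strand = "-" if flag & 0x10 else "+"
        ref_pos, read_idx = pos, 0
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":  # consumes both read and reference
                for i in range(length):
                    if start <= ref_pos + i <= end:
                        quals[strand].append(ord(qual[read_idx + i]) - 33)
                ref_pos += length
                read_idx += length
            elif op in "IS":  # consumes read only
                read_idx += length
            elif op in "DN":  # consumes reference only
                ref_pos += length
            # H and P consume neither read nor reference
    return quals


def summarize(quals, low=5):
    """Return {strand: (mean quality, fraction of bases below `low`)}."""
    summary = {}
    for strand, values in quals.items():
        if values:
            mean_q = sum(values) / len(values)
            frac_low = sum(1 for q in values if q < low) / len(values)
            summary[strand] = (mean_q, frac_low)
        else:
            summary[strand] = (None, None)
    return summary
```

The input can be any iterable of SAM lines, for example the text output of `samtools view` restricted to the region of interest. A large gap between the two strands in the low-quality fraction would match what the screenshot shows.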
I have considered complex structural variants, lane bias, bad sample handling at the sequencing center, etc., but none of those can be the reason, because the same sequencing failure was observed across different samples, sequencing platforms (Illumina GAII and HiSeq2000), sequencing centers (we had samples sequenced at two centers), exon capture kits (some samples used NimbleGen and some Agilent), lanes, R1/R2 of the pairs, and aligners. Therefore, it is likely to be intrinsic to the samples themselves, but I couldn't come up with a good explanation. All samples are germline samples from patients who developed tumors.
Any comments and suggestions would be greatly appreciated! Thanks =)
Cheers,
Sonia