I have a pacbio CCS fastq file that I have aligned to a small (5037 bp) reference which is part of a sequenced BAC clone. I have aligned this data with multiple aligners (minimap2 and smalt), called variants with multiple variant callers (bbmap, freebayes, and gatk) and viewed the bam file with multiple viewers (igv and tablet). All produce similar results. I will focus here on minimap2 and callvariants (bbmap), as the problem likely stems from the variant caller and seqanswers is the preferred avenue for support for bbmap.
Here is what I did:
A partial reading of callvariants log output:
The single called variant (a deletion at the site of a homopolymer) is reported in the vcf.
To my interpretation of this, a single "c" has been lost from the poly-C which starts at this position. Moreover this deletions occurs in 619 of 1000 reads covering this position.
Now the strange part. When I view the alignment (out.sort.bam) in IGV (or Tablet), I see at most 1-2% indels at our near this site while the rest of the sequence is largely free of indels. Why is there this large discrepancy between what is observed in an alignment viewer and what is called by the variant caller (i.e. there are many more variant reads called than are observed on the viewer?
Thank you for your help
Here is what I did:
Code:
minimap2 -t 5 -ax asm20 reference.fa sample.ccs.fastq > out.sam samtools view -O BAM -o out.bam out.sam && samtools sort -o out.sort.bam out.bam && samtools index out.sort.bam callvariants.sh ploidy=1 in=out.sort.bam ref=reference.fa vcf=out.vcf
Code:
1 of 3311 variants passed primary filters (0.0302%). Type Count Rate AD Depth AF Score Qual Substitutions: 0 0.0% 0.0 0.0 0.000 0.0 0.0 Deletions: 1 100.0% 619.0 1000.0 0.619 23.0 34.0 Insertions: 0 0.0% 0.0 0.0 0.000 0.0 0.0 Variation Rate: 1/5037 Homozygous: 1 100.0%
Code:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT out.sort reference 1401 . Tc T 23.66 PASS SN=0;STA=1401;STO=1402;TYP=DEL;R1P=338;R1M=281;R2P=0;R2M=0;AD=619;DP=1000;MCOV=-1;PPC=0;AF=0.6190;RAF=0.6190;LS=2996497;MQS=37140;MQM=60;BQS=21455;BQM=41;EDS=956749;EDM=4467;IDS=617766;IDM=999;NVC=0;FLG=0;CED=601;HMP=6;SB=0.9997 GT:DP:AD:AF:RAF:NVC:FLG:SB:SC:PF 1:1000:619:0.6190:0.6190:0:0:0.9997:23.66:PASS
Now the strange part. When I view the alignment (out.sort.bam) in IGV (or Tablet), I see at most 1-2% indels at our near this site while the rest of the sequence is largely free of indels. Why is there this large discrepancy between what is observed in an alignment viewer and what is called by the variant caller (i.e. there are many more variant reads called than are observed on the viewer?
Thank you for your help
Comment