Hey all,
I am trying to call SNPs on tophat output and am missing known SNPs with good coverage. I hoping to get advice on tuning samtools/bcftools for RNA-seq data or recommendations for alternative pipelines.
I have directional, polyA-selected, paired-end 101bp libraries sequenced on an Illumina HiSeq machine. I t used tophat (v1.1.4) to map trimmed reads and samtools (version 0.1.11 (r851)) to sort the accepted.hits.bam file and used mpileup/bcftools (I also tried pileup) to call variants. I followed the suggestions in the samtools manual (samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf), except I played with the -D parameter a bit.
Using these parameters I call around 22,000 snps. However, I miss around half of a set of known snps with genes.
As an example, a T to C snp, which I know is in my sample from Sanger on the DNA, didn't show up in this analysis. Even if I call pileup with only -c -f, I can't find this position, which doesn't make any sense to me because I was under the impression that omitting -v would provide information on all covered positions. Using tview, I know that I have 119 reads that cover the position in the bam file and 38 are ref, 79 are C and 2 are G.
This maybe as good as it gets, but I would appreciate any suggestions on altering samtools parameters or alternative methods. Thanks!
I am trying to call SNPs on tophat output and am missing known SNPs with good coverage. I hoping to get advice on tuning samtools/bcftools for RNA-seq data or recommendations for alternative pipelines.
I have directional, polyA-selected, paired-end 101bp libraries sequenced on an Illumina HiSeq machine. I t used tophat (v1.1.4) to map trimmed reads and samtools (version 0.1.11 (r851)) to sort the accepted.hits.bam file and used mpileup/bcftools (I also tried pileup) to call variants. I followed the suggestions in the samtools manual (samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf), except I played with the -D parameter a bit.
Using these parameters I call around 22,000 snps. However, I miss around half of a set of known snps with genes.
As an example, a T to C snp, which I know is in my sample from Sanger on the DNA, didn't show up in this analysis. Even if I call pileup with only -c -f, I can't find this position, which doesn't make any sense to me because I was under the impression that omitting -v would provide information on all covered positions. Using tview, I know that I have 119 reads that cover the position in the bam file and 38 are ref, 79 are C and 2 are G.
This maybe as good as it gets, but I would appreciate any suggestions on altering samtools parameters or alternative methods. Thanks!
Comment