SEQanswers

Old 03-08-2011, 05:37 PM   #1
dagarfield
Member
 
Location: Heidelberg, Germany

Join Date: Aug 2010
Posts: 39
Calling tri-allelic SNPs using samtools (or similar)

Hi folks,

We work with sea urchin larvae in our lab. They are very, very, very tiny and, thus, we need to collect a whole bunch of them at a time to get sufficient starting material for NGS. Urchins are also highly polymorphic.

RESULT: There are times in which some SNPs are effectively tri-allelic in a single sample, something that simply isn't ever going to happen if your sample consists of a happily diploid individual human (or medical model system of your choice).

To see what happens when one has three alleles at a polymorphic site, I constructed a fake dataset (which I can provide) consisting of three reads from each of three different haplotypes. Using samtools mpileup, I can generate the following line for the base in question:

Code:
samtools mpileup -f mySeqs.fa combined.bam > combined.pileup

dgarfield$ less combined.pileup | grep 124217
Scaffold1200	124217	G	9	aaattt,,,	=========p
Great, the program sees that there are three alleles at 124217.
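For anyone who wants to tally the alleles in that pileup line programmatically, here is a minimal sketch. It assumes the standard samtools pileup read-bases encoding (`.` and `,` for reference matches, letters for mismatches, `^` plus one character for a read start, `$` for a read end, `+N`/`-N` followed by N bases for indels). The function name `count_pileup_bases` is made up for illustration and is not part of samtools.

```python
import re
from collections import Counter


def count_pileup_bases(ref_base, read_bases):
    """Count the alleles seen in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            # Read start marker, followed by one mapping-quality character.
            i += 2
            continue
        if ch in "+-":
            # Indel: sign, length digits, then that many bases to skip.
            m = re.match(r"\d+", read_bases[i + 1:])
            if m:
                i += 1 + len(m.group()) + int(m.group())
                continue
        if ch in ".,":
            # Match to the reference on the forward (.) or reverse (,) strand.
            counts[ref_base.upper()] += 1
        elif ch.upper() in "ACGTN":
            # Mismatch; lowercase letters are reverse-strand reads.
            counts[ch.upper()] += 1
        # '$', '*' and any other symbols are skipped.
        i += 1
    return counts


# The pileup line from the post above (tab-separated).
line = "Scaffold1200\t124217\tG\t9\taaattt,,,\t=========p"
chrom, pos, ref, depth, bases, quals = line.split("\t")
counts = count_pileup_bases(ref, bases)
print(dict(counts))  # {'A': 3, 'T': 3, 'G': 3}
```

So the raw pileup does carry three alleles at equal depth: three A reads, three T reads, and three reads matching the reference G.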

Now, let's take a look at the results of bcftools view:

Code:
samtools mpileup -uf mySeqs.fa combined.bam > combined.pileup_u

dgarfield$ bcftools view -cg combined.pileup_u | grep 124217
Scaffold1200	124217	.	G	T	19.1	.	DP=9;AF1=0.5;CI95=0.5,0.5;DP4=0,3,0,6;MQ=60;FQ=19.1;PV4=1,1,1,1	GT:PL:GQ	0/1:49,0,49:49
T? That was not what I was expecting. I was hoping for A,T,G.
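One thing worth pulling out of that VCF line is the DP4 field, which lists the counts of reference-forward, reference-reverse, alternate-forward and alternate-reverse bases. The sketch below only parses the line shown above; `parse_info` is a hypothetical helper written for this example, not a bcftools or library function.

```python
# The VCF line from the post above (tab-separated).
vcf_line = (
    "Scaffold1200\t124217\t.\tG\tT\t19.1\t.\t"
    "DP=9;AF1=0.5;CI95=0.5,0.5;DP4=0,3,0,6;MQ=60;FQ=19.1;PV4=1,1,1,1\t"
    "GT:PL:GQ\t0/1:49,0,49:49"
)


def parse_info(info):
    """Turn a VCF INFO string like 'DP=9;AF1=0.5' into a dict of strings."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


fields = vcf_line.split("\t")
info = parse_info(fields[7])

# DP4 = ref-forward, ref-reverse, alt-forward, alt-reverse base counts.
ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
ref_reads = ref_fwd + ref_rev
alt_reads = alt_fwd + alt_rev
print(ref_reads, alt_reads)  # 3 6
```

Only three reads carry T, yet six reads are counted as alternate. That is consistent with the A and T reads being pooled together as "non-reference" and reported under a single ALT allele, although nothing in the output itself confirms how bcftools picked T over A.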

That brings me to my two questions.

1) Given the equal balance of alleles at SNP 124217, why does bcftools choose 'T'?
2) Are there any situations in which bcftools can return more than two alleles at a single SNP?

Any insights would be greatly appreciated.

Thanks,

David
Old 03-09-2011, 05:07 AM   #2
dagarfield
Member
 
Location: Heidelberg, Germany

Join Date: Aug 2010
Posts: 39

Here's a response I got from the samtools mailing list. It is not overly encouraging about using samtools for this problem. Any suggestions for other good SNP calling programs?

Quote:
You should have been hoping for "A,T" not "T" or "A,T,G" because G is the reference so not an alternate allele.

But samtools and bcftools can't handle your situation. The current version always assumes the sample is diploid. I understand there is some experimentation at handling haploid samples (good for X and Y chromosomes as well as true haploid situations), but handling high ploidy/arbitrary mixtures is something else that needs its own calculations with a prior over distributions on the four nucleotides (or more if you want to consider overlapping indels).
Old 03-09-2011, 06:43 AM   #3
dagarfield
Member
 
Location: Heidelberg, Germany

Join Date: Aug 2010
Posts: 39

Another response, from Heng Li on the samtools help list:

Quote:
Samtools-0.1.13 always assumes the sample is diploid and on a diploid genome, it is impossible to have three different alleles. Nonetheless, you may still see two alternative alleles in a single sample. This indicates that the sample has two alleles but both different from the reference.

However, samtools does not handle triallelic alleles properly. Although you may occasionally see them in the VCF report, the QUAL and the GT are not computed in the proper way. Perhaps glfmultiples is better in this case. Note that glfmultiples also assumes the input is diploid. Multi-ploidy and multi-allele are two different issues.

Heng
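To make the "always assumes diploid" point concrete, here is a toy calculation. It is not the samtools/bcftools model: it assumes a single made-up per-base error rate (`err=0.01`), treats reads as independent, and ignores base and mapping qualities. It compares how well each of the ten diploid genotypes explains three A, three T and three G reads against a pooled model whose allele frequencies are free to be 1/3 each.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def log_likelihood(counts, allele_freqs, err=0.01):
    """Log-likelihood of observed base counts given allele frequencies.

    A read drawn from allele `a` shows `a` with probability 1 - err and
    each of the other three bases with probability err / 3 (an assumption).
    """
    ll = 0.0
    for base, n in counts.items():
        p = sum(
            f * ((1 - err) if base == a else err / 3)
            for a, f in allele_freqs.items()
        )
        ll += n * math.log(p)
    return ll


def genotype_freqs(a1, a2):
    """Allele frequencies implied by a diploid genotype a1/a2."""
    freqs = {b: 0.0 for b in BASES}
    freqs[a1] += 0.5
    freqs[a2] += 0.5
    return freqs


counts = {"A": 3, "T": 3, "G": 3}
depth = sum(counts.values())

# All ten diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
diploid = {
    a1 + a2: log_likelihood(counts, genotype_freqs(a1, a2))
    for a1, a2 in combinations_with_replacement(BASES, 2)
}
best = max(diploid, key=diploid.get)

# Pooled model: allele frequencies taken straight from the read counts.
pool_freqs = {b: counts.get(b, 0) / depth for b in BASES}
pool_ll = log_likelihood(counts, pool_freqs)

print("best diploid genotype:", best, round(diploid[best], 2))
print("pooled model:", round(pool_ll, 2))
```

Under these assumptions the three heterozygotes AG, AT and GT tie as the best diploid fit, because each one has to write off a third of the reads as errors, and the pooled model fits far better. A diploid caller therefore has no principled way to prefer one pair of alleles here, which matches the advice above to use a caller designed for pools.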
Old 03-09-2011, 07:41 AM   #4
dagarfield
Member
 
Location: Heidelberg, Germany

Join Date: Aug 2010
Posts: 39

Oh, all kinds of good things. More from Heng Li:

Quote:
Samtools is not designed for pooling experiments. There are a few callers designed for that, but I do not know which is the best. For estimating allele frequency from DNA pools, someone used to point me to:

http://www.ncbi.nlm.nih.gov/pubmed/21253599

I have never read the paper carefully, though.

Heng

And from Manuel Rivas at the Broad:

Quote:
Hello David,

You can also use Syzygy for pooled data available at:

http://www.broadinstitute.org/software/syzygy

Best,
Manuel

Manuel distinguishes between Syzygy and GATK:

Quote:
Syzygy is used for targeted sequencing applications (customized seq, pooled seq, and is applicable to whole exome sequencing as well) with both individual and pooled level data. For small genomes it would work well.

GATK's functionality is for whole genome applications and whole exome applications with individual level data.

Poking around a bit on the web, it seems like VARiD might be a good option for some kinds of reads, but I've not used it myself. I'd be keen to hear from anyone who knows how VARiD does with pooled samples.

http://compbio.cs.utoronto.ca/varid/

Happy Computing,

DG

Last edited by dagarfield; 03-11-2011 at 07:11 AM.
Old 03-16-2011, 01:04 AM   #5
Jose Blanca
Member
 
Location: Valencia, Spain

Join Date: Aug 2009
Posts: 70

We work with mixed samples coming from different individuals and we have developed an SNP caller to work with them. You can take a look at:

http://bioinf.comav.upv.es/ngs_backbone/

Best regards,

Jose Blanca
Old 03-24-2011, 03:44 PM   #6
mrxcm3
Junior Member
 
Location: Nottingham

Join Date: Oct 2010
Posts: 9

Whilst your specific problem is not one that I have had to consider, I have been working with anonymous (unbarcoded) sample pools. I have found these SNP callers useful:

Syzygy
FreeBayes
VarScan

My understanding of VARiD is that it was not suitable for my application (non-barcoded data) in that it treats DNA from the same read group as originating from the same sample. I may have this wrong though.

Good Luck.

Tags
bcftools, mpileup, samtools, snp calling
