SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Impact on quality of SNP calls using samtools mpileup myi Bioinformatics 3 03-03-2014 07:18 AM
Samtools mpileup calls drastically more SNPs with -I agel Bioinformatics 0 01-20-2012 01:20 PM
samtools/mpileup heterozygous SNPs calling combiochem Bioinformatics 4 08-02-2011 07:05 AM
calling Heterozygous SNPs with samtools mpileup egatti Bioinformatics 1 07-21-2011 08:16 AM
samtools mpileup filter SNPs Hit Bioinformatics 3 05-25-2011 04:55 PM

Reply
 
Thread Tools
Old 08-15-2011, 02:58 AM   #1
TuA
Junior Member
 
Location: CH

Join Date: May 2011
Posts: 9
Question samtools mpileup calls way too less SNPs

Dear all,

For some reasons I get way too less SNPs called with samtools mpileup (0.1.17) that I'm supposed to get.

I work with paired-end SOLiD data (partial genome resequencing of a diploid species, ~20x coverage)

Here is the code I'm using:

Code:
samtools mpileup -u -Q 20 -f ref.fa infile.bam | bcftools view -cgbu - > outfile_raw.bcf
bcftools view outfile_raw.bcf | ./vcfutils.pl varFilter -D1000 > outfile.vcf

I tried different versions of samtools, but consistently get the same result.

For example:
For one sample I got 3250 SNPs with samtools and >21000 SNPs with the SNP tool of the CLC genomics workbench (which is close to what I expect) - although the settings of the CLC workbench were in my opinion much more stringent. When I run samtools without the '-Q 20' option, I get only slightly more SNPs (4225). All SNPs I detect with samtools I also find with the CLC workbench. However, with the CLC workbench I detect many more SNPs in addition to them (see examples below).

Settings CLC workbench SNP calling:
- min base coverage: 2x
- min frequency of variant: 15
- max coverage: 1000x
- min variant count: 2
- sufficient variant count: 4 (SNP called also if frequency < 15)

Attached are two example files
- sample output CLC workbench
- sample output samtools mpileup for same genomic region



Any suggestions why this discrepancy is so large?

Many thanks and best wishes
Attached Files
File Type: txt sample_CLCgenomicsWorkbench.txt (1.7 KB, 71 views)
File Type: txt sample_samtools_mpileup.txt (370 Bytes, 155 views)
TuA is offline   Reply With Quote
Old 08-15-2011, 08:17 AM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Some things to try:

What does the samtools pileup look like at the loci where it didn't call a SNP, but CLC did?

You can try running mpileup with -B or -E. This will alter the BAQ calculations. I have seen on occasion where mpileup will refuse to call sanger-verified SNPs with the BAQ calculations on, which were called once I turned them off.
swbarnes2 is offline   Reply With Quote
Old 08-17-2011, 02:27 PM   #3
dreesbl
Registered Vendor
 
Location: Seattle, WA

Join Date: Nov 2009
Posts: 6
Default

@ TuA

Check the frequency for the SNPs that were found with samtools. One of the options in the mpileup command is:

-p FLOAT A site is considered to be a variant if P(ref|D)<FLOAT [0.5]

It looks like the default behavior is to call only variants with a frequency over 50%.

[email protected]
www.spiralgenetics.com
dreesbl is offline   Reply With Quote
Old 08-17-2011, 03:42 PM   #4
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 694
Default

You can force samtools mpileup with pipe to bcftools to produce lots'o'info like this ...

samtools mpileup -R -d 1000 -l rangehg18.chr -uf hg18.fa in.bam | bcftools view -g - > out.vcf

Note the -g parameter, it forces calls and outputs for all locations.
Check why you're not getting calls in the details of the VCF file.
You can write your own caller based on the VCF information if you need to.
Richard Finney is offline   Reply With Quote
Old 08-24-2011, 07:05 AM   #5
TuA
Junior Member
 
Location: CH

Join Date: May 2011
Posts: 9
Default

Thanks everyone for your suggestions!

I played around with samtools quite a bit now, but didn't get really satisfying results. At the moment I'm therefore a bit suspicious of using samtools for SNP calling.

I looked at the raw.vcf files of samtools (without filtering) to see how the data looks like at positions samtools didn't call a SNP but CLC did (using many different options in samtools, eg '-B' etc..). It seems that samtools is sometimes doing weird things such as calling the reference allele (R) although there is no coverage for the R allele but for an alternative allele (A). Eg:

- DP4 = 0,0,6,0 -> only reference allele called
- DP4 = 1,0,11,0 -> only reference allele called

I guess that most of the differences are somehow linked to the treatment of quality scores.
However, even if I set the CLC settings to high stringency [eg min Q SNP base >40, min Q surrounding bases (window size 11) >20, min coverage 2 etc], I get many more SNPs than with samtools (without 'Q' option).
Almost all SNPs I find with samtools, I also find with CLC. With CLC I just find many more in addition to them. If I run samtools with the '-B' I get much higher DP4 counts (makes sense) but also much higher 'Qual' values (?).

The discrepancy often seems to be due to differences in estimating read counts. Eg:

- samtools R/A counts: 3/0 -> R allele called
- CLC R/A counts: 3/3 -> both alleles called

One could probably say samtools is more conservative, but for everyone interested in an individual's haplotype this can be problematic. For me, the information if individual A is sharing a particular SNP with individual B or if individual B is homozygot for the reference allele is crucial. If samtools is not calling the alternative allele with DP4 = 0,0,6,0, I get a (potentially) wrong data point.

Samtools is apart from GATK the only software (at least that I know of) which provides the 'raw.vcf' format, ie the information for all sequenced sites (not only SNP positions). This format is required to be able to compare data among individuals (population genomics). Working with pure SNP tables, I will not be able to distinguish if a particular polymorphic position has not been sequenced in individual A or if individual A is homozygous for the reference allele.

At the moment I'm now using the CLC workbench with stringent settings and then bedtools to obtain the missing information mentioned above.

Last edited by TuA; 07-11-2012 at 06:53 AM. Reason: forgot to mention GATK
TuA is offline   Reply With Quote
Old 08-24-2011, 07:19 AM   #6
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

I think mpileup does more than just look at the given alignments. It generates a consensus sequence for the alternate allele then tries to map the reads to that. In short many mismatches can occur at the ends of reads due to alignment error, which can be corrected in something resembling a local assembly.
rskr is offline   Reply With Quote
Old 08-24-2011, 07:41 AM   #7
TuA
Junior Member
 
Location: CH

Join Date: May 2011
Posts: 9
Default

Thanks for that input rskr!
TuA is offline   Reply With Quote
Old 08-24-2011, 07:57 AM   #8
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by TuA View Post
Thanks for that input rskr!
I am not 100% sure though, because that is what the manual said for pileup, which is deprecated, I only assume that that functionality is also in mpileup, but I haven't been able to find anywhere that that is the case.
rskr is offline   Reply With Quote
Old 08-25-2011, 12:19 AM   #9
iansealy
Member
 
Location: Hitchin, UK

Join Date: Oct 2010
Posts: 15
Default

Quote:
Originally Posted by TuA View Post
The problem is that samtools is the only software (at least that I know of) which provides the 'raw.vcf' format, ie the information for all sequenced sites (not only SNP positions). This format is required to be able to compare data among individuals (population genomics). Working with pure SNP tables, I will not be able to distinguish if a particular polymorphic position has not been sequenced in individual A or if individual A is homozygous for the reference allele.
The GATK Unified Genotyper can do this too. Check the --output_mode option on http://www.broadinstitute.org/gsa/wi...fied_genotyper. It sounds like you want EMIT_ALL_CONFIDENT_SITES or EMIT_ALL_SITES.

Cheers,
Ian
iansealy is offline   Reply With Quote
Old 08-25-2011, 05:07 AM   #10
Fabien Campagne
Member
 
Location: New York City

Join Date: Feb 2010
Posts: 39
Default

Quote:
Originally Posted by TuA View Post
Thanks everyone for your suggestions!

The problem is that samtools is the only software (at least that I know of) which provides the 'raw.vcf' format, ie the information for all sequenced sites (not only SNP positions). This format is required to be able to compare data among individuals (population genomics).
You can also call genotypes for all sequences positions with Goby. You will need the 'genotypes' output format of the discover-sequence-variants tool to get all sites sequenced.

See the tutorial at http://campagnelab.org/software/goby...ence-variants/

Note that you can convert BAM alignments to Goby format with the tool sam-to-compact.
Fabien Campagne is offline   Reply With Quote
Old 08-26-2011, 10:33 AM   #11
donthu
Junior Member
 
Location: Chicago

Join Date: Oct 2009
Posts: 6
Default

I had similar problem. When I carefully checked my results I found that samtools had output only SNPs from one chromosome. It did not output SNPs from all other chromosomes. I reran mpileup command similar to yours after adding -I. Then I got ten times more SNPs

Last edited by donthu; 08-26-2011 at 10:44 AM.
donthu is offline   Reply With Quote
Old 08-26-2011, 10:43 AM   #12
donthu
Junior Member
 
Location: Chicago

Join Date: Oct 2009
Posts: 6
Default

Quote:
Originally Posted by TuA View Post
Dear all,

For some reasons I get way too less SNPs called with samtools mpileup (0.1.17) that I'm supposed to get.

I work with paired-end SOLiD data (partial genome resequencing of a diploid species, ~20x coverage)

Here is the code I'm using:

Code:
samtools mpileup -u -Q 20 -f ref.fa infile.bam | bcftools view -cgbu - > outfile_raw.bcf
bcftools view outfile_raw.bcf | ./vcfutils.pl varFilter -D1000 > outfile.vcf

I tried different versions of samtools, but consistently get the same result.

For example:
For one sample I got 3250 SNPs with samtools and >21000 SNPs with the SNP tool of the CLC genomics workbench (which is close to what I expect) - although the settings of the CLC workbench were in my opinion much more stringent. When I run samtools without the '-Q 20' option, I get only slightly more SNPs (4225). All SNPs I detect with samtools I also find with the CLC workbench. However, with the CLC workbench I detect many more SNPs in addition to them (see examples below).

Settings CLC workbench SNP calling:
- min base coverage: 2x
- min frequency of variant: 15
- max coverage: 1000x
- min variant count: 2
- sufficient variant count: 4 (SNP called also if frequency < 15)

Attached are two example files
- sample output CLC workbench
- sample output samtools mpileup for same genomic region



Any suggestions why this discrepancy is so large?

Many thanks and best wishes
I had similar problem. When I carefully checked my results I found that samtools had output only SNPs from one chromosome. It did not output SNPs from all other chromosomes. I reran mpileup command similar to yours after adding -I. Then I got ten times more SNPs
donthu is offline   Reply With Quote
Old 10-04-2011, 04:18 AM   #13
cjp
Member
 
Location: Cambridge, United Kingdom

Join Date: Jun 2011
Posts: 58
Default

Quote:
Originally Posted by dreesbl View Post
@ TuA

Check the frequency for the SNPs that were found with samtools. One of the options in the mpileup command is:

-p FLOAT A site is considered to be a variant if P(ref|D)<FLOAT [0.5]

It looks like the default behavior is to call only variants with a frequency over 50%.

[email protected]
www.spiralgenetics.com
Hi,

I was reading this thread and noticed this comment. Here the P(ref|D) is probably referring to probability notation - prob. of ref given D.

In that case, the -p option roughly translates to: "the position is called as a variant if the probability of the site being reference *given* the data observed [using the counts *and* Q-scores of the reference and alternate base calls seen at that position] is less than 0.5".

There is an NGS presentation by Heng Li here that explains it (pages 20-26):

https://www.dropbox.com/s/55nfktmn7l.../HENG%20LI.pdf

On page 23, the final line says P(CC | D) = 0.06 - so as C is the base in the reference sequence - this is the "P(ref|D)" from the -p option. This is important for calling heterozygotes where it is common to see small variations away from seeing exactly 50% bases as reference and 50% as alternate alleles.

Chris
cjp is offline   Reply With Quote
Old 11-07-2011, 08:15 PM   #14
iandbear
Junior Member
 
Location: Australia

Join Date: Jul 2011
Posts: 2
Default

Quote:
Originally Posted by iansealy View Post
The GATK Unified Genotyper can do this too. Check the --output_mode option on http://www.broadinstitute.org/gsa/wi...fied_genotyper. It sounds like you want EMIT_ALL_CONFIDENT_SITES or EMIT_ALL_SITES.

Cheers,
Ian
It seems the option "-out_mode EMIT_ALL_CONFIDENT_SITES" does give monomorphic variants in addition to variants.
If I use the default setting "variant only", at the end of variant calling the INFO given by Unified Genotyper shows % confidently called bases of all bases drop down to 2%.
But with "-out_mode EMIT_ALL_CONFIDENT_SITES", % confidently called bases of all bases jump back to 96%.
Any idea about that? Thanks!
iandbear is offline   Reply With Quote
Old 12-05-2012, 02:55 AM   #15
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

Quote:
Originally Posted by iansealy View Post
The GATK Unified Genotyper can do this too. Check the --output_mode option on http://www.broadinstitute.org/gsa/wi...fied_genotyper. It sounds like you want EMIT_ALL_CONFIDENT_SITES or EMIT_ALL_SITES.

Cheers,
Ian
As can freebayes. By default it outputs all sites of even extremely low confidence, P(site polymorphic | data) >= 0.0001 (parameter -P --pvar).

The intent is that users filter their results downstream, and also that it be almost impossible to run the program and get no output due to default filtering options. Generally, this provides more flexibility than filtering the site prior to reporting, but it does produce copious amounts of output.

I found this thread because samtools is refusing to produce output in a (simulated) case where both freebayes and the gatk work just fine without any tweaks...
ekg is offline   Reply With Quote
Old 12-05-2012, 04:00 AM   #16
ekg
Member
 
Location: Boston, MA

Join Date: Apr 2010
Posts: 36
Default

It appears that samtools is not using reads which are not marked as proper pairs. This was my problem, and it led me to this thread. I fixed it with bamtools resolve.
ekg is offline   Reply With Quote
Old 02-15-2018, 08:32 PM   #17
Soumya18
Junior Member
 
Location: Indianapolis,USA

Join Date: Feb 2018
Posts: 1
Default Need more information!

Dear All,

Could you please let me know what is rangehg18.chr?

Quote:
Originally Posted by Richard Finney View Post
You can force samtools mpileup with pipe to bcftools to produce lots'o'info like this ...

samtools mpileup -R -d 1000 -l rangehg18.chr -uf hg18.fa in.bam | bcftools view -g - > out.vcf

Note the -g parameter, it forces calls and outputs for all locations.
Check why you're not getting calls in the details of the VCF file.
You can write your own caller based on the VCF information if you need to.

Last edited by Soumya18; 02-15-2018 at 08:35 PM.
Soumya18 is offline   Reply With Quote
Old 03-01-2018, 04:17 PM   #18
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 694
Default

The "-l parameter " (i.e. "dash el") specifies a file of (genomic postions) OR (genomic ranges in bed format) to produce output for. Regions OR positions not specified are dropped (not produced in output, i.e. "skipped").


typing "samtools mpileup" on the command line provides information on all the parameters appropriate to the "mpileup" feature of samtools.

bash-3.2$ samtools mpileup

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
-6, --illumina1.3+ quality is in the Illumina-1.3+ encoding
... [ deleted stuff ]
-G, --exclude-RG FILE exclude read groups listed in FILE
-l, --positions FILE skip unlisted positions (chr pos) or regions (BED)
... [ deleted stuff ]
Richard Finney is offline   Reply With Quote
Reply

Tags
clc genomics workbench, samtools mpileup

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:59 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO