Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • samtools mpileup calls way too less SNPs

    Dear all,

    For some reasons I get way too less SNPs called with samtools mpileup (0.1.17) that I'm supposed to get.

    I work with paired-end SOLiD data (partial genome resequencing of a diploid species, ~20x coverage)

    Here is the code I'm using:

    Code:
    samtools mpileup -u -Q 20 -f ref.fa infile.bam | bcftools view -cgbu - > outfile_raw.bcf
    bcftools view outfile_raw.bcf | ./vcfutils.pl varFilter -D1000 > outfile.vcf

    I tried different versions of samtools, but consistently get the same result.

    For example:
    For one sample I got 3250 SNPs with samtools and >21000 SNPs with the SNP tool of the CLC genomics workbench (which is close to what I expect) - although the settings of the CLC workbench were in my opinion much more stringent. When I run samtools without the '-Q 20' option, I get only slightly more SNPs (4225). All SNPs I detect with samtools I also find with the CLC workbench. However, with the CLC workbench I detect many more SNPs in addition to them (see examples below).

    Settings CLC workbench SNP calling:
    - min base coverage: 2x
    - min frequency of variant: 15
    - max coverage: 1000x
    - min variant count: 2
    - sufficient variant count: 4 (SNP called also if frequency < 15)

    Attached are two example files
    - sample output CLC workbench
    - sample output samtools mpileup for same genomic region



    Any suggestions why this discrepancy is so large?

    Many thanks and best wishes
    Attached Files

  • #2
    Some things to try:

    What does the samtools pileup look like at the loci where it didn't call a SNP, but CLC did?

    You can try running mpileup with -B or -E. This will alter the BAQ calculations. I have seen on occasion where mpileup will refuse to call sanger-verified SNPs with the BAQ calculations on, which were called once I turned them off.

    Comment


    • #3
      @ TuA

      Check the frequency for the SNPs that were found with samtools. One of the options in the mpileup command is:

      -p FLOAT A site is considered to be a variant if P(ref|D)<FLOAT [0.5]

      It looks like the default behavior is to call only variants with a frequency over 50%.

      [email protected]
      www.spiralgenetics.com

      Comment


      • #4
        You can force samtools mpileup with pipe to bcftools to produce lots'o'info like this ...

        samtools mpileup -R -d 1000 -l rangehg18.chr -uf hg18.fa in.bam | bcftools view -g - > out.vcf

        Note the -g parameter, it forces calls and outputs for all locations.
        Check why you're not getting calls in the details of the VCF file.
        You can write your own caller based on the VCF information if you need to.

        Comment


        • #5
          Thanks everyone for your suggestions!

          I played around with samtools quite a bit now, but didn't get really satisfying results. At the moment I'm therefore a bit suspicious of using samtools for SNP calling.

          I looked at the raw.vcf files of samtools (without filtering) to see how the data looks like at positions samtools didn't call a SNP but CLC did (using many different options in samtools, eg '-B' etc..). It seems that samtools is sometimes doing weird things such as calling the reference allele (R) although there is no coverage for the R allele but for an alternative allele (A). Eg:

          - DP4 = 0,0,6,0 -> only reference allele called
          - DP4 = 1,0,11,0 -> only reference allele called

          I guess that most of the differences are somehow linked to the treatment of quality scores.
          However, even if I set the CLC settings to high stringency [eg min Q SNP base >40, min Q surrounding bases (window size 11) >20, min coverage 2 etc], I get many more SNPs than with samtools (without 'Q' option).
          Almost all SNPs I find with samtools, I also find with CLC. With CLC I just find many more in addition to them. If I run samtools with the '-B' I get much higher DP4 counts (makes sense) but also much higher 'Qual' values (?).

          The discrepancy often seems to be due to differences in estimating read counts. Eg:

          - samtools R/A counts: 3/0 -> R allele called
          - CLC R/A counts: 3/3 -> both alleles called

          One could probably say samtools is more conservative, but for everyone interested in an individual's haplotype this can be problematic. For me, the information if individual A is sharing a particular SNP with individual B or if individual B is homozygot for the reference allele is crucial. If samtools is not calling the alternative allele with DP4 = 0,0,6,0, I get a (potentially) wrong data point.

          Samtools is apart from GATK the only software (at least that I know of) which provides the 'raw.vcf' format, ie the information for all sequenced sites (not only SNP positions). This format is required to be able to compare data among individuals (population genomics). Working with pure SNP tables, I will not be able to distinguish if a particular polymorphic position has not been sequenced in individual A or if individual A is homozygous for the reference allele.

          At the moment I'm now using the CLC workbench with stringent settings and then bedtools to obtain the missing information mentioned above.
          Last edited by TuA; 07-11-2012, 06:53 AM. Reason: forgot to mention GATK

          Comment


          • #6
            I think mpileup does more than just look at the given alignments. It generates a consensus sequence for the alternate allele then tries to map the reads to that. In short many mismatches can occur at the ends of reads due to alignment error, which can be corrected in something resembling a local assembly.

            Comment


            • #7
              Thanks for that input rskr!

              Comment


              • #8
                Originally posted by TuA View Post
                Thanks for that input rskr!
                I am not 100% sure though, because that is what the manual said for pileup, which is deprecated, I only assume that that functionality is also in mpileup, but I haven't been able to find anywhere that that is the case.

                Comment


                • #9
                  Originally posted by TuA View Post
                  The problem is that samtools is the only software (at least that I know of) which provides the 'raw.vcf' format, ie the information for all sequenced sites (not only SNP positions). This format is required to be able to compare data among individuals (population genomics). Working with pure SNP tables, I will not be able to distinguish if a particular polymorphic position has not been sequenced in individual A or if individual A is homozygous for the reference allele.
                  The GATK Unified Genotyper can do this too. Check the --output_mode option on http://www.broadinstitute.org/gsa/wi...fied_genotyper. It sounds like you want EMIT_ALL_CONFIDENT_SITES or EMIT_ALL_SITES.

                  Cheers,
                  Ian

                  Comment


                  • #10
                    Originally posted by TuA View Post
                    Thanks everyone for your suggestions!

                    The problem is that samtools is the only software (at least that I know of) which provides the 'raw.vcf' format, ie the information for all sequenced sites (not only SNP positions). This format is required to be able to compare data among individuals (population genomics).
                    You can also call genotypes for all sequences positions with Goby. You will need the 'genotypes' output format of the discover-sequence-variants tool to get all sites sequenced.

                    See the tutorial at http://campagnelab.org/software/goby...ence-variants/

                    Note that you can convert BAM alignments to Goby format with the tool sam-to-compact.

                    Comment


                    • #11
                      I had similar problem. When I carefully checked my results I found that samtools had output only SNPs from one chromosome. It did not output SNPs from all other chromosomes. I reran mpileup command similar to yours after adding -I. Then I got ten times more SNPs
                      Last edited by donthu; 08-26-2011, 10:44 AM.

                      Comment


                      • #12
                        Originally posted by TuA View Post
                        Dear all,

                        For some reasons I get way too less SNPs called with samtools mpileup (0.1.17) that I'm supposed to get.

                        I work with paired-end SOLiD data (partial genome resequencing of a diploid species, ~20x coverage)

                        Here is the code I'm using:

                        Code:
                        samtools mpileup -u -Q 20 -f ref.fa infile.bam | bcftools view -cgbu - > outfile_raw.bcf
                        bcftools view outfile_raw.bcf | ./vcfutils.pl varFilter -D1000 > outfile.vcf

                        I tried different versions of samtools, but consistently get the same result.

                        For example:
                        For one sample I got 3250 SNPs with samtools and >21000 SNPs with the SNP tool of the CLC genomics workbench (which is close to what I expect) - although the settings of the CLC workbench were in my opinion much more stringent. When I run samtools without the '-Q 20' option, I get only slightly more SNPs (4225). All SNPs I detect with samtools I also find with the CLC workbench. However, with the CLC workbench I detect many more SNPs in addition to them (see examples below).

                        Settings CLC workbench SNP calling:
                        - min base coverage: 2x
                        - min frequency of variant: 15
                        - max coverage: 1000x
                        - min variant count: 2
                        - sufficient variant count: 4 (SNP called also if frequency < 15)

                        Attached are two example files
                        - sample output CLC workbench
                        - sample output samtools mpileup for same genomic region



                        Any suggestions why this discrepancy is so large?

                        Many thanks and best wishes
                        I had similar problem. When I carefully checked my results I found that samtools had output only SNPs from one chromosome. It did not output SNPs from all other chromosomes. I reran mpileup command similar to yours after adding -I. Then I got ten times more SNPs

                        Comment


                        • #13
                          Originally posted by dreesbl View Post
                          @ TuA

                          Check the frequency for the SNPs that were found with samtools. One of the options in the mpileup command is:

                          -p FLOAT A site is considered to be a variant if P(ref|D)<FLOAT [0.5]

                          It looks like the default behavior is to call only variants with a frequency over 50%.

                          [email protected]
                          www.spiralgenetics.com
                          Hi,

                          I was reading this thread and noticed this comment. Here the P(ref|D) is probably referring to probability notation - prob. of ref given D.

                          In that case, the -p option roughly translates to: "the position is called as a variant if the probability of the site being reference *given* the data observed [using the counts *and* Q-scores of the reference and alternate base calls seen at that position] is less than 0.5".

                          There is an NGS presentation by Heng Li here that explains it (pages 20-26):



                          On page 23, the final line says P(CC | D) = 0.06 - so as C is the base in the reference sequence - this is the "P(ref|D)" from the -p option. This is important for calling heterozygotes where it is common to see small variations away from seeing exactly 50% bases as reference and 50% as alternate alleles.

                          Chris

                          Comment


                          • #14
                            Originally posted by iansealy View Post
                            The GATK Unified Genotyper can do this too. Check the --output_mode option on http://www.broadinstitute.org/gsa/wi...fied_genotyper. It sounds like you want EMIT_ALL_CONFIDENT_SITES or EMIT_ALL_SITES.

                            Cheers,
                            Ian
                            It seems the option "-out_mode EMIT_ALL_CONFIDENT_SITES" does give monomorphic variants in addition to variants.
                            If I use the default setting "variant only", at the end of variant calling the INFO given by Unified Genotyper shows % confidently called bases of all bases drop down to 2%.
                            But with "-out_mode EMIT_ALL_CONFIDENT_SITES", % confidently called bases of all bases jump back to 96%.
                            Any idea about that? Thanks!

                            Comment


                            • #15
                              Originally posted by iansealy View Post
                              The GATK Unified Genotyper can do this too. Check the --output_mode option on http://www.broadinstitute.org/gsa/wi...fied_genotyper. It sounds like you want EMIT_ALL_CONFIDENT_SITES or EMIT_ALL_SITES.

                              Cheers,
                              Ian
                              As can freebayes. By default it outputs all sites of even extremely low confidence, P(site polymorphic | data) >= 0.0001 (parameter -P --pvar).

                              The intent is that users filter their results downstream, and also that it be almost impossible to run the program and get no output due to default filtering options. Generally, this provides more flexibility than filtering the site prior to reporting, but it does produce copious amounts of output.

                              I found this thread because samtools is refusing to produce output in a (simulated) case where both freebayes and the gatk work just fine without any tweaks...

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              74 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              82 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X