  • d17
    Member
    • Sep 2008
    • 27

    calling SNPs in haploid genomes

    Does anyone have any thoughts on calling SNPs from short read data (e.g. Illumina) in haploid genomes? It seems that many SNP calling programs are set up to deal only with diploid genomes (e.g. GATK's UnifiedGenotyper).

    I found the program FreeBayes from the Marth Lab which allows you to specify the ploidy. This looks like a good candidate and I will definitely try it. It appears to be unpublished.

    Does anyone have any experience with calling SNPs in haploid genomes using FreeBayes or another program?

    Thanks!
  • flipwell
    Member
    • Feb 2011
    • 14

    #2
    Did you try FreeBayes? I'm facing this problem now and wondering what to use. I've tried GATK and it does appear to work (very superficial examination) but am concerned there might be issues I'm not seeing

    Comment

    • d17
      Member
      • Sep 2008
      • 27

      #3
      Originally posted by flipwell View Post
      Did you try FreeBayes? I'm facing this problem now and wondering what to use. I've tried GATK and it does appear to work (very superficial examination) but am concerned there might be issues I'm not seeing
      I did try FreeBayes, and I was able to get it to work over a small region for a single sample, but when I expanded to call SNPs over whole chromosomes for several samples at once it no longer worked (seemed to hang/freeze and didn't provide any error messages).

      What I ended up doing was using GATK's UnifiedGenotyper, manually extracting the likelihoods for both of the homozygote genotypes, and calling a SNP if the likelihood of the alternative allele was above a certain amount higher than the likelihood of the reference allele (I believe I required the likelihood of the alt allele to be at least 3X greater than the ref allele, although I haven't tested extensively to find the best threshold).
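The likelihood-ratio filter described above could be sketched roughly as follows. This assumes the VCF carries a Phred-scaled PL field in the order hom-ref, het, hom-alt (the usual layout for a biallelic diploid call); the function names and the example record are made up for illustration, and the 3x threshold is simply the one mentioned above, not a tuned value.

```python
def alt_vs_ref_ratio(vcf_line, sample_index=0):
    """Return L(hom-alt) / L(hom-ref) for one biallelic VCF record, or None.

    Assumption: FORMAT contains PL with three values (0/0, 0/1, 1/1).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]
    if alt == "." or "," in alt:  # no ALT allele, or multi-allelic: skip in this sketch
        return None
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    sample = dict(zip(keys, values))
    pl = sample.get("PL")
    if pl is None or pl == ".":
        return None
    pl_ref, _pl_het, pl_alt = (int(x) for x in pl.split(","))
    # PL = -10 * log10(likelihood), so a PL difference is a log10 likelihood ratio.
    return 10 ** ((pl_ref - pl_alt) / 10.0)


def is_haploid_snp(vcf_line, min_ratio=3.0):
    """Call a SNP when the hom-alt likelihood is at least min_ratio times hom-ref."""
    ratio = alt_vs_ref_ratio(vcf_line)
    return ratio is not None and ratio >= min_ratio


# Hypothetical record, for illustration only.
example = "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:PL\t1/1:60,10,0"
print(is_haploid_snp(example))
```

The heterozygous likelihood is deliberately ignored, since a het genotype has no meaning for a haploid sample.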

      Comment

      • gaffa
        Member
        • Oct 2010
        • 82

        #4
        I have used FreeBayes on haploid sequences with good results; it is recommended.

        Comment

        • d17
          Member
          • Sep 2008
          • 27

          #5
          Originally posted by gaffa View Post
          I have used FreeBayes on haploid sequences with good results; it is recommended.
          Could you be more specific about what you mean by good results? Did you compare FreeBayes to any other programs?

          Comment

          • wanguan2000
            Member
            • Nov 2010
            • 24

            #6
            I am confused by this freebayes option:
            -H --diploid-reference
            If using the reference sequence as a sample (default),
            treat it as diploid. default: false (reference is haploid)
            My understanding is this:
            human (diploid): -H false
            bacteria (haploid): -H true
            But I found a lot of heterozygous SNPs (50%) in bacteria in my VCF results.
            What is wrong?

            Comment

            • gaffa
              Member
              • Oct 2010
              • 82

              #7
              You also need to set the ploidy of the sample using the -p flag (i.e. -p 1; the default is 2).
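As a rough illustration of the suggestion above, here is a small helper that assembles a freebayes command line with the ploidy set to 1. The file names are hypothetical, and only the `--fasta-reference`, `--ploidy` and `--bam` options are used; check them against the help output of your installed version before relying on this.

```python
def freebayes_haploid_cmd(reference_fasta, bam_path, ploidy=1):
    """Build (but do not run) a freebayes command with an explicit ploidy."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [
        "freebayes",
        "--fasta-reference", reference_fasta,
        "--ploidy", str(ploidy),   # default is 2, so haploid samples need -p 1
        "--bam", bam_path,
    ]


# Hypothetical file names, for illustration only.
cmd = freebayes_haploid_cmd("ref.fa", "sample.bam")
print(" ".join(cmd))
# To actually run it and capture the VCF written to stdout, one could use:
#   import subprocess
#   with open("calls.vcf", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```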

              Comment

              • wanguan2000
                Member
                • Nov 2010
                • 24

                #8
                I am confused by this freebayes option:
                -H --diploid-reference
                If using the reference sequence as a sample (default),
                treat it as diploid. default: false (reference is haploid)

                My understanding is this:
                human (diploid): -H false
                bacteria (haploid): -H true
                But I found a lot of heterozygous SNPs (50%) in bacteria in my VCF results.

                My other understanding is:
                human (23 chromosomes): -H false
                human (23*2 chromosomes): -H true

                Which is true?

                -p --ploidy N Sets the default ploidy for the analysis to N. default: 2

                So for a haploid genome, do I just set -p 1, with no need to set -H?

                Comment

                • garwuf
                  Junior Member
                  • Mar 2009
                  • 7

                  #9
                  I gave Freebayes quite an extensive try recently and wouldn't recommend it in its current state. I tried it on several bacterial datasets (4-6 Mb in size) that had previously been evaluated with Gigabayes, Samtools and GATK, and found that Freebayes reports nonexistent SNPs while missing well-defined ones. In fact, not a single SNP was correctly predicted, no matter which parameters were used.

                  Then, after reading d17's post above, I decided to try Freebayes on a smaller reference. I generated two artificial read sets against a 128 kb template with 10 variant sites of differing complexity. One set gave 50x coverage and the other 400x, and the alignment was performed with bwa. On these alignments Freebayes generated sane VCF output: no false positives, and several SNPs were detected correctly. Still, the efficiency was quite low: for the 50x dataset it never reported more than 3 of the 10 variants, and for the 400x dataset it reported 4-5 depending on settings. For comparison, Samtools 0.1.18 detected all 10 variants even on the 50x dataset.
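A comparison like the one above (known variant sites versus what a caller reports) can be scored with a few lines of Python. This is only a sketch: the positions are invented, and matching on position alone ignores whether the reported allele is the right one.

```python
def score_calls(truth_positions, called_positions):
    """Return (recall, number_of_false_positives) for position-level matching."""
    truth = set(truth_positions)
    called = set(called_positions)
    if not truth:
        raise ValueError("truth set is empty")
    true_positives = truth & called
    false_positives = called - truth
    return len(true_positives) / len(truth), len(false_positives)


# Invented positions: 10 planted variants, of which a caller reports 3.
planted = [1200, 5400, 9100, 20050, 33300, 47000, 61000, 80500, 99000, 120000]
reported = [1200, 33300, 99000]
recall, n_false = score_calls(planted, reported)
print(f"recall={recall:.2f} false_positives={n_false}")
```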

                  To my mind, Freebayes may have some problem handling cached sequence data, which would explain why it works with kb-sized references but fails on Mb-sized ones. On the other hand, it's still being developed, so maybe these bugs will eventually be fixed.

                  Comment

                  • wanguan2000
                    Member
                    • Nov 2010
                    • 24

                    #10
                    what about samtools vs GATK snp efficiency for ploidy?

                    Comment

                    • garwuf
                      Junior Member
                      • Mar 2009
                      • 7

                      #11
                      Originally posted by wanguan2000 View Post
                      what about samtools vs GATK snp efficiency for ploidy?
                      I don't quite get what you mean by "efficiency for ploidy". GATK is optimized for diploid genomes, but it can still be used on haploid ones. The genotype part of the VCF output may be screwed up, but it will detect SNPs anyway. When searching for SNPs/indels in haploid genomes, samtools is clearly superior to GATK, but that is rather because of the difference in search algorithms. At best, GATK reports ~60% of the variants detected by samtools. GATK's UnifiedGenotyper is still not good with indels, although they made some progress over the last year. Gigabayes was almost as good as samtools until version 0.1.15, even though it can operate only on Mosaik alignments. The most recent samtools versions (0.1.17-0.1.18) perform noticeably better than Gigabayes with regard to the "correct variant/false positive" ratio. I still run Gigabayes alongside samtools, just because it can sometimes detect a variant overlooked by samtools. But this is a rare event, something like 1-2 variants per 4 Mb-sized genome.
                      Last edited by garwuf; 11-29-2011, 07:28 AM.

                      Comment

                      • wanguan2000
                        Member
                        • Nov 2010
                        • 24

                        #12
                        Thank you, garwuf, for the explanation. I think you mean that samtools 0.1.15 is better than GATK for haploid SNP calling?
                        Both the samtools and GATK VCF results contain heterozygous SNPs for haploid samples. Are those SNPs reliable or not? Freebayes's results, by contrast, contain only homozygous SNPs.
                        I wonder why heterozygous SNPs occur in a haploid genome.
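To see which records are affected, the heterozygous genotype calls can be pulled out of a VCF for closer inspection. The sketch below assumes a standard GT field in the FORMAT column; the function name and the example lines are hypothetical.

```python
def heterozygous_records(vcf_lines, sample_index=0):
    """Yield (chrom, pos, genotype) for records whose GT has differing alleles."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype columns
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt = fields[9 + sample_index].split(":")[keys.index("GT")]
        alleles = gt.replace("|", "/").split("/")
        # A haploid-style GT such as "1" has a single allele and is never het.
        if len(alleles) > 1 and "." not in alleles and len(set(alleles)) > 1:
            yield fields[0], int(fields[1]), gt


# Invented records, for illustration only.
lines = [
    "##fileformat=VCFv4.1",
    "chr1\t10\t.\tA\tG\t50\t.\t.\tGT\t0/1",
    "chr1\t20\t.\tC\tT\t50\t.\t.\tGT\t1/1",
]
for record in heterozygous_records(lines):
    print(record)
```

In a haploid sample these calls cannot be true genotypes, so they are worth checking by hand rather than trusting as reported.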

                        Comment

                        • jflowers
                          Member
                          • Oct 2011
                          • 42

                          #13
                          I also need to call SNPs on haploid genomes. It looks like methods like samtools mpileup / bcftools won't work because the Bayes snp-calling formula uses the allele frequency spectrum as the prior (but the AFS is estimated assuming diploidy).

                          Can anyone suggest a workaround?
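One conceivable workaround is to skip the population prior entirely and compute a per-site haploid posterior directly from the pileup. The sketch below is a deliberately simple model and not what samtools/bcftools does: it assumes independent reads, a flat prior over A/C/G/T unless another is supplied, and a base error spread evenly over the three other bases. All names are hypothetical.

```python
import math


def haploid_posteriors(bases, quals, prior=None):
    """Posterior probability of each allele at one site in a haploid sample.

    bases: observed base calls covering the site (e.g. "GGGA")
    quals: Phred base qualities, one per observed base
    prior: optional dict of positive prior probabilities per allele
    """
    alleles = "ACGT"
    prior = prior or {a: 0.25 for a in alleles}
    log_post = {}
    for allele in alleles:
        lp = math.log(prior[allele])
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10.0)
            err = min(max(err, 1e-10), 0.75)  # keep both log terms finite
            lp += math.log(1.0 - err) if base == allele else math.log(err / 3.0)
        log_post[allele] = lp
    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    total = sum(math.exp(v - top) for v in log_post.values())
    return {a: math.exp(v - top) / total for a, v in log_post.items()}


# Invented pileup: four reads support G, one supports A, all at Q30.
post = haploid_posteriors("GGGGA", [30] * 5)
print(max(post, key=post.get))
```

A site would then be reported as a SNP when the most probable allele differs from the reference and its posterior clears whatever cutoff you choose.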

                          Comment

                          • jgibbons1
                            Senior Member
                            • Oct 2009
                            • 135

                            #14
                            I've been using Maq (http://maq.sourceforge.net/maq-man.shtml) for SNP detection in my haploid system. No complaints whatsoever.

                            Comment

                            • Kasycas
                              Member
                              • Sep 2009
                              • 22

                              #15
                              Hi jgibbons1, I've been using MAQ as well but the snp output is useless without annotation. Have you come across a good way to annotate the output that MAQ produces?

                              Thanks!

                              Comment
