Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • haploid variant calling in SAMtools

    We’re using the standard SAMtools SNP calling protocol (samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf) to look for sequence variants in bacterial genomes.

    I’m really new to this, but it sounds like the protocol is geared towards diploid organisms by default. I hoping someone might know if there’s anything that should be changed to use it on haploid genomes? The –s or –p options in SAMtools mpileup look like they might be useful, but I’ve no feel for how to apply them correctly for this. Any help would be really appreciated!

  • #2
    I use samtools on bacteria all the time, and it's not really a problem. Remember that just because your organism is haploid doesn't mean that you have been giving a clonal culture to analyze.

    Comment


    • #3
      We found that the following workflow works well for variant calling in haploids using samtools. To generate a consensus sequence from the realigned BAM file in FASTQ format, we used:

      samtools mpileup -uf ref.fasta realigned.bam | bcftools view -cg -s sample.txt - | perl vcfutils.pl vcf2fq > consensus.fq

      where sample.txt is a plain text file with the sample name in the first column and the ploidy level in the second column. For example,

      sample_name 1

      The columns should be tab-delimited.

      Comment


      • #4
        Keep in mind that vcfutils vcf2fq will not handle indels. It just creates a small window of lowercase letters around the putative indel.

        Comment


        • #5
          Why not use freebayes? It's designed to support haploids and polyploids (use the --ploidy option).

          Comment


          • #6
            Also, you can use vcfgeno2haplo (https://github.com/ekg/vcflib/blob/m...geno2haplo.cpp) with an arbitrarily large window size to generate a consensus fasta sequence from the output.

            If you input a file with one sample, the recorded consensus sequence will be the ALT in the single VCF record in the output.

            It works for indels as well as SNPs. It does not generate quality estimates.

            Comment


            • #7
              This paper (apologies for self-publicity) might be of interest to you:

              High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2012


              Fast highly specific and sensitive variant discovery and genotyping for microbial genomes by multi-sample de novo assembly, allowing direct genome comparison without using a reference. The paper contains examples with short and long reads, reproducing results from published papers (drug-resistant/susceptible S.aureus, in-host evolution) with a fraction of the time and effort.

              Comment


              • #8
                @Zam, this looks great! I'm looking forward to reading it.

                Comment


                • #9
                  Hi,

                  I've been reading through a number of threads and manuals about this and still don't understand how to identify if SNPs are homozygous or heterozygous when viewing the vcf file. Has anyone seen any issues with using the -s parameter in the bcftools view for setting the ploidy as mentioned above? Not sure if it matters or not, but my samples aren't bacteria, but are haploid males. Thanks!

                  Oh and yes, we are looking into Freebayes too.

                  Comment


                  • #10
                    Hi again. So I decided to do a test a merged sample (merged from three runs of the same sample) and got an error I'm not sure about.

                    Here is the command I used: samtools mpileup –Dsuf ref.fa Sample.bam | bcftools view –bcgv –s Sample.txt - > Sample.raw.bcf

                    Then got this:
                    [mpileup] 1 samples in 1 input file
                    <mpileup> set max per-file depth to 8000
                    <bcf_hdr_subsam> 1 samples in the list but not in BCF.

                    So it's something wrong with the text file I used for the -s parameter in bcftools for ploidy(I think?). The text file seems like it would be pretty straight forward, but I don't know what could cause the issue for the "samples in list but not in BCF" error. Any ideas? Thanks.

                    Comment


                    • #11
                      Hey kjm,
                      Assuming you haven't answered this on your own yet (or assuming someone else is having a similar error and needs the thread to not die at your question), I think I have an answer.

                      If you are piping it the way you are in your post, making the "Sample.txt" file simply a dash, tab, and 1 should cover it. That way it reads your file name as the piped filename and continues as usual. Got rid of my "<bcf_hdr_subsam> 1 samples in the list but not in BCF." error anyway... Hope that helps someone.

                      Sample.txt should read:
                      - 1

                      Comment


                      • #12
                        variant calling in SAMtools

                        Dear ZLounsberry,

                        I saw your answer , but I didn't understand how the text file should be .

                        I am running the following command:
                        samtools mpileup –uf ucsc.hg19.fasta child.bam father.bam mother.bam | bcftools view -bvcgT trioauto -s family.txt - > femvar.vcf &

                        My txt file is:

                        child.bam
                        father.bam
                        mother.bam


                        I get an error:
                        [mpileup] 3 samples in 3 input files
                        <mpileup> Set max per-file depth to 2666
                        <bcf_hdr_subsam> 3 samples in the list but not in BCF.


                        Do you have any suggestion how to fix the problem?
                        Thank you for your help,

                        Comment


                        • #13
                          Hello polpol,
                          Try changing your "family.txt" file to read:

                          - 1

                          instead of:

                          child.bam
                          father.bam
                          mother.bam

                          If that does not work, try:

                          - 3

                          (so a dash, a tab character, and a 1 or, if that doesn't work, maybe a 3). It's a bit off topic for this thread, so feel free to email me (my username AT gmail DOT com) and let me know if you need a hand.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          25 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          24 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X