Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SAMTools mpileup issue

    Hi there,
    I am following the following workflow:

    Sample ID from 1000 Genomes project: HG00096

    First, indexing the bam file using samtool index
    Second, sorting the bam file using samtools sort eg2.bam eg2.sorted
    Third, generating a bcf file using samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

    The reference human genome file is the one obtained from 1kGenomes, Human_37_*.fasta

    Lastly, converting my bcf file to vcf using bcftools view *.bcf | perl /usr/share/samtools/vcfutils.pl varFilter -D100 > *.vcf

    I am interested in knowing genotypes for rs numbered dbSNPs and other variants for which I have HG37 genomic coordinates.

    When i follow the above workflow, i am not getting 0/0 genotypes and at the same time some of the genomic locations also do not show up. This is why my first question was if absence of variation should be considered a 0/0 genotype?

    Thanks,
    Ashwin

  • #2
    'bcftools view -bvcg' will only show you the variant locations. So you won't get sites where all the samples in the .bam are homozygous reference. 'bcftools view -bcg' should do the trick.

    Comment


    • #3
      bcftools view -bvcg will give a variants-only vcf. However, whether this is a good way of getting genotypes is a bit unclear, but may not be appropriate (depending on your application).

      In this post (http://biostar.stackexchange.com/que...quality-scores), lh3 suggests that mpileup / bcftools is not really meant for calling genotypes, but instead making inferences about the data when genotypes are unknown. In fact, lh3 suggests that calling genotypes is a "flaw" of bcftools (see also the original paper, Bioinformatics. 27(21) 2987-2993).

      This seems like a subtle, but important point.

      Comment


      • #4
        i think this is a really good point. what i am after is to identify the genotype at all genomic locations but it seems that mplieup is really meat to find variations.

        also, i deleted the -v switch and i still cannot get certain genomic locations to show up in my VCF file. should i assume the absence of that location to be homozygous normal at that location?

        Comment


        • #5
          I have some of the same questions as you ashkot when it comes to getting a consensus genome, [especially in light of the "flaw" concerning genotypes inferred by bcftools mentioned by lh3 I mentioned in my previous post on this thread]. So, I'm not really sure I can be of much help.

          The topic of generating a consensus has also been discussed here:
          Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


          I would defer to comments made in that thread since lh3 developed the methods in question. But, what's not clear to me is whether the "flaws" with using bcftools to get genotypes in the vcf (mentioned previously in this thread) lead to bad genotypes being propagated throughout the consensus. Or does the method of calling genotypes in the consensus using vcfutils.pl vcf2fq differ from the method used by bcftools to call genotypes in the vcf. Can someone more knowledgable chime in here?

          Anyhow, generating a consensus is described here:


          As far as missing sites in your vcf file, have you looked at the sites that are missing using samtools tview to see if there are reads spanning those sites?

          Comment


          • #6
            Originally posted by ashkot View Post
            Hi there,

            First, indexing the bam file using samtool index
            Second, sorting the bam file using samtools sort eg2.bam eg2.sorted
            Third, generating a bcf file using samtools mpileup -uf reference_genome_file eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf
            Hi Ashwin,

            I don't know if this would matter, but I have always sorted the bam file first and then created it's index.

            The thing is, based on the steps that you have mentioned, your index file would be based on your original bam file, and since you are using the sorted bam file in your mpileup, this might be causing some issues.

            Again this is just a thought, but you might want to try it once.

            Thanks,
            Praful

            Comment


            • #7
              Originally posted by aggp11 View Post
              Hi Ashwin,
              The thing is, based on the steps that you have mentioned, your index file would be based on your original bam file, and since you are using the sorted bam file in your mpileup, this might be causing some issues.

              Again this is just a thought, but you might want to try it once.

              Thanks,
              Praful
              You are right, but mpileup doesn't look at the index, so that's not the issue. The issue is that mpileup really isn't designed to do what the poster wants to do with it, so it's working correctly, but not in the off-label way that the poster hoped it would.

              I'm not sure what tool would be more appropriate. What I think will work is making a .bed file of your genotype loci, and using intersectBed with that and your .bam. That'll filter out all the reads that don't cover your genotype loci, which I think will make the vcf smaller, and easier to manage. Then make the all-points vcf by leaving out the -v, then use intersectBed again with that .bed against the vcf. That should leave only the mpileup entries that hit your genotype loci, and since it's all points, the 0/0 entries shoud be there.

              Comment


              • #8
                Originally posted by swbarnes2 View Post
                You are right, but mpileup doesn't look at the index, so that's not the issue. The issue is that mpileup really isn't designed to do what the poster wants to do with it, so it's working correctly, but not in the off-label way that the poster hoped it would.

                I'm not sure what tool would be more appropriate. What I think will work is making a .bed file of your genotype loci, and using intersectBed with that and your .bam. That'll filter out all the reads that don't cover your genotype loci, which I think will make the vcf smaller, and easier to manage. Then make the all-points vcf by leaving out the -v, then use intersectBed again with that .bed against the vcf. That should leave only the mpileup entries that hit your genotype loci, and since it's all points, the 0/0 entries shoud be there.
                Thanks for the information on mpileup. And I like your idea of using intersectBed here.

                Comment


                • #9
                  Oh, and I'll add if you get rid of the -v, and a locus doesn't show up, I'm pretty srue that's an indicator that you have no reads there, or at least no reads of any quality.

                  Comment


                  • #10
                    hi there, thanks for you input, i am going to give intersetBed a try and see what i get. I'll keep everybody informed about the outcome.

                    thanks again.

                    ashwin

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    25 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    27 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    24 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X