Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Odd behavior using samtools mpileup, bcftools

    I've tried this a few times and I'm getting very occassional / odd behavior in the output from mpileup / bcftools / vcfutils varFilter. The output reports an indel, but upon examination of the reads (with IGV), a single read contains the indel, while all the others do not, yet the vcf output file for my data reports:

    Chr1-gr2-PU5 662026 . ATGGAATGGTGGATTGG ATGGAATGGTGGATTGGAATGGTGGATTGG 99 . INDEL;DP=398;AF1=1;CI95=1,1;DP4=0,0,173,125;MQ=60 PL:GT:GQ 255,255,0:1/1:99

    One problem is that this output appears to state that there are 173, and 125 reads supporting the indel, when in fact there is 1. Other than manual inspection, I can't tell this apart from other true indels.

    I'm curious if this is due to a parameter in the command line? or if this is some kind of bug? (probably the former).

    The versions and command line parameters are as follows:
    This is a bacterial genome resequencing project with read depth > 300.
    Paired-end, 100 nt reads were aligned with default BWA parameters to a reference genome (BWA version 0.5.9-r16. (The genome contains some IUPAC codes, but none in the region of the problem I'm reporting).

    The resulting (sorted, indexed) bam file was used for input into samtools mpileup (0.1.12a)

    >samtools mpileup -uf ref.fasta aln.sorted.bam |bcftools view -bvcg - > aln.raw.bcf

    >bcftools view aln.raw.bcf | vcfutils.pl varFilter > aln.flt.vcf


    Thanks for any help !

    Jim

  • #2
    in the varfilter there is an option of specifying the minimum depth of reads that are aligned to be called as a SNP/InDel . I guess its the -D parameter.

    Comment


    • #3
      Between mpileup, and eyeballing IGV, I'd be more inclined to think that IGV is for some reason refusing to display the relevant reads, than that samtools is just making up a bunch of reads that really aren't there.

      Look at the .sam entries for that region. That's the sure way.
      Last edited by swbarnes2; 05-15-2013, 03:51 PM.

      Comment


      • #4
        I havea a similar problems with calling InDels

        I tried to see if I can get InDels by using some contigs from E.Coli that was generated by SGA assembler. In a way the contigs would serve as a single end NGS data with the sequences being much longer.

        I first aligned the contigs to reference e.coli using bwa-sw to obtain SAM/BAM file. The BAM file was then sorted and indexed.

        Then, here is what I did for pileup for processing for a small region of interest:

        samtools mpileup -r 0:4,000,679-5,000,894 -uf e.coli.fa contigalignsorted.bam > contigalignsorted.region.mpileup

        Then to obtain SNPs and InDels from mpileup file, here is what I did:

        bcftools view -bvcg contigalignsorted.region.mpileup > contigalignsorted.region.bcf

        bcftools view contigalignsorted.region.bcf > contigalignsorted.region.vcf


        However, when i see my alignment of BAM file on IGV for this region, I notice that the final VCF file obtained through commands above, has a lot of SNPs and InDels missing!


        What went wrong in the calls ?

        Comment


        • #5
          More importantly , it even shows up wrong SNPs which otherwise is not a SNP as per the bam file loaded on IGV.


          0 4009334 . A C 125 . DP=5;VDB=3.277706e-02;RPB=-1.291774e+00;AF1=0.5;AC1=1;DP4=1,1,2,1;MQ=59;FQ=74;PV4=1,0,1,0.38 GT:PL:GQ 0/1:155,0,101:99
          0 4009336 . T A 17.1 . DP=5;VDB=6.080000e-02;RPB=1.291774e+00;AF1=0.5;AC1=1;DP4=2,1,1,1;MQ=59;FQ=20.1;PV4=1,1,0.14,1 GT:PL:GQ 0/1:47,0,154:50

          As above 4009334 is a SNP but not as per IGV picture below:
          Attached Files

          Comment


          • #6
            IGV is not designed to visualize exactly what samtools mpileup is doing. They are different pieces of software, and they filter things differently.

            Likely, IGV is for some reason filtering away the relevant reads, or something else is different between the two pieces of software.

            Your vcf says there is a mixed C/A SNP, and your picture shows a mixed C/A SNP. Maybe there is a discrepancy in the references between the reference you aligned to for the vcf, and the reference you are looking at in IGV.

            Examine the raw read themselves, in the .bam. That's the way to be sure.
            Last edited by swbarnes2; 05-21-2013, 11:22 AM.

            Comment


            • #7
              @swbarnes2

              I did the indexing of my genome file which was used for alignment. I used the same genome to be uploaded to IGV and not ones already provided by IGV drop-down options. You can clearly see in the picture that IGV does not point out a SNP at location 4009334 while SAMtools/mpileup/bcftools extract it out! One got to be wrong, possibly the latter.

              Comment


              • #8
                Thanks @swbarnes2 for pointing out the difference between IGV and SAMtools. I am now using TVIEW and am able to see the SNP from my BAM file.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Advancing Precision Medicine for Rare Diseases in Children
                  by seqadmin




                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                  12-16-2024, 07:57 AM
                • seqadmin
                  Recent Advances in Sequencing Technologies
                  by seqadmin



                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                  Long-Read Sequencing
                  Long-read sequencing has seen remarkable advancements,...
                  12-02-2024, 01:49 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 12-17-2024, 10:28 AM
                0 responses
                22 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-13-2024, 08:24 AM
                0 responses
                42 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-12-2024, 07:41 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-11-2024, 07:45 AM
                0 responses
                42 views
                0 likes
                Last Post seqadmin  
                Working...
                X