Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem calling consensus sequence with Samtools mpileup?


    Hi all, I hope someone can help, have you come across anything like this yourself? I have several bacterial strain BWA genome assemblies. I am using Samtools to call the consensus and have done it twice using 2 different sets of options and come up against 2 problems:

    Program: samtools-0.1.18

    Command used to call FIRST consensus: samtools mpileup -q 30 -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
    (All FASTQ files converted to FASTA for phylogenetic analysis)

    PROBLEM 1: Core genome variable sites were found to have significant homoplasies (significant P values in Splitstree) which was not expected. On closer investigation of the sequence, it was found that in regions with 2 or more SNPs that were closely situated together in relation to the reference, the program had called the reference sequence instead of the SNPs, leading to multiple strains with the same sequence at these sites when in fact they were variants! The SNPs looked to be genuine (In IGV and the VCF file we produced for all sites, there is adequate read depth, mapping quality etc)

    To try to rectify this, we disabled BAQ in the next set of consensus calling:
    Program: samtools-0.1.18

    Command used to call SECOND consensus: samtools mpileup -q 30 -B -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
    (All FASTQ files converted to FASTA for phylogenetic analysis)

    PROBLEM2: Now we find that this seems to have rectified the problem where the reference sequence has been called instead of the variants (2 SNPS close together) , but instead, there are an increased number of uncalled bases. On investigation, these seem to be at sites where there may be some reads with a T, for example, and a few reads with an A (eg 2/8 reads are A, while the remaining reads are T, and the reference is T!). In some strains, it's calling 'N', in some, it's calling 'W' and in others it calls the 'T'! (But they all have varying levels of read depth and mapping quality)

    See results from the VCF file at one example site (618760) where the ref base is 'T':
    Strain 1 uncalled base 'N':
    REF 618760 . T . 54 . DP=10;VDB=0.0571;AF1=0;AC1=0;DP4=0,8,0,0;MQ=58;FQ=-51 PL 0[/FONT]
    Strain 2 ambiguous base 'W':
    REF 618760 . T . 72 . DP=19;VDB=0.0027;AF1=0;AC1=0;DP4=1,13,0,0;MQ=60;FQ=-69 PL 0
    Strain 3 called base 'T':
    REF 618760 . T . 96 . DP=25;VDB=0.0672;AF1=0;AC1=0;DP4=1,21,0,0;MQ=60;FQ=-93 PL 0

    Questions:

    a) What exactly is causing this variation in base calling among the strains in the second set of consensus sequences? (disabled BAQ or is it something I can tell from the VCF file?)
    b) What does VDB flag stand for? Can't find any information on it? (IF we remove it from the VCF file and recall the consensus, it will call the base, but we don't want to do this at all~25000 sites with a VDB flag in case it introduces error!)

    Any help. is very much appreciated, I just can't understand it, especially why it's having trouble calling the sites with mixed bases, when the read depth and mapping quality are good, and by looking at the assembly you can clearly see which base it should call!

    Best Wishes,

    Laura Spoor

  • #2
    Try rerunning it and include the -A option. I bet your variant reads are anomalous and hence not being utilized. Evidence for this is that:

    REF 618760 . T . 54 . DP=10;DP4=0,8,0,0;
    St
    REF 618760 . T . 72 . DP=19;DP4=1,13,0,0;

    REF 618760 . T . 96 . DP=25;DP4=1,21,0,0;

    The read counts in DP4 do not add up to the read counts in DP, and no variant alleles seem to be included.

    I have not been able to figure out what VDB means.

    Comment


    • #3
      It's normal for the DP4 and DP to not match up, as DP4 is reads that pass some filter, I forget what, maybe MQ?

      What's odd are the dots for the alt call. I've never seen that, and I don't know what its suppsosed to mean. Maybe if you posted the sam entries for all the reads crossing one of those points.

      My samtoools/bcftools made vcf says this about vdb:

      ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
      I'm guessing it means "Hey, this might be a misalignment due to a nearby SNP, and not a real variant"

      Comment


      • #4
        I think the dots in the VCF are classed as a missing values, which I think is because I've generated a VCF file for all sites, rather than just the variants, so it includes regions of invariant sites and also non-mapping regions.

        In the end, I found that invoking the -E option in samtools mpileup (this implements extended BAQ computation) increased the sensitivity so that SNPs that were close to each other were called, I guess according to the manual it will decrease specificity slightly but at least I'm not getting a load of false positive homoplasic sites!

        I haven't been able to find much information on variance distance bias, other than in the release notes for samtools-0.1.18 where it says:
        "Implemented variant distance bias as a new filter (by Petr Danecek)."

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Advancing Precision Medicine for Rare Diseases in Children
          by seqadmin




          Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
          12-16-2024, 07:57 AM
        • seqadmin
          Recent Advances in Sequencing Technologies
          by seqadmin



          Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

          Long-Read Sequencing
          Long-read sequencing has seen remarkable advancements,...
          12-02-2024, 01:49 PM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 12-17-2024, 10:28 AM
        0 responses
        23 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 12-13-2024, 08:24 AM
        0 responses
        42 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 12-12-2024, 07:41 AM
        0 responses
        28 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 12-11-2024, 07:45 AM
        0 responses
        42 views
        0 likes
        Last Post seqadmin  
        Working...
        X