Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem calling consensus sequence with Samtools mpileup?


    Hi all, I hope someone can help, have you come across anything like this yourself? I have several bacterial strain BWA genome assemblies. I am using Samtools to call the consensus and have done it twice using 2 different sets of options and come up against 2 problems:

    Program: samtools-0.1.18

    Command used to call FIRST consensus: samtools mpileup -q 30 -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
    (All FASTQ files converted to FASTA for phylogenetic analysis)

    PROBLEM 1: Core genome variable sites were found to have significant homoplasies (significant P values in Splitstree) which was not expected. On closer investigation of the sequence, it was found that in regions with 2 or more SNPs that were closely situated together in relation to the reference, the program had called the reference sequence instead of the SNPs, leading to multiple strains with the same sequence at these sites when in fact they were variants! The SNPs looked to be genuine (In IGV and the VCF file we produced for all sites, there is adequate read depth, mapping quality etc)

    To try to rectify this, we disabled BAQ in the next set of consensus calling:
    Program: samtools-0.1.18

    Command used to call SECOND consensus: samtools mpileup -q 30 -B -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
    (All FASTQ files converted to FASTA for phylogenetic analysis)

    PROBLEM2: Now we find that this seems to have rectified the problem where the reference sequence has been called instead of the variants (2 SNPS close together) , but instead, there are an increased number of uncalled bases. On investigation, these seem to be at sites where there may be some reads with a T, for example, and a few reads with an A (eg 2/8 reads are A, while the remaining reads are T, and the reference is T!). In some strains, it's calling 'N', in some, it's calling 'W' and in others it calls the 'T'! (But they all have varying levels of read depth and mapping quality)

    See results from the VCF file at one example site (618760) where the ref base is 'T':
    Strain 1 uncalled base 'N':
    REF 618760 . T . 54 . DP=10;VDB=0.0571;AF1=0;AC1=0;DP4=0,8,0,0;MQ=58;FQ=-51 PL 0[/FONT]
    Strain 2 ambiguous base 'W':
    REF 618760 . T . 72 . DP=19;VDB=0.0027;AF1=0;AC1=0;DP4=1,13,0,0;MQ=60;FQ=-69 PL 0
    Strain 3 called base 'T':
    REF 618760 . T . 96 . DP=25;VDB=0.0672;AF1=0;AC1=0;DP4=1,21,0,0;MQ=60;FQ=-93 PL 0

    Questions:

    a) What exactly is causing this variation in base calling among the strains in the second set of consensus sequences? (disabled BAQ or is it something I can tell from the VCF file?)
    b) What does VDB flag stand for? Can't find any information on it? (IF we remove it from the VCF file and recall the consensus, it will call the base, but we don't want to do this at all~25000 sites with a VDB flag in case it introduces error!)

    Any help. is very much appreciated, I just can't understand it, especially why it's having trouble calling the sites with mixed bases, when the read depth and mapping quality are good, and by looking at the assembly you can clearly see which base it should call!

    Best Wishes,

    Laura Spoor

  • #2
    Try rerunning it and include the -A option. I bet your variant reads are anomalous and hence not being utilized. Evidence for this is that:

    REF 618760 . T . 54 . DP=10;DP4=0,8,0,0;
    St
    REF 618760 . T . 72 . DP=19;DP4=1,13,0,0;

    REF 618760 . T . 96 . DP=25;DP4=1,21,0,0;

    The read counts in DP4 do not add up to the read counts in DP, and no variant alleles seem to be included.

    I have not been able to figure out what VDB means.

    Comment


    • #3
      It's normal for the DP4 and DP to not match up, as DP4 is reads that pass some filter, I forget what, maybe MQ?

      What's odd are the dots for the alt call. I've never seen that, and I don't know what its suppsosed to mean. Maybe if you posted the sam entries for all the reads crossing one of those points.

      My samtoools/bcftools made vcf says this about vdb:

      ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
      I'm guessing it means "Hey, this might be a misalignment due to a nearby SNP, and not a real variant"

      Comment


      • #4
        I think the dots in the VCF are classed as a missing values, which I think is because I've generated a VCF file for all sites, rather than just the variants, so it includes regions of invariant sites and also non-mapping regions.

        In the end, I found that invoking the -E option in samtools mpileup (this implements extended BAQ computation) increased the sensitivity so that SNPs that were close to each other were called, I guess according to the manual it will decrease specificity slightly but at least I'm not getting a load of false positive homoplasic sites!

        I haven't been able to find much information on variance distance bias, other than in the release notes for samtools-0.1.18 where it says:
        "Implemented variant distance bias as a new filter (by Petr Danecek)."

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        27 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        30 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        26 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        52 views
        0 likes
        Last Post seqadmin  
        Working...
        X