  • Bowtie vs BWA: only 50% overlapping SNPs

    Hi everyone,

    I've run bowtie and bwa (mostly with default parameters) on the same dataset and provisionally called SNPs with samtools pileup on each alignment (without removing duplicates, doing local realignment, etc.).
    The results surprised me, so I'm wondering whether this is what should be expected.

    Basically, the SNPs found in the bowtie and bwa alignments overlap by only ~50%. There are fewer bwa-only SNPs than bowtie-only SNPs, but still a significant number. In terms of samtools consensus scores (within a reasonable coverage range of 5-25), SNPs detected by both aligners tend to have the highest scores, followed rather closely by bwa-only and then bowtie-only SNPs.
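For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how the overlap between two call sets could be tallied. It assumes each SNP has already been reduced to a `(chromosome, position, alternate base)` tuple; the function name `compare_calls` and the toy calls are made up for illustration and are not taken from the data discussed here. Note that "50% overlap" depends on the denominator, so the sketch reports the shared fraction of each set as well as of the union.

```python
def compare_calls(calls_a, calls_b):
    """Tally shared and caller-specific SNPs between two call sets.

    Each call is assumed to be a hashable tuple such as
    (chromosome, position, alternate_base). This layout is an
    assumption for the sketch, not a samtools output format.
    """
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Shared calls as a fraction of each set and of the union.
        "frac_of_a": len(shared) / len(a) if a else 0.0,
        "frac_of_b": len(shared) / len(b) if b else 0.0,
        "frac_of_union": len(shared) / len(union) if union else 0.0,
    }


# Toy, made-up calls purely to show the shape of the result.
bowtie_calls = [("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")]
bwa_calls = [("chr1", 100, "A"), ("chr2", 50, "G"), ("chr2", 75, "C")]
summary = compare_calls(bowtie_calls, bwa_calls)
print(summary["shared"], summary["only_a"], summary["only_b"])  # 2 1 1
```

With these toy inputs the shared calls are two thirds of each set but only half of the union, which is why it helps to state which fraction is being quoted.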

    I'm quite surprised that the two alignments produced by essentially similar algorithms differ so much. Is this to be expected? What would be the best strategy for dealing with this discrepancy? Just trust bwa since it can detect indels and forget about bowtie? Or focus only on SNPs that are found in both bowtie and bwa alignments? Or maybe this discrepancy indicates that there's some problem with the initial dataset in the first place?

    Will greatly appreciate your thoughts/experience on this...

  • #2
    I would trust the bwa results over bowtie, as bowtie does not do indels. Nearby indels can distort SNP calls when the aligner cannot place a gap. Have a look at the 1000 Genomes Project and what they found.
    If you're using human data, you could compare your SNPs to dbSNP for overlap with SNVs and microindels. With real data it's always going to be hard to separate out the true positives. Set fairly strict filters for best results.
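As a rough illustration of what "filters" could look like in practice, here is a minimal sketch. The record type `SnpCall`, its field names, and the quality threshold of 20 are all assumptions made up for the example; only the 5-25 coverage range comes from the first post. Sensible thresholds depend on the dataset.

```python
from typing import NamedTuple


class SnpCall(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    chrom: str
    pos: int
    depth: int        # number of reads covering the site
    snp_quality: int  # Phred-scaled score reported by the caller


def passes_filters(call, min_depth=5, max_depth=25, min_snp_quality=20):
    """Keep calls inside the depth window and at or above the quality floor.

    The depth window mirrors the 5-25 range mentioned in the thread;
    the quality floor of 20 is an arbitrary placeholder.
    """
    return (min_depth <= call.depth <= max_depth
            and call.snp_quality >= min_snp_quality)


# Made-up calls to show the filter in use.
calls = [
    SnpCall("chr1", 100, depth=12, snp_quality=45),   # kept
    SnpCall("chr1", 200, depth=3, snp_quality=60),    # too shallow
    SnpCall("chr1", 300, depth=80, snp_quality=60),   # suspiciously deep
    SnpCall("chr1", 400, depth=15, snp_quality=10),   # low quality
]
kept = [c for c in calls if passes_filters(c)]
print([c.pos for c in kept])  # [100]
```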

    Also take a look at another aligner, novoalign, which does a full Needleman-Wunsch alignment across the short read and can greatly improve your SNP calls.

    • #3
      Thanks for your reply!

      • #4
        Also consider dindel or SRMA, which perform localized realignment (basically, multiple alignment) to deal with this issue.

        On paper, this should be superior to Needleman-Wunsch (or Smith-Waterman) because there are ambiguities in how a read should be aligned, which can cause issues. For example, imagine the following alignments to a reference (with the reference shown as the first line):

        Code:
        GATCAAAAGATC
        GATCAAA-GATC
        GATCAA-AGATC
        GATCA-AAGATC
        GATC-AAAGATC
        By the rules of pairwise alignment, all of the gapped alignments are equally valid. But a multiple aligner should enforce consistent gap placement across reads (the gap is still equally valid in any location, so one must be picked arbitrarily).
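The ambiguity in the example above can be checked with a small sketch. It only handles a single-base deletion, and the helper names `equivalent_deletion_positions` and `left_align` are invented for this illustration; picking the leftmost position is one common arbitrary convention, not necessarily what dindel or SRMA do internally.

```python
def equivalent_deletion_positions(ref, read):
    """Return every 0-based reference index whose deletion yields `read`.

    Only single-base deletions are considered in this sketch.
    """
    return [i for i in range(len(ref)) if ref[:i] + ref[i + 1:] == read]


def left_align(ref, read):
    """Render the read with its gap at the leftmost equivalent position."""
    positions = equivalent_deletion_positions(ref, read)
    if not positions:
        raise ValueError("read is not a single-base deletion of ref")
    leftmost = min(positions)
    return read[:leftmost] + "-" + read[leftmost:]


ref = "GATCAAAAGATC"
read = "GATCAAAGATC"  # one A missing from the run of four
print(equivalent_deletion_positions(ref, read))  # [4, 5, 6, 7]
print(left_align(ref, read))                     # GATC-AAAGATC
```

All four positions in the A run give the same read, which matches the four gapped alignments listed above. Applying the same placement rule to every read is what keeps the gap consistent across a pileup.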

        Real data gets even messier. You can imagine how some noise or a SNP might throw alignment off a bit more and cause further ambiguity.

        I haven't tried SRMA yet, but dindel's output is filled with estimates of which polynucleotide runs are real and which are more likely artifactual. Be forewarned that this information doesn't come cheap; dindel uses a lot of CPU (as you might expect).

        • #5
          Thanks for your answer! I haven't heard of dindel, but I'm indeed trying SRMA and GATK indel realigner post-hoc. Will read up on dindel.
