Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA rescue of multi-mapping or unmapped reads

    In reading the BWA manual, I have come across a question that I still cannot answer. It reads:

    In paired-end alignment, BWA pairs all hits it found. It further performs Smith-Waterman alignment for unmapped reads with mates mapped to rescue mapped mates, and for high-quality anomalous pairs to fix potential alignment errors.
    So BWA will rescue mates presumably if there is a high-quality, uniquely mapping mate pair. My question is: what are the parameters that the unmapped mate have to meet to be considered correctly placed? For instance, if I have 2x100 reads, and one maps uniquely, does the other one have to map with the same stringency as I specified in the aln stage. Or if say, only 30nt of the unmapped read map, and the other 70nt do not (due to say, an inversion), will the unmapped read be mapped and the 70 non-mapping nucleotides just be soft-clipped in the BAM file?

    Any help would be greatly appreciated.

    Cheers,
    Kevin

  • #2
    In some cases, the source code is the best documentation. I assume you mean in "bwa sampe", so check out the "bwa_paired_sw" function. It looks like unmapped mates are rescued if the mapping quality of the other end is greater than SW_MIN_MAPQ (set to 17).

    Comment


    • #3
      I believe that I found what I was looking for. Per Nils' suggestion, I went to the source code.

      My primary query concerned the length of the match (of an unmapped read that was re-mapped using a Smith-Waterman alignment in the vicinity of a uniquely mapped read) that was required for BWA to place it.

      #define SW_MIN_MATCH_LEN 20
      #define SW_MIN_MAPQ 17
      ...
      bwa_cigar_t *bwa_sw_core(bwtint_t l_pac, const ubyte_t *pacseq, int len, const ubyte_t *seq, int64_t *beg, int reglen, int *n_cigar, uint32_t *_cnt)
      {
      ...
      // check whether there are too many N's
      if (reglen < SW_MIN_MATCH_LEN || (int64_t)l_pac - *beg < len) return 0;
      for (k = 0, x = 0; k < len; ++k)
      if (seq[k] >= 4) ++x;
      if ((float)x/len >= 0.25 || len - x < SW_MIN_MATCH_LEN) return 0;
      ...
      if (x < SW_MIN_MATCH_LEN || y < SW_MIN_MATCH_LEN) { // not good enough
      free(path); free(cigar); free(ref_seq);
      *n_cigar = 0;
      return 0;
      So basically, if there are more then 20 bases that match (in a Smith-Waterman re-alignment) the unmapped (or improperly mapped) mate will be rescued.

      Cheers.

      Comment


      • #4
        Thanks for the advice, Nils.

        Comment


        • #5
          It would be nice to have them as command line options to see what effect there is when varying these values.

          Comment


          • #6
            What do anomalous pairs mean. Samtools mpileup is removing such reads(when compared to seeing these in IGV) unless -A option is added but Im not sure what it means.

            Code:
            Samtools spec says
            Added `mpileup -A' to allow to use reads in anomalous pairs in SNP calling.

            Comment


            • #7
              Originally posted by nilshomer View Post
              It would be nice to have them as command line options to see what effect there is when varying these values.
              Indeed it would be!
              Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
              Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
              Projects: U87MG whole genome sequence [Website] [Paper]

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              24 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              25 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              22 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              52 views
              0 likes
              Last Post seqadmin  
              Working...
              X