Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA samflag inconsistencies

    Dear bioinformatics community,

    I have used BWA to align paired-end reads to a mammalian genome. I extracted orphan reads from the alignment with this command:

    awk 'and ($2, 0x0004) && ! and ($2, 0x0008) && $3!~/Un/' alignment.sam > orphans.sam

    I have noticed some apparent errors in the output, for example:

    HWUSI-EAS091C_0049:8:10:13851:9893#0 167 27 39960072 29 11S64M 28 167 0 TGTGAATGTAGTGCCCATGAAGGTGGATGCAGAGATCCGGCAGTCGTTGCAGTGCACCCTGCTGTGTCCCCGTAT IIHIIIIIIIIGIIIIIIHIIIIFIIHIIIHHIGIIIIIFIIHHHIIIFHHIGIIIHHIGHGGEEHGGIHGGDEE XT:A:M NM:i:0 SM:i:29 AM:i:29 XM:i:0 XO:i:0 XG:i:0 MD:Z:64

    Using this tool to interpret the samflag http://picard.sourceforge.net/explain-flags.html, it seems that this read is:

    Paired
    Proper pair
    Read unmapped
    Mate reverse strand
    Second in pair

    -how can the read be in a proper pair, yet unmapped?
    -how can the read have a mapping location and a non-zero mapping quality if it is unmapped?

    Another example:

    HWUSI-EAS091C_0049:8:17:3987:3357#0 117 6 84719005 37 9M1D66M Un0684.1-12705 6795 0 ATGATGATGTGGTGATGATGATGATGATGATGATTATGATGATGATGATGATGATGATGGTGATGATGATGATGG IIDIIIGIIG<IHIHHHIEIIGIHGHIIIHDIHFIIIIIIGIIHIIHHIIDIIIIGIIIIIIGIHIIIIIIIIII XT:A:U NM:i:4 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:9^A25G6G32A0

    Paired
    Unmapped
    Read reverse
    Mate reverse
    First in pair

    -how can it be ‘mapped to reverse strand’ yet unmapped?
    -how can it have a mapping location and a non-zero mapping quality if it unmapped?


    On a previous occasion with the same data, I encountered errors when adding RG headers with Picard:

    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 2096393, Read name HWUSI-EAS091C_0049:1:29:16923:17016#0, MAPQ should be 0 for unmapped read.

    There were countless of these errors, and I was forced to add the flag "VALIDATION_STRINGENCY=LENIENT" to the command line (a helpful tip from this site) to complete task.

    Why is BWA giving unmapped reads non-zero mapping qualities?

    Any help here would be much appreciated, because other than this I am rather fond of the speed and accuracy of BWA compared to other aligners so it would be nice to be able to correct this issue.

    Thanks in advance,

    Tally

  • #2
    Originally posted by Tally View Post
    Using this tool to interpret the samflag http://picard.sourceforge.net/explain-flags.html, it seems that this read is:

    Paired
    Proper pair
    Read unmapped
    Mate reverse strand
    Second in pair

    -how can the read be in a proper pair, yet unmapped?
    -how can the read have a mapping location and a non-zero mapping quality if it is unmapped?
    Technically the specification does say "If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10 and 0x100 and the bit 0x20 of the next segment in the template." i.e. If the read is unmapped, those other flags should be ignored.

    Thus BWA is within the specification, but nevertheless the output is confusing and arguably a bug.

    Originally posted by Tally View Post
    On a previous occasion with the same data, I encountered errors when adding RG headers with Picard:

    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 2096393, Read name HWUSI-EAS091C_0049:1:29:16923:17016#0, MAPQ should be 0 for unmapped read.
    Again same bit of the specification, "If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10 and 0x100 and the bit 0x20 of the next segment in the template." i.e. If the read is unmapped, then MAPQ should be ignored.

    Again, BWA is within the specification, but it would be cleaner to use MAPQ of zero which is what Picard checks for with its default stringency.

    So again, BWA is within the specification, but nevertheless the output is confusing and arguably a bug.
    Last edited by maubp; 11-25-2011, 03:35 AM. Reason: Added links to the spec

    Comment


    • #3
      On further reflection, you could argue Picard's http://picard.sourceforge.net/explain-flags.html is broken as it should ignore these bits when 0x4 is set. Perhaps it would be more helpful to put "(irrelevant)" or something next to them as that would be more useful.

      Comment


      • #4
        To clairfy a bit, bwa concatantes reference sequences, so if a read hangs off of one sequence onto another, it will be both given the right mapping coordinates, and give then 4 flag.

        Second, sam specs call for the unmapped mate of a mapped read to be given the mapping position of the mapped mate. This is so that the unmapped read will sort next to its mapped mate. But that's another source of reads that have the 4 flagged, and have mapping coordinates.

        So this is why you have reads like that.

        Comment


        • #5
          Originally posted by swbarnes2 View Post
          Second, sam specs call for the unmapped mate of a mapped read to be given the mapping position of the mapped mate. This is so that the unmapped read will sort next to its mapped mate. But that's another source of reads that have the 4 flagged, and have mapping coordinates
          Thanks - I missed that question in the original post.

          Comment


          • #6
            Thanks for the responses, that is very helpful. That certainly makes sense to give unmapped mates the mapped mate coordinates for sorting purposes, however it is confusing that BWA sets other fields when the read is unmapped. Thank you for explaining to ignore those fields, as I missed that in the SAM specification.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            24 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            25 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            21 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Working...
            X