Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Understanding BWA result

    Hi,

    I've done some testing with BWA against couple of small chromosomes. My test data is paired end data. I don't understand couple lines of my results. I made a small perl script to understand the flag field on the sam output.

    Code:
    #!/usr/bin/perl
    
    use warnings;
    use strict;
    
    my $dec = shift @ARGV;
    for (0 .. 11) {
      if ( $dec & 2**$_ ) {
        print $_+1," bit on (",2**$_,")\n"
      }
    }
    exit;
    That allows me to know what flags are on. I also added textual output on meaning of the fields (but it grows the code so left out here). Anyway I got following result rows in my test (cut the sequence and rest of columns out)

    Code:
    HWI-EAS418:1:1:0:1819#0 89      chr20   29366674        0       50M     =       29366674        0
    HWI-EAS418:1:1:0:1819#0 181     chr20   29366674        0       *       =       29366674        0
    Now running 89 through my script gives out that 1st, 4th, 5th and 7th bits are on. So it's 1st read of pair, on reverse strand with mate unmapped. Second row has 181 so 1st, 3rd, 5th, 6th and 7th bits are on. Now this means 2nd of a pair, strand of mate and query unmapped.

    My questien then. Does this mean that second pair didn't map anything and values given are actually first read of the pair?

  • #2
    Originally posted by Hena View Post
    Hi,

    I've done some testing with BWA against couple of small chromosomes. My test data is paired end data. I don't understand couple lines of my results. I made a small perl script to understand the flag field on the sam output.

    Code:
    #!/usr/bin/perl
    
    use warnings;
    use strict;
    
    my $dec = shift @ARGV;
    for (0 .. 11) {
      if ( $dec & 2**$_ ) {
        print $_+1," bit on (",2**$_,")\n"
      }
    }
    exit;
    That allows me to know what flags are on. I also added textual output on meaning of the fields (but it grows the code so left out here). Anyway I got following result rows in my test (cut the sequence and rest of columns out)

    Code:
    HWI-EAS418:1:1:0:1819#0 89      chr20   29366674        0       50M     =       29366674        0
    HWI-EAS418:1:1:0:1819#0 181     chr20   29366674        0       *       =       29366674        0
    Now running 89 through my script gives out that 1st, 4th, 5th and 7th bits are on. So it's 1st read of pair, on reverse strand with mate unmapped. Second row has 181 so 1st, 3rd, 5th, 6th and 7th bits are on. Now this means 2nd of a pair, strand of mate and query unmapped.

    My questien then. Does this mean that second pair didn't map anything and values given are actually first read of the pair?
    Get the latest samtools source and use the "samtools view -X" option and it will decode the flag field into a string representation.

    Comment


    • #3
      Originally posted by nilshomer View Post
      Get the latest samtools source and use the "samtools view -X" option and it will decode the flag field into a string representation.
      Umm ... is there is guide that would tell me what that string representation means? As the result of that was:
      Code:
      HWI-EAS418:1:1:0:1819#0 pUr1    chr20   29366674        0       50M     =       29366674        0
      HWI-EAS418:1:1:0:1819#0 purR2   chr20   29366674        0       *       =       29366674        0
      pUr1 and purR2 doesn't really tell me much more.

      Edit: Ok you can see the explanation with 'samtools view -?'.

      p=0x1 (paired), P=0x2 (properly paired),
      u=0x4 (unmapped), U=0x8 (mate unmapped),
      r=0x10 (reverse), R=0x20 (mate reverse)
      1=0x40 (first), 2=0x80 (second)

      So first is paired, mate unmapped, reverse and 1st in pair. Second is paired, unmapped, reverse, mate reverse and 2nd in pair.

      Then should I assume that the positions for second are not real, but just copies of the first position? If so why is the positions then written to second in the first place?
      Last edited by Hena; 11-10-2009, 03:16 AM.

      Comment


      • #4
        Originally posted by Hena View Post
        Umm ... is there is guide that would tell me what that string representation means? As the result of that was:
        Code:
        HWI-EAS418:1:1:0:1819#0 pUr1    chr20   29366674        0       50M     =       29366674        0
        HWI-EAS418:1:1:0:1819#0 purR2   chr20   29366674        0       *       =       29366674        0
        pUr1 and purR2 doesn't really tell me much more.

        Edit: Ok you can see the explanation with 'samtools view -?'.

        p=0x1 (paired), P=0x2 (properly paired),
        u=0x4 (unmapped), U=0x8 (mate unmapped),
        r=0x10 (reverse), R=0x20 (mate reverse)
        1=0x40 (first), 2=0x80 (second)

        So first is paired, mate unmapped, reverse and 1st in pair. Second is paired, unmapped, reverse, mate reverse and 2nd in pair.

        Then should I assume that the positions for second are not real, but just copies of the first position? If so why is the positions then written to second in the first place?
        This is for fast retrieval of the the unmapped mate of a mapped read. Consider if you want to realign the unmapped read using the mate end read for finding breakpoints etc. It is a common convention among people who use the SAM format.

        Comment


        • #5
          So then it essentially means in my example that the second read of the pair is not mapped and it's values are not to be used for that read (though could be used for the 1st read).

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Essential Discoveries and Tools in Epitranscriptomics
            by seqadmin




            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
            Yesterday, 07:01 AM
          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          55 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          52 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          45 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          55 views
          0 likes
          Last Post seqadmin  
          Working...
          X