Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA insert size calculation in paired end sequencing

    Hi,
    I have a question about the "inferred insert size" field in the sam/bam files produced by BWA (this field is referred to as the "TLEN" field in the sam format spec).
    It seems like BWA computes two different things depending on the extent to which two reads from a mate pair overlap. To illustrate, I'll
    use two examples from 75bp PE RNA-seq data mapped to the genome with BWA version 0.5.9-r16. I've removed the fields after the TLEN field for clarity and
    I use "insert size" to refer to the length of sequence between the adaptors.

    Example 1:
    HS4_07580:1:2304:6332:83264#1 99 1 1276418 60 75M = 1276491 148

    Example 2:
    HS4_07580:1:2103:1916:45225#1 97 1 793910 37 75M = 793838 3

    1. Example 1 - inferred insert size is = (rightmost coordinate + read length) - leftmost coordinate : 1276491+75-1276418 = 148
    This one makes sense to me, as we get the total length of the sequence between the two adaptors, barring any indels etc. This seems to be the behaviour
    when mates overlap by <= 2bp or don't overlap at all.

    2. In example 2, the inferred insert size is: (leftmost coordinate + read length) - rightmost coordinate : 793838+75-793910 = 3
    That is, it gives the overlap of the two reads, which is not the same thing as the insert size. To me it seems like it should be
    (793910 + 75) - 793838 = 147
    This seems to be the behaviour when mates overlap by >=3bp

    Perhaps this is not so bad, because you can compute the correct insert size yourself. However, BWA also calls relatively large inserts that overlap by a small amount as "improper pairs" (i.e. defined, I think, as falling ~3sd from the mean of the empirical distribution of insert sizes that BWA computes during mapping). So, example 1 gets flagged as a "proper pair" because it has an insert size of 148bp, while example 2 gets flagged as an "improper pair" even though the (true?) insert size is 147bp. Even worse, as the overlap between two mate pairs increases they start getting flagged as "proper" again, even though,depending on your experiment, these smaller inserts are exactly the ones you might want to remove. So, for example:

    HS4_07580:1:1205:13102:38186#1 163 1 569368 46 75M = 569371 78

    This mate pair is flagged as proper even though its true length is 78bp i.e. 69bp shorter than the mate pair in example 2 above

    My questions are:

    Is there a sensible reason for this behaviour?
    Is this behaviour been documented anywhere (I can't find anything on the BWA manual page)?
    Has anybody else noticed this?

    Sorry if this is old news and has already been resolved on here.
    Thanks for any and all help
    Dan
    Last edited by dg13; 05-03-2012, 06:28 AM. Reason: adding tags, correct >=, typos

  • #2
    In the 2nd example, the orientation is wrong.

    Comment


    • #3
      Hi Heng,
      Thanks for your quick reply. I'll expand it a little for anybody else who is confused (I know I had to think a bit before I understood). There are two important points - first, we map to the + strand of the reference genome and second, sequencing is always in the 5'--->3' direction. In example 2, the read mapping to the + strand of the reference (in this case the first of the pair) is located at a higher genomic position (793910) than the read mapping to the - strand of the reference (793838). By definition, this should never happen as, if PE sequencing works as it is supposed to, we should always have one read from the + strand (reading 5'--->3') and one from the - strand (again 5'--->3'). The selection of reference strand is arbitrary, so the + strand read can be sequenced first or second, but by definition the start of the + strand read will always map to a lower genomic position than the - strand read because, in correctly oriented reads sequencing is always in the 5' -> 3' direction i.e.


      R1----->

      + 5 ---------------- 3
      - 3 ---------------- 5
      <----- R2

      or

      R2----->

      + 5 ---------------- 3
      - 3 ---------------- 5
      <----- R1


      In example 2 of the first post the + strand read maps to a higher genomic position in reference than the - strand read. Its not clear to me how this may happen, but perhaps if the reads are in a parallel orientation rather than opposing its possible. In these cases, BWA computes the overlap rather then the insert size, and flags examples like this as improper pairs. However, there are other categories of error which also get flagged as improper, for example insert sizes greater than 3sds (I think this is correct - Heng?). Usually flagging these large inserts is the right thing to do as library prep involves a size selection step, so a large "insert" indicates poor mapping of one or both reads. In the special case of mapping PE RNA-seq reads to the genome (which you might want to do to identify novel transcripts etc) you may easily get apparent large "inserts" (due to introns between two exons) in legitimate read pairs which you want to keep, so care is warranted when using the improper pair filter in QC. I'm sure this is redundant info for many of you, but I thought I'd post for any who are curious.
      Last edited by dg13; 05-16-2012, 03:04 AM.

      Comment


      • #4
        hi dg13
        Which commands did you use to get the example 1.
        I am troubling to find my real insert size in RNA seq data.
        Expecting your reply.
        Thank you.
        Last edited by jp.; 08-08-2013, 01:06 AM.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM
        • seqadmin
          Techniques and Challenges in Conservation Genomics
          by seqadmin



          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

          Avian Conservation
          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
          03-08-2024, 10:41 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 06:37 PM
        0 responses
        10 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 06:07 PM
        0 responses
        9 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-22-2024, 10:03 AM
        0 responses
        50 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-21-2024, 07:32 AM
        0 responses
        67 views
        0 likes
        Last Post seqadmin  
        Working...
        X