Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • merging fastq running bwa gives different results from running bwa then merging sam.

    I have an exome seq dataset with 2x100bp paired end reads.
    CASAVA 1.8 generated multiple .fastq files for each pair:

    L002_R1_001.fastq.gz
    L002_R1_002.fastq.gz
    L002_R1_003.fastq.gz

    L002_R2_001.fastq.gz
    L002_R2_002.fastq.gz
    L002_R2_003.fastq.gz

    I ran bwa-0.6.1 aln and sampe on these fastqs and got 3 .sam files which I then merged into one sam file using picard MergeSam.
    Also, I tried first concatenating the fastqs (getting L002_R1.fastq.gz and L002_R2.fastq.gz) and running bwa on these.
    I was surprised to find these 2 sam files differed. diff found that over 1% of the alignments were different. Some differed in the alignment location:

    7230055,7230056c7230055,7230056
    < 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779503 -628 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
    TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
    RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
    < 70:C02K1ACXX:2:1307:10184:84189 163 8 99779503 29 64S29M7S = 99780032 628 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
    TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:29 RG:Z:1 XG:i:0
    AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M
    ---
    > 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779504 -627 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
    TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
    RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
    > 70:C02K1ACXX:2:1307:10184:84189 163 8 99779504 29 65S28M7S = 99780032 627 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
    TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:28 RG:Z:1 XG:i:0
    AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M


    Some differed in the reported mapping quality (even though all other SAM fields were the same):

    12417331c12417331
    < 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 22 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
    GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
    81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    ---
    > 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 15 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
    GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
    81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R


    I thought bwa mapped reads one at a time - why would adding more reads to the .fastq change the results for so many alignments?
    Just asking to improve my understanding of bwa.

    Thanks!
    -Ben

  • #2
    Ben,

    1, When a read has multiple equally good hits, BWA use drand48() function to choose one as the output.
    2, drand48() needs srand48() to init with a seed, BWA SAMPE always use a fixed number as the seed.
    3, if you call drand48() with the same seed, you always got the same sequence of numbers. e.g. call 1: 1,3,4,8,9,10,12...90,92; call 2: 1,3,4,8,9,10,12...90,92. etc
    4, for a BWA SAMPE run, srand48() is only called once.

    Thus in your first run, you call BWA SAMPE on 6 sai files in 3 pairs, so you can see srand48() is called 3 times. But if you cat fastq first, you call BWA SAMPE on 2 sai files, srand48() is only called once, the first 1/3 will be the same, then you'll start to see difference, because the rest 2/3 is getting numbers generated by drand48() in a 'brand new' way.

    It's something like:

    before cat:
    sampe call 1: 1,3,4,8,9,10,12...90,92,.....STOP
    sampe call 2: 1,3,4,8,9,10,12...90,92,.....STOP
    sampe call 3: 1,3,4,8,9,10,12...90,92,.....STOP

    after cat:
    sampe call 1: 1,3,4,8,9,10,12...90,92,.....(no longer STOP here and continues) ... 110,113,114,115,...

    If you use a graphical file compare tool, like winmerge on windows, you'll see it.

    Best,

    dong

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Strategies for Sequencing Challenging Samples
      by seqadmin


      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
      03-22-2024, 06:39 AM
    • seqadmin
      Techniques and Challenges in Conservation Genomics
      by seqadmin



      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

      Avian Conservation
      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
      03-08-2024, 10:41 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Yesterday, 06:37 PM
    0 responses
    11 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, Yesterday, 06:07 PM
    0 responses
    10 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-22-2024, 10:03 AM
    0 responses
    51 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 03-21-2024, 07:32 AM
    0 responses
    68 views
    0 likes
    Last Post seqadmin  
    Working...
    X