Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bw.
    Member
    • Mar 2012
    • 21

    merging fastq running bwa gives different results from running bwa then merging sam.

    I have an exome seq dataset with 2x100bp paired end reads.
    CASAVA 1.8 generated multiple .fastq files for each pair:

    L002_R1_001.fastq.gz
    L002_R1_002.fastq.gz
    L002_R1_003.fastq.gz

    L002_R2_001.fastq.gz
    L002_R2_002.fastq.gz
    L002_R2_003.fastq.gz

    I ran bwa-0.6.1 aln and sampe on these fastqs and got 3 .sam files which I then merged into one sam file using picard MergeSam.
    Also, I tried first concatenating the fastqs (getting L002_R1.fastq.gz and L002_R2.fastq.gz) and running bwa on these.
    I was surprised to find these 2 sam files differed. diff found that over 1% of the alignments were different. Some differed in the alignment location:

    7230055,7230056c7230055,7230056
    < 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779503 -628 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
    TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
    RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
    < 70:C02K1ACXX:2:1307:10184:84189 163 8 99779503 29 64S29M7S = 99780032 628 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
    TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:29 RG:Z:1 XG:i:0
    AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M
    ---
    > 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779504 -627 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
    TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
    RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
    > 70:C02K1ACXX:2:1307:10184:84189 163 8 99779504 29 65S28M7S = 99780032 627 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
    TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:28 RG:Z:1 XG:i:0
    AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M


    Some differed in the reported mapping quality (even though all other SAM fields were the same):

    12417331c12417331
    < 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 22 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
    GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
    81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    ---
    > 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 15 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
    GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
    81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R


    I thought bwa mapped reads one at a time - why would adding more reads to the .fastq change the results for so many alignments?
    Just asking to improve my understanding of bwa.

    Thanks!
    -Ben
  • xied75
    Senior Member
    • Feb 2012
    • 129

    #2
    Ben,

    1, When a read has multiple equally good hits, BWA use drand48() function to choose one as the output.
    2, drand48() needs srand48() to init with a seed, BWA SAMPE always use a fixed number as the seed.
    3, if you call drand48() with the same seed, you always got the same sequence of numbers. e.g. call 1: 1,3,4,8,9,10,12...90,92; call 2: 1,3,4,8,9,10,12...90,92. etc
    4, for a BWA SAMPE run, srand48() is only called once.

    Thus in your first run, you call BWA SAMPE on 6 sai files in 3 pairs, so you can see srand48() is called 3 times. But if you cat fastq first, you call BWA SAMPE on 2 sai files, srand48() is only called once, the first 1/3 will be the same, then you'll start to see difference, because the rest 2/3 is getting numbers generated by drand48() in a 'brand new' way.

    It's something like:

    before cat:
    sampe call 1: 1,3,4,8,9,10,12...90,92,.....STOP
    sampe call 2: 1,3,4,8,9,10,12...90,92,.....STOP
    sampe call 3: 1,3,4,8,9,10,12...90,92,.....STOP

    after cat:
    sampe call 1: 1,3,4,8,9,10,12...90,92,.....(no longer STOP here and continues) ... 110,113,114,115,...

    If you use a graphical file compare tool, like winmerge on windows, you'll see it.

    Best,

    dong

    Comment

    Latest Articles

    Collapse

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by SEQadmin2, Today, 10:09 AM
    0 responses
    8 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, Yesterday, 08:59 AM
    0 responses
    14 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-02-2026, 12:03 PM
    0 responses
    23 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-02-2026, 11:40 AM
    0 responses
    20 views
    0 reactions
    Last Post SEQadmin2  
    Working...