View Single Post
Old 06-24-2012, 09:03 PM   #1
bw.
Member
 
Location: San Francisco, CA

Join Date: Mar 2012
Posts: 21
Default merging fastq running bwa gives different results from running bwa then merging sam.

I have an exome seq dataset with 2x100bp paired end reads.
CASAVA 1.8 generated multiple .fastq files for each pair:

L002_R1_001.fastq.gz
L002_R1_002.fastq.gz
L002_R1_003.fastq.gz

L002_R2_001.fastq.gz
L002_R2_002.fastq.gz
L002_R2_003.fastq.gz

I ran bwa-0.6.1 aln and sampe on these fastqs and got 3 .sam files which I then merged into one sam file using picard MergeSam.
Also, I tried first concatenating the fastqs (getting L002_R1.fastq.gz and L002_R2.fastq.gz) and running bwa on these.
I was surprised to find these 2 sam files differed. diff found that over 1% of the alignments were different. Some differed in the alignment location:

7230055,7230056c7230055,7230056
< 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779503 -628 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
< 70:C02K1ACXX:2:1307:10184:84189 163 8 99779503 29 64S29M7S = 99780032 628 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:29 RG:Z:1 XG:i:0
AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M
---
> 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779504 -627 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
> 70:C02K1ACXX:2:1307:10184:84189 163 8 99779504 29 65S28M7S = 99780032 627 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:28 RG:Z:1 XG:i:0
AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M


Some differed in the reported mapping quality (even though all other SAM fields were the same):

12417331c12417331
< 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 22 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
---
> 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 15 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R


I thought bwa mapped reads one at a time - why would adding more reads to the .fastq change the results for so many alignments?
Just asking to improve my understanding of bwa.

Thanks!
-Ben
bw. is offline   Reply With Quote