SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   merging fastq running bwa gives different results from running bwa then merging sam. (http://seqanswers.com/forums/showthread.php?t=21112)

bw. 06-24-2012 09:03 PM

merging fastq running bwa gives different results from running bwa then merging sam.
 
I have an exome seq dataset with 2x100bp paired end reads.
CASAVA 1.8 generated multiple .fastq files for each pair:

L002_R1_001.fastq.gz
L002_R1_002.fastq.gz
L002_R1_003.fastq.gz

L002_R2_001.fastq.gz
L002_R2_002.fastq.gz
L002_R2_003.fastq.gz

I ran bwa-0.6.1 aln and sampe on these fastqs and got 3 .sam files which I then merged into one sam file using picard MergeSam.
Also, I tried first concatenating the fastqs (getting L002_R1.fastq.gz and L002_R2.fastq.gz) and running bwa on these.
I was surprised to find these 2 sam files differed. diff found that over 1% of the alignments were different. Some differed in the alignment location:

7230055,7230056c7230055,7230056
< 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779503 -628 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
< 70:C02K1ACXX:2:1307:10184:84189 163 8 99779503 29 64S29M7S = 99780032 628 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:29 RG:Z:1 XG:i:0
AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M
---
> 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779504 -627 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
> 70:C02K1ACXX:2:1307:10184:84189 163 8 99779504 29 65S28M7S = 99780032 627 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:28 RG:Z:1 XG:i:0
AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M


Some differed in the reported mapping quality (even though all other SAM fields were the same):

12417331c12417331
< 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 22 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
---
> 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 15 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R


I thought bwa mapped reads one at a time - why would adding more reads to the .fastq change the results for so many alignments?
Just asking to improve my understanding of bwa.

Thanks!
-Ben

xied75 06-25-2012 03:08 AM

Ben,

1, When a read has multiple equally good hits, BWA use drand48() function to choose one as the output.
2, drand48() needs srand48() to init with a seed, BWA SAMPE always use a fixed number as the seed.
3, if you call drand48() with the same seed, you always got the same sequence of numbers. e.g. call 1: 1,3,4,8,9,10,12...90,92; call 2: 1,3,4,8,9,10,12...90,92. etc
4, for a BWA SAMPE run, srand48() is only called once.

Thus in your first run, you call BWA SAMPE on 6 sai files in 3 pairs, so you can see srand48() is called 3 times. But if you cat fastq first, you call BWA SAMPE on 2 sai files, srand48() is only called once, the first 1/3 will be the same, then you'll start to see difference, because the rest 2/3 is getting numbers generated by drand48() in a 'brand new' way.

It's something like:

before cat:
sampe call 1: 1,3,4,8,9,10,12...90,92,.....STOP
sampe call 2: 1,3,4,8,9,10,12...90,92,.....STOP
sampe call 3: 1,3,4,8,9,10,12...90,92,.....STOP

after cat:
sampe call 1: 1,3,4,8,9,10,12...90,92,.....(no longer STOP here and continues) ... 110,113,114,115,...

If you use a graphical file compare tool, like winmerge on windows, you'll see it.

Best,

dong


All times are GMT -8. The time now is 11:56 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.