SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
merging fastq files shilo Illumina/Solexa 8 07-06-2016 01:15 PM
error running BWA honey Bioinformatics 0 06-01-2012 08:44 PM
Problem running BWA sampe cshowell Bioinformatics 15 04-13-2012 05:51 AM
merging 2 sam files papori Bioinformatics 0 07-29-2011 04:44 AM
merging bwa alignments arnkas Genomic Resequencing 5 07-07-2010 07:26 AM

Reply
 
Thread Tools
Old 06-24-2012, 09:03 PM   #1
bw.
Member
 
Location: San Francisco, CA

Join Date: Mar 2012
Posts: 21
Default merging fastq running bwa gives different results from running bwa then merging sam.

I have an exome seq dataset with 2x100bp paired end reads.
CASAVA 1.8 generated multiple .fastq files for each pair:

L002_R1_001.fastq.gz
L002_R1_002.fastq.gz
L002_R1_003.fastq.gz

L002_R2_001.fastq.gz
L002_R2_002.fastq.gz
L002_R2_003.fastq.gz

I ran bwa-0.6.1 aln and sampe on these fastqs and got 3 .sam files which I then merged into one sam file using picard MergeSam.
Also, I tried first concatenating the fastqs (getting L002_R1.fastq.gz and L002_R2.fastq.gz) and running bwa on these.
I was surprised to find these 2 sam files differed. diff found that over 1% of the alignments were different. Some differed in the alignment location:

7230055,7230056c7230055,7230056
< 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779503 -628 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>[email protected]???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*[email protected]*FC<H<@[email protected]<CC<G?>>FDFD;[email protected]@? X0:i:1 X1:i:0 MD:Z:100
RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
< 70:C02K1ACXX:2:1307:10184:84189 163 8 99779503 29 64S29M7S = 99780032 628 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:29 RG:Z:1 XG:i:0
AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M
---
> 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779504 -627 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>[email protected]???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*[email protected]*FC<H<@[email protected]<CC<G?>>FDFD;[email protected]@? X0:i:1 X1:i:0 MD:Z:100
RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
> 70:C02K1ACXX:2:1307:10184:84189 163 8 99779504 29 65S28M7S = 99780032 627 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:28 RG:Z:1 XG:i:0
AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M


Some differed in the reported mapping quality (even though all other SAM fields were the same):

12417331c12417331
< 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 22 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA [email protected]@@:33A:,853'96<[email protected]@IGIHEHFBC=<[email protected]>[email protected]?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>[email protected] X0:i:2 X1:i:0 XA:Z:10,-1207949
81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
---
> 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 15 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA [email protected]@@:33A:,853'96<[email protected]@IGIHEHFBC=<[email protected]>[email protected]?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>[email protected] X0:i:2 X1:i:0 XA:Z:10,-1207949
81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R


I thought bwa mapped reads one at a time - why would adding more reads to the .fastq change the results for so many alignments?
Just asking to improve my understanding of bwa.

Thanks!
-Ben
bw. is offline   Reply With Quote
Old 06-25-2012, 03:08 AM   #2
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129
Default

Ben,

1, When a read has multiple equally good hits, BWA use drand48() function to choose one as the output.
2, drand48() needs srand48() to init with a seed, BWA SAMPE always use a fixed number as the seed.
3, if you call drand48() with the same seed, you always got the same sequence of numbers. e.g. call 1: 1,3,4,8,9,10,12...90,92; call 2: 1,3,4,8,9,10,12...90,92. etc
4, for a BWA SAMPE run, srand48() is only called once.

Thus in your first run, you call BWA SAMPE on 6 sai files in 3 pairs, so you can see srand48() is called 3 times. But if you cat fastq first, you call BWA SAMPE on 2 sai files, srand48() is only called once, the first 1/3 will be the same, then you'll start to see difference, because the rest 2/3 is getting numbers generated by drand48() in a 'brand new' way.

It's something like:

before cat:
sampe call 1: 1,3,4,8,9,10,12...90,92,.....STOP
sampe call 2: 1,3,4,8,9,10,12...90,92,.....STOP
sampe call 3: 1,3,4,8,9,10,12...90,92,.....STOP

after cat:
sampe call 1: 1,3,4,8,9,10,12...90,92,.....(no longer STOP here and continues) ... 110,113,114,115,...

If you use a graphical file compare tool, like winmerge on windows, you'll see it.

Best,

dong
xied75 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:26 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO