Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • merging fastq running bwa gives different results from running bwa then merging sam.

    I have an exome seq dataset with 2x100bp paired end reads.
    CASAVA 1.8 generated multiple .fastq files for each pair:

    L002_R1_001.fastq.gz
    L002_R1_002.fastq.gz
    L002_R1_003.fastq.gz

    L002_R2_001.fastq.gz
    L002_R2_002.fastq.gz
    L002_R2_003.fastq.gz

    I ran bwa-0.6.1 aln and sampe on these fastqs and got 3 .sam files which I then merged into one sam file using picard MergeSam.
    Also, I tried first concatenating the fastqs (getting L002_R1.fastq.gz and L002_R2.fastq.gz) and running bwa on these.
    I was surprised to find these 2 sam files differed. diff found that over 1% of the alignments were different. Some differed in the alignment location:

    7230055,7230056c7230055,7230056
    < 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779503 -628 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
    TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
    RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
    < 70:C02K1ACXX:2:1307:10184:84189 163 8 99779503 29 64S29M7S = 99780032 628 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
    TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:29 RG:Z:1 XG:i:0
    AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M
    ---
    > 70:C02K1ACXX:2:1307:10184:84189 83 8 99780032 29 100M = 99779504 -627 TAAAACCCTCAGCTGGGGACCATATTAACATCAAAGTTAGAGAAGCAAGGAA
    TCTGCAGCCATCTGCTTCTACCTGATTTACAGATCCAAGGTCTCTTTT ########A>;>5@???=.3;C===.4D=44CGEEEFB.<FAIGCFF<EFGFB?DF=GD?:*9@*FC<H<@HFF@IHFDC<CC<G?>>FDFD;D7DD@@? X0:i:1 X1:i:0 MD:Z:100
    RG:Z:1 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:U
    > 70:C02K1ACXX:2:1307:10184:84189 163 8 99779504 29 65S28M7S = 99780032 627 AGGGCAAGCAAAGAAATAAAATCTCTTCATACATGTCACATTGC
    TGATGTATGGAAACTTATTTGATTATTTCCTGAAGATCTGATTCAACAGTAACCTG @1?++BBDFB?+:A+2A+4A?FD,2+A2A####################################################################### MD:Z:28 RG:Z:1 XG:i:0
    AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 MQ:i:29 XT:A:M


    Some differed in the reported mapping quality (even though all other SAM fields were the same):

    12417331c12417331
    < 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 22 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
    GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
    81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R
    ---
    > 70:C02K1ACXX:2:2206:18974:136356 147 2 70498808 15 100M = 70498555 -353 AAATCTAAAGAAAACCCCACCGTATCCAATGTTTGTAATGTTTACTTCAAGT
    GAATCTAAAAAATTTCTAAATTTAGTTTATGGTAAATACCTCAAAATA C@@@:33A:,853'96<?7=?CEC=@DHDC@IGIHEHFBC=<GG@F>E9??@?<HGJHHDDIHCDAGHDHCEBBGGHEGH>DAEFHFB>F8C=DDD=17@ X0:i:2 X1:i:0 XA:Z:10,-1207949
    81,100M,0; MD:Z:100 XG:i:0 AM:i:0 NM:i:0 SM:i:0 XM:i:0 XO:i:0 XT:A:R


    I thought bwa mapped reads one at a time - why would adding more reads to the .fastq change the results for so many alignments?
    Just asking to improve my understanding of bwa.

    Thanks!
    -Ben

  • #2
    Ben,

    1, When a read has multiple equally good hits, BWA use drand48() function to choose one as the output.
    2, drand48() needs srand48() to init with a seed, BWA SAMPE always use a fixed number as the seed.
    3, if you call drand48() with the same seed, you always got the same sequence of numbers. e.g. call 1: 1,3,4,8,9,10,12...90,92; call 2: 1,3,4,8,9,10,12...90,92. etc
    4, for a BWA SAMPE run, srand48() is only called once.

    Thus in your first run, you call BWA SAMPE on 6 sai files in 3 pairs, so you can see srand48() is called 3 times. But if you cat fastq first, you call BWA SAMPE on 2 sai files, srand48() is only called once, the first 1/3 will be the same, then you'll start to see difference, because the rest 2/3 is getting numbers generated by drand48() in a 'brand new' way.

    It's something like:

    before cat:
    sampe call 1: 1,3,4,8,9,10,12...90,92,.....STOP
    sampe call 2: 1,3,4,8,9,10,12...90,92,.....STOP
    sampe call 3: 1,3,4,8,9,10,12...90,92,.....STOP

    after cat:
    sampe call 1: 1,3,4,8,9,10,12...90,92,.....(no longer STOP here and continues) ... 110,113,114,115,...

    If you use a graphical file compare tool, like winmerge on windows, you'll see it.

    Best,

    dong

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin


      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
      Yesterday, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    39 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    41 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    35 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-04-2024, 09:00 AM
    0 responses
    55 views
    0 likes
    Last Post seqadmin  
    Working...
    X