Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA 0.5.7 Gap Inferrence Bug - fix?

    When running version 0.5.7 of BWA sampe to generate SAM files from paired end reads, I found that it was generally working fine, but occasionally I would get messages indicating unreasonable inferred maximum insert sizes. For example,

    Code:
    [bwa_sai2sam_pe_core] convert to sequence coordinate...
    [infer_isize] (25, 50, 75) percentile: (345105019, 629160202, 1506887065)
    [infer_isize] low and high boundaries: 49 and -2147483648 for estimating avg and std
    [infer_isize] inferred external isize from 29 pairs: 776017336.207 +/- 625290497.129
    [infer_isize] skewness: 0.656; kurtosis: -0.642
    [infer_isize] inferred maximum insert size: -861465347 (4.25 sigma)
    Running version 0.5.5 on the same data , I'd get a message like
    Code:
    [infer_isize] fail to infer insert size: weird pairing
    and expect that the maximum gap size would default to the one specified by the -a argument. Normally, this kind of behavior seems to be caused by trying to match sets of reads that are unmatched or out of order (bwa expects them to be in order and matched). BWA then would use unmatched reads to infer the pair gap width and get these large numbers. I checked my data and found this was not the case, my reads were properly paired. However the SAM files indicated that some elements of a given pair were mapped to different chromosomes. I suspected that BWA 0.5.7 was not checking for excessively long gaps during its gap inference function. To work around this I modified a few lines of bwape.c in the infer_isize() function, changing

    Code:
    if (p[0]->mapQ >= 20 && p[1]->mapQ >= 20)
        isizes[tot++] = (p[0]->pos < p[1]->pos)? p[1]->pos + p[1]->len - p[0]->pos : p[0]->pos + p[0]->len - p[1]->pos;
    to

    Code:
    if (p[0]->mapQ >= 20 && p[1]->mapQ >= 20) {
        //isizes[tot++] = (p[0]->pos < p[1]->pos)? p[1]->pos + p[1]->len - p[0]->pos : p[0]->pos + p[0]->len - p[1]->pos;
        x =  (p[0]->pos < p[1]->pos)? p[1]->pos + p[1]->len - p[0]->pos : p[0]->pos + p[0]->len - p[1]->pos;
        if (x < 50000) isizes[tot++] = x;       // ignore outrageous values, particularly wrap around or different chrosomes...jh
    }
    This effectively ignores any gap > 50000 during the gap inference process. It worked just fine, generating a maximum gap length of around 6500 (typical gap about 2500). This was consistent with other data in the set. While this is clearly a hack, it might be wise to set this ceiling to something like 10* (or 100*) the maximum size as defined by the -a command.

    -- Brad

  • #2
    I am having a similar problem like yours except the whole thing locks up and I had to force quit. Another colleague of mine is reporting segmentation error.

    [bwa_sai2sam_pe_core] convert to sequence coordinate...
    [infer_isize] (25, 50, 75) percentile: (407614414, 908818089, 1560057127)
    [infer_isize] low and high boundaries: 74 and -2147483648 for estimating avg and std
    [infer_isize] inferred external isize from 103891 pairs: 880161191.398 +/- 592618141.592
    [infer_isize] skewness: 0.357; kurtosis: -0.990
    [infer_isize] inferred maximum insert size: -896179003 (4.25 sigma)
    [bwa_sai2sam_pe_core] time elapses: 19.97 sec
    [bwa_sai2sam_pe_core] changing coordinates of 4514 alignments.

    This is the first time this has happened to me and admittedly I was trying out different options in bwa aln before sampe. btw this does not happen if I use default bwa aln and I am using v0.5.7

    So I was using bwa aln -e 10 -l 36 -n 3 to create the sai file then default sampe; the above output is the first thing it dumps out ... followed by everything locking up.

    One more thing I noticed as I checked my resource usage ... the RAM usage of sampe was ridiculous as it chewed through all 14GB then all my swap mem
    Last edited by oO_Willie; 05-17-2010, 02:40 AM. Reason: more information

    Comment


    • #3
      This is similar to what was happening to me. The high memory usage and core dumps are symptoms of incorrectly inferred gap sizes. Use the above patch and you should be fine.

      Comment


      • #4
        Thanks, I'll give it a go and see if it fixes the problem.

        By the way, it wouldn't be a problem with the fastq file itself would it? I was just starting to use the FASTX-Toolkit for artifact and quality filtering on the illumina sequence.txt file before using maq's ill2sanger fastq conversion prior to bwa aln.

        Comment


        • #5
          that's didn't work out ... the output is something like what you were seeing after the fix

          [bwa_sai2sam_pe_core] convert to sequence coordinate...
          [infer_isize] (25, 50, 75) percentile: (96, 166, 10830)
          [infer_isize] low and high boundaries: 74 and 32298 for estimating avg and std
          [infer_isize] inferred external isize from 36 pairs: 5413.556 +/- 9456.975
          [infer_isize] skewness: 1.929; kurtosis: 2.402
          [infer_isize] inferred maximum insert size: 64520 (6.25 sigma)
          [bwa_sai2sam_pe_core] time elapses: 17.18 sec
          [bwa_sai2sam_pe_core] changing coordinates of 31 alignments.
          [bwa_sai2sam_pe_core] align unmapped mate...

          however, it just stays like this for half an hour then drops out. Memory and cpu usage is normal though ... not sure what's going

          Is it possible that if there are simply not enough pairs due to filtering it will give this type of odd error?

          Comment


          • #6
            Brad,

            Thanks a ton! I was seeing the same behavior (see below) and your hack seems to have fixed it. I am using pretty high quality 51-bp Illumina data and bwa v0.5.7. Will post again IF this runs into some other downstream problem....

            I agree that this should be patched more permanently in the next version update.

            Code:
            % bwa sampe -a 350 -P etc...
            [bwa_sai2sam_pe_core] convert to sequence coordinate...
            [infer_isize] (25, 50, 75) percentile: (360127461, 857663215, 1493258345)
            [infer_isize] low and high boundaries: 51 and -2147483648 for estimating avg and std
            [infer_isize] inferred external isize from 88422 pairs: 843605219.297 +/- 599049424.777
            [infer_isize] skewness: 0.383; kurtosis: -0.952
            [infer_isize] inferred maximum insert size: -905402021 (4.25 sigma)
            [bwa_sai2sam_pe_core] time elapses: 78.16 sec
            [bwa_sai2sam_pe_core] changing coordinates of 16955 alignments.
            [bwa_sai2sam_pe_core] align unmapped mate...
            
            <here bwa hangs and has to be manually aborted>
            After recompiling bwa with Brad's hack:

            Code:
            [bwa_sai2sam_pe_core] convert to sequence coordinate...
            [infer_isize] (25, 50, 75) percentile: (195, 209, 225)
            [infer_isize] low and high boundaries: 135 and 285 for estimating avg and std
            [infer_isize] inferred external isize from 2348 pairs: 206.669 +/- 20.696
            [infer_isize] skewness: -0.190; kurtosis: 1.186
            [infer_isize] inferred maximum insert size: 352 (7.02 sigma)
            [bwa_sai2sam_pe_core] time elapses: 99.67 sec
            [bwa_sai2sam_pe_core] changing coordinates of 2617 alignments.
            [bwa_sai2sam_pe_core] align unmapped mate...
            [bwa_paired_sw] 37106 out of 45234 Q17 singletons are mated.
            [bwa_paired_sw] 102020 out of 153367 Q17 discordant pairs are fixed.
            [bwa_sai2sam_pe_core] time elapses: 44.25 sec
            [bwa_sai2sam_pe_core] refine gapped alignments... 0.97 sec
            [bwa_sai2sam_pe_core] print alignments... 2.36 sec
            [bwa_sai2sam_pe_core] 262144 sequences have been processed.
            [bwa_sai2sam_pe_core] convert to sequence coordinate...
            [infer_isize] (25, 50, 75) percentile: (195, 209, 225)
            [infer_isize] low and high boundaries: 135 and 285 for estimating avg and std
            [infer_isize] inferred external isize from 2343 pairs: 207.253 +/- 20.573
            [infer_isize] skewness: -0.036; kurtosis: 1.300
            [infer_isize] inferred maximum insert size: 352 (7.02 sigma)
            [bwa_sai2sam_pe_core] time elapses: 91.20 sec
            [bwa_sai2sam_pe_core] changing coordinates of 2598 alignments.
            [bwa_sai2sam_pe_core] align unmapped mate...
            [bwa_paired_sw] 37175 out of 44986 Q17 singletons are mated.
            [bwa_paired_sw] 102470 out of 153972 Q17 discordant pairs are fixed.
            [bwa_sai2sam_pe_core] time elapses: 44.51 sec
            [bwa_sai2sam_pe_core] refine gapped alignments... 0.97 sec
            [bwa_sai2sam_pe_core] print alignments... 2.36 sec
            [bwa_sai2sam_pe_core] 524288 sequences have been processed.
            
            etc....
            Last edited by bhootnaath; 05-27-2010, 11:54 AM.

            Comment


            • #7
              The same problem happened on 9.5 million paired end ABI SOLiD reads. BWA (0.5.7 (r1310)) sampe ran for 60 hours and output 0 bytes in SAM format. From the console output below, BWA seems to have inferred wrong isize. It would be good to see patches in BWA in next version.

              .....
              [bwa_aln_core] 9450000 sequences have been processed.
              [bwa_sai2sam_pe_core] convert to sequence coordinate...
              [infer_isize] (25, 50, 75) percentile: (1517, 181767129, 1010065328)
              [infer_isize] low and high boundaries: 24 and -2147483648 for
              estimating avg and std
              [infer_isize] inferred external isize from 48657 pairs: 477246990.528
              +/- 614057333.704
              [infer_isize] skewness: 1.075; kurtosis: -0.111
              [infer_isize] inferred maximum insert size: -1201836064 (4.26 sigma)
              [bwa_sai2sam_pe_core] time elapses: 287.32 sec
              [bwa_sai2sam_pe_core] changing coordinates of 8070 alignments.
              [bwa_sai2sam_pe_core] align unmapped mate...
              ....

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 11:49 AM
              0 responses
              13 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-24-2024, 08:47 AM
              0 responses
              16 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              61 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Working...
              X