  • BWA causing shut down?

    Has anyone else experienced a server shutdown when running bwa aln on more than 20 cores?

  • #2
    We encountered a similar incident on our 16-core server 2 days ago. Our IT department is still looking into it, but they have had no luck so far.

    • #3
      Can you give details of the hardware and OS you are running this on: storage, kernel version, libs, distribution, etc.?
      It'd be interesting to see if there is a correlation.
      -drd

      • #4
        jasonbcold,

        Well, this is a common thing I hear from many people doing analysis work on multi-CPU/multi-core machines. In my view, if the software is parallelized and optimized for a shared- or distributed-memory architecture in the right way, then it should not hiccup when you scale up. How are you achieving the parallelism across many cores? I think I can help if you can share information on the setup (h/w and s/w).

        • #5
          Hi drio and geschickten,
          the following bwa command works fine on our server:

          bwa aln -e 5 -t 15 [ref.fa] [reads.fastq] > [alignment.sai]

          but when we increase the processor number to 25 (of 32 cores):

          bwa aln -e 5 -t 25 [ref.fa] [reads.fastq] > [alignment.sai]
          the server shuts down reproducibly.

          And here are the hardware and software specifications:
          sba@solexa:~$ uname -a
          Linux solexa 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64 GNU/Linux

          Distribution is Debian.

          The machine has 8 quad-core AMD Opteron 8380 processors (thus, a total of 32 cores).

          Many hard discs attached to

          RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)
          RAID bus controller: 3ware Inc 9690SA SAS/SATA-II RAID PCIe (rev 01)

          that are configured as JBOD. The filesystem is ext4.

          • #6
            Hi jasonbcold,

            I guess it has nothing to do with the h/w; if you look at the pthreads part of the BWA code you will understand. I still have to dive deep into the code, but at a high level my guess is that the code is not guaranteed to scale to many CPUs/cores; that is why paradigms like OpenMP and MPI exist. Anyway, if you need professional help, we can customize the s/w for your needs. Let me know if you are interested.

            • #7
              I have never run bwa aln with more than 8 CPUs, so I have not seen this problem. It seems to me that a program that works with 15 CPUs should also work with 25 CPUs. I do not know why bwa fails here. I will seek more advice on this issue.

              • #8
                Our system is fairly similar:

                Linux pipeline 2.6.18-128.el5_BITS_XFS #1 SMP Thu Sep 3 17:05:45 BST 2009 x86_64 x86_64 x86_64 GNU/Linux

                16 cores Intel(R) Xeon(R) CPU X7350 @ 2.93GHz

                32GB memory.

                The crash also occurred during the bwa aln step, when we were using 12 of the CPUs.

                • #9
                  No problems on our 24 core (48 thread) 128GB RAM servers, on which we use bwa at least weekly.

                  Linux mhh-bio02 2.6.32.43-0.4-default #1 SMP 2011-07-14 14:47:44 +0200 x86_64 x86_64 x86_64 GNU/Linux

                  Also Xeons: Intel(R) Xeon(R) CPU E7540 @ 2.00GHz

                  • #10
                    No problem running 24 threads on 12 physical cores (dual 6-core Xeon X5690 with hyperthreading). Linux 2.6.35-32-generic #66-Ubuntu SMP (x86_64), BWA 0.5.9-r16.

                    • #11
                      Here's what's going on ...

                      BWA is built for speed. A good thing.
                      To make it fast, BWA skips error checking. Not good, but if you *know* this, you can deal with the problems.

                      I can grep for "malloc" in the source and the 2 lines after "malloc" (-A2) ...
                      ___________________________________________________
                      -bash-3.00$ grep -A2 malloc *.c
                      bwape.c: z->a = (bwtint_t*)malloc(sizeof(bwtint_t) * z->n);
                      bwape.c- for (l = r->k; l <= r->l; ++l)
                      bwape.c- z->a[l - r->k] = r->a? bwt_sa(bwt[0], l) : bwt[1]->seq_len - (bwt_sa(bwt[1], l) + p[j]->len);
                      --
                      cs2nt.c: ta = (uint8_t*)malloc(len * 7);
                      cs2nt.c- nt_ref = ta;
                      cs2nt.c- cs_read = nt_ref + len;
                      --
                      is.c: } else if ((C = B = (int *) malloc(k * sizeof(int))) == NULL) return -2;
                      is.c- getCounts(T, C, n, k, cs);
                      is.c- getBuckets(C, B, k, 1); /* find ends of buckets */
                      --
                      is.c: } else if ((C = B = (int *) malloc(k * sizeof(int))) == NULL) return -2;
                      is.c- /* put all left-most S characters into their buckets */
                      is.c- getCounts(T, C, n, k, cs);
                      --
                      simple_dp.c: p->s = (unsigned char*)malloc(p->l + 1);
                      simple_dp.c- memcpy(p->s, seq->seq.s, p->l);
                      simple_dp.c- p->s[p->l] = 0;
                      --
                      stdaln.c: aa = (AlnAln*)malloc(sizeof(AlnAln));
                      stdaln.c- aa->path = 0;
                      stdaln.c- aa->out1 = aa->out2 = aa->outm = 0;
                      --
                      stdaln.c: dpcell = (dpcell_t**)malloc(sizeof(dpcell_t*) * (len2 + 1));
                      stdaln.c- for (j = 0; j <= len2; ++j)
                      stdaln.c: dpcell[j] = (dpcell_t*)malloc(sizeof(dpcell_t) * end);
                      stdaln.c- for (j = b2 + 1; j <= len2; ++j)
                      stdaln.c- dpcell[j] -= j - b2;
                      stdaln.c: curr = (dpscore_t*)malloc(sizeof(dpscore_t) * (len1 + 1));
                      stdaln.c: last = (dpscore_t*)malloc(sizeof(dpscore_t) * (len1 + 1));
                      stdaln.c-
                      stdaln.c- /* set first row */
                      --
                      stdaln.c: suba = (int*)malloc(sizeof(int) * (len2 + 1));
                      stdaln.c: eh = (NT_LOCAL_SCORE*)malloc(sizeof(NT_LOCAL_SCORE) * (len1 + 1));
                      stdaln.c: s_array = (int**)malloc(sizeof(int*) * N_MATRIX_ROW);
                      stdaln.c- for (i = 0; i != N_MATRIX_ROW; ++i)
                      stdaln.c: s_array[i] = (int*)malloc(sizeof(int) * len1);
                      stdaln.c- /* initialization */
                      stdaln.c- aln_init_score_array(seq1, len1, N_MATRIX_ROW, score_matrix, s_array);
                      --
                      stdaln.c: seq11 = (unsigned char*)malloc(sizeof(unsigned char) * len1);
                      stdaln.c: seq22 = (unsigned char*)malloc(sizeof(unsigned char) * len2);
                      stdaln.c: aa->path = (path_t*)malloc(sizeof(path_t) * (len1 + len2 + 1));
                      stdaln.c-
                      stdaln.c- if (ap->row < 10) { /* 4-nucleotide alignment */
                      --
                      stdaln.c: out1 = aa->out1 = (char*)malloc(sizeof(char) * (aa->path_len + 1));
                      stdaln.c: out2 = aa->out2 = (char*)malloc(sizeof(char) * (aa->path_len + 1));
                      stdaln.c: outm = aa->outm = (char*)malloc(sizeof(char) * (aa->path_len + 1));
                      stdaln.c-
                      stdaln.c- --seq1; --seq2;
                      --
                      stdaln.c: cigar = (uint32_t*)malloc(*n_cigar * 4);
                      stdaln.c-
                      stdaln.c- cigar[0] = 1u << 4 | path[path_len-1].ctype;
                      __________________________________________________________

                      Notice how the return value from malloc is not checked (the two calls in is.c are the exception)? If there's plenty of memory, no problem. But if you're running 8 bwas, other users are doing other stuff, and one of the input files has weird stuff in it, memory usage spikes and suddenly there's no more memory: malloc fails and the behavior is undefined.

                      Sad truth is, your 4-core, 8GB system can't always handle it.

                      When BWA locks up your system, just dial it back a little or try a box with more memory.
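
For comparison, here is a minimal sketch of what a checked allocation could look like. The helper names `checked_malloc` and `alloc_int_array` are hypothetical and are not part of BWA; the sketch only illustrates the general pattern of testing malloc's return value before using the pointer.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of BWA: allocate n bytes or exit with a message. */
static void *checked_malloc(size_t n, const char *what)
{
    void *p = malloc(n ? n : 1); /* malloc(0) may legally return NULL */
    if (p == NULL) {
        fprintf(stderr, "out of memory allocating %zu bytes for %s\n", n, what);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Allocate an array of `count` ints, refusing sizes that would overflow size_t. */
static int *alloc_int_array(size_t count)
{
    if (count > SIZE_MAX / sizeof(int)) {
        fprintf(stderr, "requested array is too large\n");
        exit(EXIT_FAILURE);
    }
    return (int *)checked_malloc(count * sizeof(int), "int array");
}
```

A caller would write something like `int *C = alloc_int_array(k);` and free it as usual. Note that on Linux with default overcommit settings, malloc can return a non-NULL pointer even when memory is effectively exhausted, so a check like this catches only some failures.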


                      Similarly we can look at fread() function ...
                      _____

                      -bash-3.00$ grep -A3 fread *.c
                      bwape.c: fread(&n_aln, 4, 1, fp_sa[j]);
                      bwape.c- if (n_aln > kv_max(d->aln[j]))
                      bwape.c- kv_resize(bwt_aln1_t, d->aln[j], n_aln);
                      bwape.c- d->aln[j].n = n_aln;
                      bwape.c: fread(d->aln[j].a, sizeof(bwt_aln1_t), n_aln, fp_sa[j]);
                      bwape.c- kv_copy(bwt_aln1_t, buf[j][i].aln, d->aln[j]); // backup d->aln[j]
                      bwape.c- // generate SE alignment and mapping quality
                      bwape.c- bwa_aln2seq(n_aln, d->aln[j].a, p[j]);
                      --
                      bwape.c: fread(pacseq, 1, bns->l_pac/4+1, bns->fp_pac);
                      bwape.c- } else pacseq = (ubyte_t*)_pacseq;
                      bwape.c- if (!popt->is_sw || ii->avg < 0.0) return pacseq;
                      bwape.c-
                      --
                      bwape.c: fread(&opt, sizeof(gap_opt_t), 1, fp_sa[0]);
                      bwape.c- ks[0] = bwa_open_reads(opt.mode, fn_fa[0]);
                      bwape.c- opt0 = opt;
                      bwape.c: fread(&opt, sizeof(gap_opt_t), 1, fp_sa[1]); // overwritten!
                      bwape.c- ks[1] = bwa_open_reads(opt.mode, fn_fa[1]);
                      bwape.c- if (!(opt.mode & BWA_MODE_COMPREAD)) {
                      bwape.c- popt->type = BWA_PET_SOLID;
                      --
                      bwape.c: fread(pac, 1, bns->l_pac/4+1, bns->fp_pac);
                      bwape.c- }
                      bwape.c- }
                      bwape.c-
                      --
                      bwase.c: fread(ntpac, 1, ntbns->l_pac/4 + 1, ntbns->fp_pac);
                      bwase.c- }
                      bwase.c-
                      bwase.c- if (!_pacseq) {
                      --
                      bwase.c: fread(pacseq, 1, bns->l_pac/4+1, bns->fp_pac);
                      bwase.c- } else pacseq = _pacseq;
                      bwase.c- for (i = 0; i != n_seqs; ++i) {
                      bwase.c- bwa_seq_t *s = seqs + i;
                      --
                      bwase.c: fread(&opt, sizeof(gap_opt_t), 1, fp_sa);
                      bwase.c- if (!(opt.mode & BWA_MODE_COMPREAD)) // in color space; initialize ntpac
                      bwase.c- ntbns = bwa_open_nt(prefix);
                      bwase.c- bwa_print_sam_SQ(bns);
                      --
                      bwase.c: fread(&n_aln, 4, 1, fp_sa);
                      bwase.c- if (n_aln > m_aln) {
                      bwase.c- m_aln = n_aln;
                      bwase.c- aln = (bwt_aln1_t*)realloc(aln, sizeof(bwt_aln1_t) * m_aln);
                      --
                      bwase.c: fread(aln, sizeof(bwt_aln1_t), n_aln, fp_sa);
                      bwase.c- bwa_aln2seq_core(n_aln, aln, p, 1, n_occ);
                      bwase.c- }
                      bwase.c-
                      --
                      bwtio.c: fread(&primary, sizeof(bwtint_t), 1, fp);
                      bwtio.c- xassert(primary == bwt->primary, "SA-BWT inconsistency: primary is not the same.");
                      bwtio.c: fread(skipped, sizeof(bwtint_t), 4, fp); // skip
                      bwtio.c: fread(&bwt->sa_intv, sizeof(bwtint_t), 1, fp);
                      bwtio.c: fread(&primary, sizeof(bwtint_t), 1, fp);
                      bwtio.c- xassert(primary == bwt->seq_len, "SA-BWT inconsistency: seq_len is not the same.");
                      bwtio.c-
                      bwtio.c- bwt->n_sa = (bwt->seq_len + bwt->sa_intv) / bwt->sa_intv;
                      --
                      bwtio.c: fread(bwt->sa + 1, sizeof(bwtint_t), bwt->n_sa - 1, fp);
                      bwtio.c- fclose(fp);
                      bwtio.c-}
                      bwtio.c-
                      --
                      bwtio.c: fread(&bwt->primary, sizeof(bwtint_t), 1, fp);
                      bwtio.c: fread(bwt->L2+1, sizeof(bwtint_t), 4, fp);
                      bwtio.c: fread(bwt->bwt, 4, bwt->bwt_size, fp);
                      bwtio.c- bwt->seq_len = bwt->L2[4];
                      bwtio.c- fclose(fp);
                      bwtio.c- bwt_gen_cnt_table(bwt);
                      --
                      bwtmisc.c: fread(&c, 1, 1, fp);
                      bwtmisc.c- fclose(fp);
                      bwtmisc.c- return (pac_len - 1) * 4 + (int)c;
                      bwtmisc.c-}
                      --
                      bwtmisc.c: fread(buf2, 1, pac_size, fp);
                      bwtmisc.c- fclose(fp);
                      bwtmisc.c- memset(bwt->L2, 0, 5 * 4);
                      bwtmisc.c- buf = (ubyte_t*)calloc(bwt->seq_len + 1, 1);
                      --
                      bwtmisc.c: fread(bufin, 1, pac_len, fp);
                      bwtmisc.c- fclose(fp);
                      bwtmisc.c- for (i = seq_len - 1, j = 0; i >= 0; --i) {
                      bwtmisc.c- int c = bufin[i>>2] >> ((~i&3)<<1) & 3;
                      --
                      bwtmisc.c: fread(pac, 1, bns->l_pac/4+1, bns->fp_pac);
                      bwtmisc.c- rewind(bns->fp_pac);
                      bwtmisc.c- c1 = pac[0]>>6; cspac[0] = c1<<6;
                      bwtmisc.c- for (i = 1; i < bns->l_pac; ++i) {
                      --
                      bwtsw2_aux.c: fread(pac, 1, bns->l_pac/4+1, bns->fp_pac);
                      bwtsw2_aux.c- fp = xzopen(fn, "r");
                      bwtsw2_aux.c- ks = kseq_init(fp);
                      bwtsw2_aux.c- _seq = calloc(1, sizeof(bsw2seq_t));

                      ___________

                      Note that the return value from fread() is not checked. Did it succeed? Not sure, but the code assumes it did. Many strange errors in BWA occur because the user feeds bad input into it.
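
A checked read might look like the sketch below. `read_exact` is a hypothetical helper, not a BWA function; it compares the item count returned by fread() with the count requested and uses ferror() to tell an I/O error from a truncated file.

```c
#include <stdio.h>

/* Hypothetical helper, not part of BWA: read exactly `count` items of `size`
 * bytes each. Returns 0 on success, -1 on a short read or an I/O error. */
static int read_exact(void *buf, size_t size, size_t count, FILE *fp)
{
    size_t got = fread(buf, size, count, fp);
    if (got != count) {
        if (ferror(fp))
            fprintf(stderr, "read error after %zu of %zu items\n", got, count);
        else
            fprintf(stderr, "unexpected end of file: wanted %zu items, got %zu\n",
                    count, got);
        return -1;
    }
    return 0;
}
```

With a helper like this, a truncated or corrupt index or .sai file would produce an error message at the point of the read, rather than leaving the program to continue with partially filled buffers.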

                      ________ bottom line is this
                      1) if it locks, run on a bigger box
                      2) check your inputs
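
As one way to act on the second point, the sketch below runs a basic structural check on a FASTQ file before it is handed to the aligner. It assumes plain four-line records (header starting with `@`, sequence, separator starting with `+`, quality string of the same length as the sequence) and lines shorter than `MAX_LINE`. The function names are hypothetical and this is not part of BWA.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 65536 /* assumption: no line in the file is longer than this */

/* Read one line, strip the line ending, and return its length.
 * Returns -1 at end of file and -2 if the line does not fit in the buffer. */
static long read_line(FILE *fp, char *buf, size_t cap)
{
    if (fgets(buf, (int)cap, fp) == NULL) return -1;
    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') buf[--n] = '\0';
    else if (!feof(fp)) return -2; /* no newline and not at EOF: line too long */
    if (n > 0 && buf[n - 1] == '\r') buf[--n] = '\0';
    return (long)n;
}

/* Returns the number of records if the file is well formed, otherwise the
 * negated 1-based index of the first malformed record. Uses static buffers,
 * so it is not thread-safe. */
static long check_fastq(FILE *fp)
{
    static char hdr[MAX_LINE], seq[MAX_LINE], plus[MAX_LINE], qual[MAX_LINE];
    long records = 0;
    for (;;) {
        long h = read_line(fp, hdr, sizeof hdr);
        if (h == -1) return records; /* clean end of file */
        long s = read_line(fp, seq, sizeof seq);
        long p = read_line(fp, plus, sizeof plus);
        long q = read_line(fp, qual, sizeof qual);
        if (h < 1 || hdr[0] != '@' || s < 0 || p < 1 || plus[0] != '+' || q != s)
            return -(records + 1);
        ++records;
    }
}
```

A caller would open the reads with `fopen`, pass the handle to `check_fastq`, and refuse to start the alignment if the result is negative. Multi-line FASTQ and compressed input are outside the scope of this sketch.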
                      Last edited by Richard Finney; 03-06-2012, 02:48 PM.
