  • bwa sampe hanging

    I apologize for asking what amounts to a pre-question. I know I don't have a complete description of the problem, but I'm a little stumped about how to get more useful diagnostic information.

    I am running bwa (0.5.8a) on 4 GAIIx lanes of paired-end human sequence. The machine is a 64-bit x86 machine with 32 GB of RAM, running Oracle Enterprise Linux (aka Red Hat Enterprise Linux 5, a sore subject, but the state of things).

    Generation of the .sai files using "bwa aln" works fine, but for three out of the four lanes, the program hangs during the "bwa sampe" stage. As far as I can tell, it will stay running for hours with no further output.

    If I split the FASTQ files at about the last sequence which is output, then the program completes for each fraction and I can merge the alignment SAM files with samtools -- but it's definitely an extra step I'd prefer to avoid. It does suggest, though, that the trigger isn't simply a single aberrant FASTQ entry.
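For anyone wanting to try the same splitting workaround, here is a minimal sketch in Python. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function name, the output naming scheme, and the chunk size are all made up for illustration. Running it with the same chunk size on both mate files should keep the pairs in step.

```python
import itertools


def split_fastq(path, records_per_chunk, prefix):
    """Write consecutive chunks of FASTQ records and return the chunk paths.

    Assumes exactly four lines per record (hypothetical helper, not part of bwa).
    """
    chunk_paths = []
    with open(path) as handle:
        while True:
            # Pull the next block of whole records (4 lines each).
            lines = list(itertools.islice(handle, 4 * records_per_chunk))
            if not lines:
                break
            chunk_path = f"{prefix}.{len(chunk_paths):04d}.fastq"
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then go through the usual aln and sampe steps separately, with the resulting alignments merged afterwards.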

    Any suggestions for further info I should spelunk that would be useful for troubleshooting this? Is there a good way to determine whether the .sai files are somehow corrupt? Anyone seen something (an odd character?) in a FASTQ file which can sometimes be troublesome?
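One way to hunt for an odd character or a malformed entry is a small sanity checker. The sketch below is only an assumption about what "troublesome" might mean: it flags carriage returns, missing "@" or "+" lines, bases outside ACGTN, sequence/quality length mismatches, and non-printable quality characters. It assumes four-line records, and the function name is hypothetical.

```python
import re

# Accept only unambiguous bases plus N (an assumption; relax if your data uses IUPAC codes).
SEQ_OK = re.compile(r"[ACGTNacgtn]+")


def find_bad_records(path):
    """Return a list of (record_number, reason) for suspicious FASTQ records."""
    problems = []
    # newline="" keeps any "\r" characters so DOS line endings can be detected.
    with open(path, newline="") as handle:
        number = 0
        while True:
            lines = [handle.readline() for _ in range(4)]
            if not lines[0]:
                break  # clean end of file
            number += 1
            if not lines[3]:
                problems.append((number, "truncated record"))
                break
            name, seq, plus, qual = (line.rstrip("\n") for line in lines)
            if "\r" in name + seq + plus + qual:
                problems.append((number, "carriage return in record"))
            elif not name.startswith("@"):
                problems.append((number, "header does not start with '@'"))
            elif not plus.startswith("+"):
                problems.append((number, "third line does not start with '+'"))
            elif not SEQ_OK.fullmatch(seq):
                problems.append((number, "unexpected character in sequence"))
            elif len(seq) != len(qual):
                problems.append((number, "sequence and quality lengths differ"))
            elif any(not 33 <= ord(char) <= 126 for char in qual):
                problems.append((number, "non-printable quality character"))
    return problems
```

An empty list does not prove the file is fine for bwa; it only rules out the specific problems listed above.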

    thanks in advance

  • #2
    I've had sampe hang before when the pairs were not lined up correctly in the two files. Since splitting fixes your issue, this is probably not what's going on, but it doesn't hurt to check.
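A quick way to check that the pairs line up is to walk both FASTQ files in parallel and compare read names. This is a rough sketch, assuming four-line records and names that differ only by a trailing "/1" or "/2"; the helper names are hypothetical.

```python
import itertools


def base_name(header):
    """Read name without the leading '@', any comment, or a trailing /1 or /2."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(path1, path2):
    """Return (record_number, header1, header2) for the first unpaired record, else None."""
    with open(path1) as f1, open(path2) as f2:
        # Every fourth line, starting with the first, is a header line.
        headers1 = itertools.islice(f1, 0, None, 4)
        headers2 = itertools.islice(f2, 0, None, 4)
        paired = itertools.zip_longest(headers1, headers2)
        for number, (h1, h2) in enumerate(paired, start=1):
            # A None header means one file ran out of records before the other.
            if h1 is None or h2 is None or base_name(h1) != base_name(h2):
                return number, h1, h2
    return None
```

A return value of None means the names agree record for record and both files have the same number of records.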



    • #3
      I'm having similar problems, but with samse. The aln step works fine for me, but samse hangs for tens of hours and only writes "[bwa_read_seq] 0.0% bases are trimmed." to my *.sam file.

      Prior to running BWA, my pipeline includes a python script to convert the Illumina quality scores to phred33, and a perl script to trim reads that contain adapter sequence. I have also tried it without running my perl trimming script, but samse still hangs.
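The conversion script itself isn't shown in the thread, so the following is only a sketch of what such a step typically looks like. It assumes the input uses the Illumina 1.3+ encoding (Phred scores stored as ASCII code minus 64); older Solexa-scaled files need a different formula. The function name is hypothetical.

```python
def phred64_to_phred33(quality):
    """Convert a Phred+64 quality string to Phred+33 (assumes Illumina 1.3+ input)."""
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            # A character below '@' cannot be Phred+64; the file may already be Phred+33.
            raise ValueError(f"character {char!r} is not valid Phred+64")
        converted.append(chr(score + 33))
    return "".join(converted)
```

The ValueError is a cheap guard against converting a file twice, which would otherwise corrupt the quality line without any warning.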

      Any advice on what else I can try to identify the problem is much appreciated.



      • #4
        Thanks for the feedback -- that does give me the idea of trying bwa samse on each original file to see if none, one or both of the paired end files causes trouble.

        I'll also re-check that the two files have the same ids in the same order.

        THANKS!!



        • #5
          One more possible cookie crumb: the log files for hung runs generally end with one of the following two patterns.

          [infer_isize] (25, 50, 75) percentile: (21322, 49144, 77320)
          [infer_isize] low and high boundaries: 36 and 189316 for estimating avg and std
          [infer_isize] inferred external isize from 21 pairs: 46521.000 +/- 26509.447
          [infer_isize] skewness: 0.214; kurtosis: -0.983; ap_prior: 1.00e-05
          [infer_isize] inferred maximum insert size: 207433 (6.07 sigma)
          [bwa_sai2sam_pe_core] time elapses: 72.34 sec
          [bwa_sai2sam_pe_core] changing coordinates of 3124 alignments.
          [bwa_sai2sam_pe_core] align unmapped mate...


          OR

          [infer_isize] fail to infer insert size: too few good pairs
          [bwa_sai2sam_pe_core] time elapses: 77.51 sec
          [bwa_sai2sam_pe_core] changing coordinates of 3054 alignments.
          [bwa_sai2sam_pe_core] align unmapped mate...


          This suggests that, for some reason, I have a large patch of sequences which don't align and are confusing the insert size calculator. However, it must be said that in failed runs there are spots like this that it does get through (but perhaps very slowly; I am letting a run go for several extra days over the weekend just to see if it ever exits).

          Oddly, every time I've tried to split a file into two parts they both complete in reasonable time, even when the breakpoint is near where the full run fails.

          I am curious why bwa sampe is recomputing the insert size distribution so many times -- it would be surprising if that varied through a run (but then again, I'm surprised to find such a big stretch of fragments that don't imply one). Perhaps failed infer_isize batches should cause the reuse of a previously computed batch?
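To make the suggestion concrete, here is a toy sketch of the fallback idea in Python. It is not bwa's code and does not reproduce bwa's estimator: the mean/standard deviation estimate, the `MIN_GOOD_PAIRS` threshold, and all function names are assumptions for illustration only.

```python
import statistics

# Hypothetical threshold for "too few good pairs"; not taken from the bwa source.
MIN_GOOD_PAIRS = 20


def estimate_isize(insert_sizes):
    """Return (mean, stdev) for one batch, or None if the batch has too few pairs."""
    if len(insert_sizes) < MIN_GOOD_PAIRS:
        return None
    return statistics.mean(insert_sizes), statistics.stdev(insert_sizes)


def isize_per_batch(batches):
    """Estimate per batch, reusing the last successful estimate when a batch fails."""
    last_good = None
    results = []
    for batch in batches:
        estimate = estimate_isize(batch)
        if estimate is None:
            # Fall back to the previous estimate; this is still None
            # if no earlier batch has succeeded.
            estimate = last_good
        else:
            last_good = estimate
        results.append(estimate)
    return results
```

With this scheme, a batch dominated by unalignable reads would inherit the distribution from the last well-behaved batch instead of failing or producing a wild estimate from a couple dozen pairs.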

          I think I'll gin up some courage soon to look at the source code & perhaps even try the above suggestion.



          • #6
            Hi krobison,

            I am having exactly the same problem you were having.
            Did you find out how to solve it?

            Thank you


            • #7
              OP, bwa sampe is a very slow step. It took my cluster more than two days to convert the two sai files into one sam file, for a data set of 100 million reads.
