  • bwa sampe hanging

    I apologize for asking what amounts to a pre-question. I know I don't have a complete description of the problem, but I'm a little stumped as to how to get more useful descriptive information.

    I am running bwa (0.5.8a) on 4 GAIIx lanes of paired-end human sequence. The machine is a 64-bit x86 machine with 32 GB of RAM, running Oracle Enterprise Linux (aka Red Hat Enterprise Linux 5, a sore subject, but the state of things).

    Generation of the .sai files using "bwa aln" works fine, but for three out of the four lanes the program hangs during the "bwa sampe" stage. As far as I can tell, it will stay running for hours with no further output.

    If I split the FASTQ files at about the last sequence that was output, the program completes for each fraction and I can merge the alignment SAM files with samtools -- but it's definitely an extra step I'd prefer to avoid. It does suggest, though, that the trigger isn't a single aberrant FASTQ entry.

    Any suggestions for further info I should spelunk that would be useful for troubleshooting this? Is there a good way to determine whether the .sai files are somehow corrupt? Anyone seen something (an odd character?) in a FASTQ file which can sometimes be troublesome?

    thanks in advance
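
On the "odd character" question, here is a minimal sketch of a FASTQ sanity check. It is not part of bwa; it assumes plain 4-line records (no wrapped sequence lines), and the function name `check_fastq` and the set of accepted base characters are my own choices for illustration.

```python
import io

# Assumed alphabet for illustration; widen it if your data uses IUPAC codes.
VALID_BASES = set("ACGTNacgtn.")

def check_fastq(handle):
    """Yield (record_number, problem) for each problem found in 4-line records."""
    record = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        record += 1
        if not header.startswith("@"):
            yield record, "header does not start with '@'"
        if not plus.startswith("+"):
            yield record, "separator line does not start with '+'"
        if len(seq) != len(qual):
            yield record, "sequence and quality lengths differ"
        bad = set(seq) - VALID_BASES
        if bad:
            yield record, "unexpected sequence characters: %r" % sorted(bad)
        # Quality characters should be printable ASCII in any common encoding.
        if any(not (33 <= ord(c) <= 126) for c in qual):
            yield record, "non-printable quality character"

if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACXT\n+\nIII\n")
    for number, problem in check_fastq(demo):
        print(number, problem)
```

To scan a real file, pass an open text-mode file handle instead of the `io.StringIO` demo. A clean report does not rule out a bwa-side problem; it only narrows the search.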

  • #2
    I've had sampe hang before when the pairs were not lined up correctly in the two files. Since splitting fixes your issue, this is probably not what's going on, but it doesn't hurt to check.
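
A quick way to check that the pairs line up is to walk both files together and compare read names. This is only a sketch: it assumes 4-line FASTQ records and mate suffixes of the form `/1` and `/2`, and the names `read_ids` and `first_mismatch` are hypothetical helpers, not anything shipped with bwa.

```python
import io
from itertools import zip_longest

def read_ids(handle):
    """Yield the read name of each 4-line record, with any /1 or /2 removed."""
    for i, line in enumerate(handle):
        if i % 4 == 0:  # header line of each record
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name.endswith("/1") or name.endswith("/2"):
                name = name[:-2]
            yield name

def first_mismatch(handle1, handle2):
    """Return (record_number, id1, id2) for the first out-of-sync pair, else None.

    If one file is shorter, the missing side is reported as None.
    """
    pairs = zip_longest(read_ids(handle1), read_ids(handle2))
    for number, (id1, id2) in enumerate(pairs, start=1):
        if id1 != id2:
            return number, id1, id2
    return None

if __name__ == "__main__":
    lane_1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nAC\n+\nII\n")
    lane_2 = io.StringIO("@a/2\nAC\n+\nII\n@c/2\nAC\n+\nII\n")
    print(first_mismatch(lane_1, lane_2))
```

It stops at the first disagreement, so it is cheap to run on whole lanes before starting sampe.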



    • #3
      I'm having similar problems, but with samse. The aln step works fine for me, but samse hangs for tens of hours and only writes "[bwa_read_seq] 0.0% bases are trimmed." to my *.sam file.

      Prior to running BWA, my pipeline includes a python script to convert the Illumina quality scores to phred33, and a perl script to trim reads that contain adapter sequence. I have also tried it without running my perl trimming script, but samse still hangs.

      Any advice on what else I can try to identify the problem is much appreciated.
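
Since the quality conversion is one of the custom steps in that pipeline, it may be worth re-checking it in isolation. The sketch below is not the poster's script; it assumes the input is Phred+64 (Illumina 1.3 or later) in 4-line FASTQ records. Older Solexa-scaled files use a different scale and would trip the range check here rather than be converted silently.

```python
import io

def phred64_to_phred33(qual):
    """Re-encode one quality string from ASCII offset 64 to offset 33."""
    out = []
    for c in qual:
        q = ord(c) - 64
        if q < 0:
            # Below '@': not valid Phred+64, so the input assumption is wrong.
            raise ValueError("quality %r is below the Phred+64 range" % c)
        out.append(chr(q + 33))
    return "".join(out)

def convert_fastq(src, dst):
    """Copy a 4-line FASTQ stream, re-encoding every quality line."""
    for i, line in enumerate(src):
        if i % 4 == 3:  # fourth line of each record holds the qualities
            line = phred64_to_phred33(line.rstrip("\r\n")) + "\n"
        dst.write(line)

if __name__ == "__main__":
    src = io.StringIO("@r1\nAC\n+\n@h\n")
    dst = io.StringIO()
    convert_fastq(src, dst)
    print(dst.getvalue(), end="")
```

A `ValueError` on real data would suggest the file is already Phred+33 or uses the older Solexa scale, either of which would be worth knowing before blaming samse.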



      • #4
        Thanks for the feedback -- that does give me the idea of trying bwa samse on each original file to see if none, one or both of the paired end files causes trouble.

        I'll also re-check that the two files have the same ids in the same order.

        THANKS!!



        • #5
          One more possible cookie crumb: the log files for hung runs generally end with either (highlighting mine)

          [infer_isize] (25, 50, 75) percentile: (21322, 49144, 77320)
          [infer_isize] low and high boundaries: 36 and 189316 for estimating avg and std
          [infer_isize] inferred external isize from 21 pairs: 46521.000 +/- 26509.447
          [infer_isize] skewness: 0.214; kurtosis: -0.983; ap_prior: 1.00e-05
          [infer_isize] inferred maximum insert size: 207433 (6.07 sigma)
          [bwa_sai2sam_pe_core] time elapses: 72.34 sec
          [bwa_sai2sam_pe_core] changing coordinates of 3124 alignments.
          [bwa_sai2sam_pe_core] align unmapped mate...


          OR

          [infer_isize] fail to infer insert size: too few good pairs
          [bwa_sai2sam_pe_core] time elapses: 77.51 sec
          [bwa_sai2sam_pe_core] changing coordinates of 3054 alignments.
          [bwa_sai2sam_pe_core] align unmapped mate...


          This suggests that for some reason I have a large patch of sequences which don't align & are confusing the insert size calculator. However, it must be said that failed runs contain spots like this that they do get through (but perhaps very slowly; I am letting a run go for several days extra over the weekend just to see if it ever exits).

          Oddly, every time I've tried to split a file into two parts they both complete in reasonable time, even when the breakpoint is near where the full run fails.

          I am curious why bwa sampe is recomputing the insert size distribution so many times -- it would be surprising if that varied through a run (but then again, I'm surprised to find such a big stretch of fragments that don't imply one). Perhaps failed infer_isize batches should cause the reuse of a previously computed batch?

          I think I'll gin up some courage soon to look at the source code & perhaps even try the above suggestion.
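
To make the fallback idea concrete, here is a sketch of it. It is not bwa's code. The factor of 2 is inferred from the first log above, where the high boundary 189316 equals 77320 + 2 * (77320 - 21322); treating the low boundary of 36 as a floor supplied by the caller, the `min_pairs` threshold, and every function name are assumptions for illustration.

```python
import statistics

def isize_bounds(p25, p75, floor, factor=2.0):
    """Outlier window around the interquartile range, clamped below at floor."""
    iqr = p75 - p25
    low = max(int(p25 - factor * iqr), floor)
    high = int(p75 + factor * iqr)
    return low, high

def estimate(isizes, floor, min_pairs=20):
    """Return (mean, std) of insert sizes inside the window, or None if too few."""
    if len(isizes) < min_pairs:
        return None
    s = sorted(isizes)
    p25, p75 = s[len(s) // 4], s[(3 * len(s)) // 4]  # crude quartiles
    low, high = isize_bounds(p25, p75, floor)
    kept = [x for x in s if low <= x <= high]
    if len(kept) < min_pairs:
        return None
    return statistics.mean(kept), statistics.pstdev(kept)

def estimate_with_fallback(batches, floor):
    """Yield one estimate per batch, reusing the last good one when a batch fails."""
    last = None
    for isizes in batches:
        current = estimate(isizes, floor)
        if current is not None:
            last = current
        yield last  # stays None until some batch has succeeded

if __name__ == "__main__":
    print(isize_bounds(21322, 77320, 36))
    good = list(range(180, 220))
    for result in estimate_with_fallback([good, [200, 210], good], 36):
        print(result)
```

The second batch in the demo is too small to estimate from, so it inherits the first batch's result instead of reporting a failure.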



          • #6
            Hi Krobison,

            I am having exactly the same problem you described.
            Did you find out how to solve it?

            Thank you


            • #7
              OP, bwa sampe is a very slow step. It took my cluster more than two days to convert the two sai files into one sam file, for a data set of 100 million reads.
