SEQanswers


krobison 08-09-2010 09:48 AM

bwa sampe hanging
 
I'll apologize for asking what amounts to a pre-question; I know I don't have a complete description of the problem, but I'm a little stumped about how to get more useful descriptive information.

I am running bwa (0.5.8a) on 4 GAIIx lanes of paired-end human sequence. The machine is a 64-bit x86 box with 32 GB of RAM running Oracle Enterprise Linux (aka Red Hat Enterprise Linux 5, a sore subject, but the state of things).

Generation of the .sai files using "bwa aln" works fine, but for three out of the four lanes, the program hangs during the "bwa sampe" stage. As far as I can tell, it will stay running for hours with no further output.

If I split the FASTQ files at about the last sequence which is output, then the program completes for each fraction and I can merge the alignment SAM files with samtools -- but it's definitely an extra step I'd prefer to avoid. But that does suggest that it isn't a simple aberrant FASTQ entry which is the trigger.
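For anyone wanting to reproduce the splitting workaround, here is a minimal sketch in Python. It assumes plain, uncompressed FASTQ with exactly four lines per record (no wrapped sequences); the function name and the chunk naming scheme are made up for illustration. Using the same chunk size on both mate files keeps the pairs in step.

```python
def split_fastq(path, records_per_chunk, prefix):
    """Split a FASTQ file into chunks holding whole 4-line records.

    Assumption: every record is exactly 4 lines. Returns the list of
    chunk file names, written as <prefix>.000.fastq, <prefix>.001.fastq, ...
    """
    lines_per_chunk = 4 * records_per_chunk
    chunk_paths = []
    out = None
    with open(path) as src:
        for line_no, line in enumerate(src):
            if line_no % lines_per_chunk == 0:
                # Start a new chunk on a record boundary.
                if out is not None:
                    out.close()
                out_path = f"{prefix}.{len(chunk_paths):03d}.fastq"
                out = open(out_path, "w")
                chunk_paths.append(out_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Calling it once per mate file with the same `records_per_chunk` (for example on hypothetical files `lane1_1.fastq` and `lane1_2.fastq`) gives chunk pairs that can each be run through sampe separately.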

Any suggestions for further info I should spelunk that would be useful for troubleshooting this? Is there a good way to determine whether the .sai files are somehow corrupt? Anyone seen something (an odd character?) in a FASTQ file which can sometimes be troublesome?
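One way to hunt for an odd character is a quick structural scan of the FASTQ file. The sketch below is only a sanity check under stated assumptions: four lines per record, bases limited to A/C/G/T/N (plus "." for no-calls), and quality characters in the printable ASCII range. It says nothing about what bwa itself accepts or rejects.

```python
import re

# Assumed alphabet: standard bases, N, and "." for no-calls.
VALID_SEQ = re.compile(r"[ACGTNacgtn.]*")

def check_fastq(lines):
    """Yield (record_number, message) for each problem found."""
    record = []
    n = 0
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        n += 1
        header, seq, plus, qual = record
        record = []
        if not header.startswith("@"):
            yield n, "header does not start with '@'"
        if not plus.startswith("+"):
            yield n, "third line does not start with '+'"
        if not VALID_SEQ.fullmatch(seq):
            yield n, "unexpected character in sequence"
        if len(seq) != len(qual):
            yield n, "sequence and quality lengths differ"
        if any(not 33 <= ord(c) <= 126 for c in qual):
            yield n, "quality character outside printable ASCII"
    if record:
        yield n + 1, "truncated record at end of file"
```

Passing an open file handle, e.g. `list(check_fastq(open(path)))`, returns an empty list when nothing looks wrong.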

thanks in advance

raela 08-10-2010 04:49 AM

I've had sampe hang before when the pairs were not lined up correctly in the two files. Since splitting fixes your issue, this is probably not what's going on, but it doesn't hurt to check.

PeteH 08-10-2010 04:09 PM

I'm having similar problems, but with samse. The aln step works fine for me, but samse hangs for tens of hours and only writes "[bwa_read_seq] 0.0% bases are trimmed." to my *.sam file.

Prior to running BWA, my pipeline includes a Python script to convert the Illumina quality scores to Phred+33, and a Perl script to trim reads that contain adapter sequence. I have also tried it without running my Perl trimming script, but samse still hangs.
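The conversion script itself is not shown in the thread. As a rough illustration of what such a step typically does, here is a sketch that assumes the input uses the Phred+64 encoding of Illumina pipeline 1.3 and later. The older Solexa scale needs a different formula and is not handled; the function name is hypothetical.

```python
def phred64_to_phred33(qual):
    """Convert one quality string from Phred+64 to Phred+33.

    Assumption: input is Phred+64 (Illumina 1.3+), not Solexa-scaled.
    Raises ValueError on characters that cannot be Phred+64, which
    often means the file is already Phred+33.
    """
    converted = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"character {ch!r} is below the Phred+64 range")
        converted.append(chr(score + 33))
    return "".join(converted)
```

The ValueError is deliberate: running the conversion twice, or on data that is already Phred+33, would otherwise silently produce garbage qualities.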

Any advice on what else I can try to identify the problem is much appreciated.

krobison 08-10-2010 08:11 PM

Thanks for the feedback -- that does give me the idea of trying bwa samse on each original file to see if none, one or both of the paired end files causes trouble.

I'll also re-check that the two files have the same ids in the same order.
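A check along these lines can be sketched in a few lines of Python. It assumes four-line FASTQ records and read names that end in "/1" and "/2" for the two mates; both are assumptions about the files, and the function names are made up for illustration.

```python
from itertools import zip_longest

def read_ids(lines):
    """Yield the read name from each FASTQ header, minus a /1 or /2 suffix."""
    for i, line in enumerate(lines):
        if i % 4 != 0:
            continue
        fields = line.split()
        name = fields[0] if fields else ""
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name

def first_mismatch(lines1, lines2):
    """Return (record_number, id1, id2) at the first disagreement, else None.

    A missing record in the shorter file shows up as None for its id.
    """
    pairs = zip_longest(read_ids(lines1), read_ids(lines2))
    for n, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return n, a, b
    return None
```

Both arguments can be open file handles, so the two files are streamed rather than loaded into memory.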

THANKS!!

krobison 08-12-2010 03:36 AM

One more possible cookie crumb: the log files for hung runs generally end with either

[infer_isize] (25, 50, 75) percentile: (21322, 49144, 77320)
[infer_isize] low and high boundaries: 36 and 189316 for estimating avg and std
[infer_isize] inferred external isize from 21 pairs: 46521.000 +/- 26509.447
[infer_isize] skewness: 0.214; kurtosis: -0.983; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 207433 (6.07 sigma)
[bwa_sai2sam_pe_core] time elapses: 72.34 sec
[bwa_sai2sam_pe_core] changing coordinates of 3124 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...


OR

[infer_isize] fail to infer insert size: too few good pairs
[bwa_sai2sam_pe_core] time elapses: 77.51 sec
[bwa_sai2sam_pe_core] changing coordinates of 3054 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...


Which suggests that for some reason I have a large patch of sequences which don't align and are confusing the insert size calculator. However, it must be said that in failed runs there are spots like this that it gets through (but perhaps very slowly; I am letting a run go for several extra days over the weekend just to see if it ever exits).

Oddly, every time I've tried to split a file into two parts they both complete in reasonable time, even when the breakpoint is near where the full run fails.

I am curious why bwa sampe is recomputing the insert size distribution so many times -- it would be surprising if that varied through a run (but then again, I'm surprised to find such a big stretch of fragments that don't imply one). Perhaps failed infer_isize batches should cause the reuse of a previously computed batch?
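To make the reuse idea concrete, here is a toy sketch in Python. It is not bwa's code and does not claim to match its internals: the function names, the `min_pairs` threshold of 20, and the outlier rule are all assumptions. The 2 x IQR bound was chosen only because it is consistent with the log above, where 77320 + 2 x (77320 - 21322) = 189316.

```python
import statistics

def infer_isize(isizes, min_pairs=20, outlier_bound=2.0):
    """Return (mean, std) of the insert sizes, or None if too few pairs.

    Hypothetical rule: drop values more than outlier_bound * IQR
    outside the quartiles before averaging.
    """
    if len(isizes) < min_pairs:
        return None
    q1, _median, q3 = statistics.quantiles(isizes, n=4)
    low = q1 - outlier_bound * (q3 - q1)
    high = q3 + outlier_bound * (q3 - q1)
    kept = [x for x in isizes if low <= x <= high]
    if len(kept) < 2:
        return None
    return statistics.mean(kept), statistics.stdev(kept)

def estimates_with_fallback(batches):
    """Yield one estimate per batch, reusing the last good one on failure."""
    previous = None
    for batch in batches:
        estimate = infer_isize(batch)
        if estimate is None:
            estimate = previous  # fall back to the earlier batch
        else:
            previous = estimate
        yield estimate
```

If the very first batch fails there is nothing to reuse, so the sketch yields None for it; a real implementation would need some policy for that case.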

I think I'll gin up some courage soon to look at the source code & perhaps even try the above suggestion.

modi2020 02-12-2013 03:17 PM

Hi Krobison,

I am having exactly the same problem you were having.
Did you get to know how to solve it?

Thank you
Quote:

Originally Posted by krobison (Post 23428)

mediator 02-13-2013 12:57 PM

OP, bwa sampe is a very slow step. It took my cluster more than two days to convert the two sai files into one sam file. My data set is 100 million reads.


All times are GMT -8.