SEQanswers

Old 05-27-2011, 06:41 AM   #1
natpokah
Junior Member
 
Location: NYC

Join Date: May 2010
Posts: 5
bwa sampe very slow

We are trying to run sampe on .sai files generated from FASTQ files obtained from a HiSeq 2000. There are about 100 million paired-end 104 bp reads, and we have estimated the insert size at roughly 200 to 500 bp, depending on the library.

We are running the following command line:
bwa sampe -P -a 300 (or -a 600). We also tried -o values of 1, 100, and 100000.
However, bwa runs really slowly: so far it has run for 4 days and has generated only about 1 GB of output. bwa is installed on an HPC system that should have all the CPU and RAM it needs.
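For reference, the whole pipeline looks roughly like this (a sketch; the reference and file names here are placeholders, not our actual paths):

Code:
bwa aln -t 8 ref.fa reads_1.fastq > reads_1.sai
bwa aln -t 8 ref.fa reads_2.fastq > reads_2.sai
bwa sampe -P -a 300 ref.fa reads_1.sai reads_2.sai reads_1.fastq reads_2.fastq > aln.sam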
Any idea what is causing the slow behavior?
Old 05-27-2011, 10:05 AM   #2
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Unfortunately BWA will ignore your -a specification if it believes it can estimate a better insert size. Could you post some of the stderr output? My guess is there's a huge insert size estimate being used which will lead to a long lag during "align unmapped mates".

Try the -A (capital A) parameter. This disables Smith-Waterman mate rescuing and will likely speed up your sampe run.
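Something like this (a sketch; I'm assuming file names to match your description):

Code:
bwa sampe -A -P -a 300 ref.fa reads_1.sai reads_2.sai reads_1.fastq reads_2.fastq > aln.sam

If I remember correctly, -A disables the insert size estimation entirely and forces -s, so the -a value you pass is actually respected.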
Old 05-27-2011, 10:19 AM   #3
natpokah
Junior Member
 
Location: NYC

Join Date: May 2010
Posts: 5

Hi there!
Thanks for a quick reply!
You are totally right: the insert size estimate is huge, and sampe then gets stuck for hours trying to "align unmapped mate".
I will follow your advice and try the -A option.

Thanks!
Old 05-27-2011, 10:22 AM   #4
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Yeah, it's a weird feature. When you pass the -a parameter it should be clear that you want to override the program's estimation process, but for some reason -a only serves as a fallback value.
Old 05-27-2011, 10:27 AM   #5
natpokah
Junior Member
 
Location: NYC

Join Date: May 2010
Posts: 5

Hi again!
So, -A worked wonders! Thank you!
Would sorting the fastq files by read ID help speed things up even more?
Also, since I filter reads that have too many "N"s and I trim the adapters, one read might not have a corresponding mate in the second fastq. Would removing those orphaned reads help?
Thanks
Old 05-27-2011, 10:41 AM   #6
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Quote:
Would sorting the fastq files by read ID help speed things up even more?
As long as the nth reads in the two mate files are actually mates for all n, you're fine.

Quote:
Also, since I filter reads that have too many "N"s and I trim the adapters, one read might not have a corresponding mate in the second fastq. Would removing those orphaned reads help?
This is absolutely your problem. If you filter out a read that has too many Ns, you _need_ to remove its mate as well.

BWA assumes that your two mate files have an equal number of reads and that read N in one file is the mate of read N in the other file. If the files are not set up this way, it's no wonder you're getting such huge insert size estimates.

I'd be wary of preprocessing your FASTQ files for the above reasons. Additionally, BWA is not much affected by Ns in your input reads, since Ns are treated as mismatches and those reads are quickly thrown out anyway.
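If you want a quick sanity check that your two files are still in sync, something like this works (a sketch; it assumes the read ID is the first word of the header, with an optional /1 or /2 suffix, and made-up file names):

Code:
# read counts must match (4 lines per FASTQ record)
echo $(( $(wc -l < reads_1.fastq) / 4 )) $(( $(wc -l < reads_2.fastq) / 4 ))
# compare IDs pairwise; no output means the files are in sync
paste <(awk 'NR%4==1 {print $1}' reads_1.fastq) <(awk 'NR%4==1 {print $1}' reads_2.fastq) \
  | awk -F'\t' '{ sub(/\/[12]$/,"",$1); sub(/\/[12]$/,"",$2);
                  if ($1 != $2) { print "out of sync at record " NR; exit 1 } }'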
Old 05-27-2011, 10:52 AM   #7
natpokah
Junior Member
 
Location: NYC

Join Date: May 2010
Posts: 5

Yeah, the only reason to remove the reads with too many Ns before aligning was for our statistics down the line. We wanted a meaningful percentage of uniquely mapped reads, and since the fastq sometimes comes with a lot of junk, calculating the percentage over the whole set of reads would artificially decrease it.
We basically wanted to start "fresh" with "mappable" reads.
That will still be useful if we run samse on each end individually. But you are right: I will start from the raw fastq files, remove my adapters, and make sure I have the same number of reads in both files.
Thank you so much for your precious help!
Old 05-27-2011, 10:56 AM   #8
dp05yk
Member
 
Location: Brock University

Join Date: Dec 2010
Posts: 66

Yes - for samse it's not really a big deal: hack up your FASTQs! The problem with FASTQ modifications arises when you're pairing, because BWA takes the reads from each file in order and assumes that the nth read from file one is the mate of the nth read from file two.
Old 09-26-2011, 09:22 AM   #9
cbeck
Junior Member
 
Location: Montreal

Join Date: Sep 2011
Posts: 6

Heyo,
I think I am running into the same issue Nat had, on our indexed Illumina runs -

[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (10801, 27183, 54186)
[infer_isize] low and high boundaries: 76 and 140956 for estimating avg and std
[infer_isize] inferred external isize from 1185 pairs: 34258.846 +/- 27441.256
[infer_isize] skewness: 0.654; kurtosis: -0.699; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 200827 (6.07 sigma)

Those are some huge inserts. Since I am not doing any filtering of reads at this point, do you think my reads might be getting out of sync because they are demultiplexed? The only things I am doing to the qseq files are deindexing, running qseq2fastq.pl, and then running bwa aln. I'm trying -A right now to see whether it brings things down from the 5 days or so that each indexed sample was taking.
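While that runs, I'm at least checking that the two mate files have the same number of records (a sketch; the file names are made up):

Code:
# 4 lines per FASTQ record, so these two numbers should be identical
echo $(( $(wc -l < s_1_1.fastq) / 4 ))
echo $(( $(wc -l < s_1_2.fastq) / 4 ))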
Old 09-26-2011, 09:43 AM   #10
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

BWA isn't name-aware, so if the reads are out of parity during bwa sampe, it will try to model two partners that aren't a proper pair. Those reads most likely don't map to the same neighborhood in the correct orientation, so bwa falls back to the Smith-Waterman local search, which is computationally very expensive.
Old 09-26-2011, 09:49 AM   #11
cbeck
Junior Member
 
Location: Montreal

Join Date: Sep 2011
Posts: 6

Hi rskr,
Can you think of a reason why the partners wouldn't be pairs?
Old 09-26-2011, 09:57 AM   #12
cbeck
Junior Member
 
Location: Montreal

Join Date: Sep 2011
Posts: 6

Err, rather: since they aren't, and since it is Illumina's fault, would you recommend running Picard's FastqToSam to sort the reads and then SamToFastq?
Old 09-26-2011, 09:59 AM   #13
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250

Quote:
Originally Posted by cbeck
Hi rskr,
Can you think of a reason why the partners wouldn't be pairs?
I have seen it happen when one read of a pair was quality-filtered but the other was not; the missing read just gets replaced by whatever was next in the file, so the pairs no longer match:

1.1 1.2
2.1 2.2
3.1 3.2
4.1 5.2 <--4.2 was omitted, they are no longer in parity.
5.1 6.2
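If that's what happened and the surviving reads are still in their original order, something along these lines can restore parity (a rough sketch with standard Unix tools; it assumes old-style @name/1 and @name/2 headers and made-up file names - dedicated re-pairing tools exist too):

Code:
# 1. find the IDs present in both files
awk 'NR%4==1 { id=$1; sub(/^@/,"",id); sub(/\/[12]$/,"",id); print id }' reads_1.fastq | sort > ids1
awk 'NR%4==1 { id=$1; sub(/^@/,"",id); sub(/\/[12]$/,"",id); print id }' reads_2.fastq | sort > ids2
comm -12 ids1 ids2 > keep
# 2. keep only records whose ID occurs in both files (repeat for reads_2.fastq)
awk 'NR==FNR { keep[$1]; next }
     FNR%4==1 { id=$1; sub(/^@/,"",id); sub(/\/[12]$/,"",id); p=(id in keep) }
     p' keep reads_1.fastq > reads_1.paired.fastq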
Old 06-08-2012, 08:45 AM   #14
naxin
Junior Member
 
Location: USA

Join Date: Jul 2010
Posts: 6

Could you tell me how to use the -A option? It said invalid option -- 'A'. Many thanks.
Old 06-08-2012, 01:26 PM   #15
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

The other possibility: double-check that your bwa aln command line was right. If you accidentally put a typo in one of your fastq names and one fastq doesn't actually get aligned, sampe proceeds anyway, and it returns crazy large insert sizes. So try running samse on each of your individual fastqs: you want to know that they are working, and you want to know whether the two files are in sync.
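Something like this for each end (a sketch; file names assumed):

Code:
bwa aln ref.fa reads_1.fastq > reads_1.sai
bwa samse ref.fa reads_1.sai reads_1.fastq > reads_1.sam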
Old 06-08-2012, 01:51 PM   #16
naxin
Junior Member
 
Location: USA

Join Date: Jul 2010
Posts: 6

Quote:
Originally Posted by swbarnes2
The other possibility: double-check that your bwa aln command line was right. If you accidentally put a typo in one of your fastq names and one fastq doesn't actually get aligned, sampe proceeds anyway, and it returns crazy large insert sizes. So try running samse on each of your individual fastqs: you want to know that they are working, and you want to know whether the two files are in sync.
Thanks a lot!

I am trying it right now.
Old 06-09-2012, 10:05 AM   #17
swNGS
Member
 
Location: SW UK

Join Date: Nov 2011
Posts: 83

I had a very similar problem, which was very helpfully fixed using Trimmomatic and TrimGalore, as detailed in this thread:
http://seqanswers.com/forums/showthread.php?t=19874
The author of TrimGalore was particularly accommodating in modifying the script to allow different trimming of R1 and R2.
Old 06-07-2013, 08:10 AM   #18
bwubb
Member
 
Location: Philadelphia

Join Date: Jan 2012
Posts: 55

Does discarding the insert size estimate affect anything downstream - the read data, the mapping qualities, or any eventual variant calls?

I am trying to decide whether I should use the -A option for all of my data, or whether there is a way to determine ahead of time that a given sampe run will take forever and that -A should be used.
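One idea I've been toying with (an untested sketch; file names made up) is to run sampe on a small subsample first and look at the inferred insert size in the stderr output:

Code:
head -n 400000 reads_1.fastq > test_1.fastq   # first 100,000 records (4 lines each)
head -n 400000 reads_2.fastq > test_2.fastq
bwa aln ref.fa test_1.fastq > test_1.sai
bwa aln ref.fa test_2.fastq > test_2.sai
bwa sampe ref.fa test_1.sai test_2.sai test_1.fastq test_2.fastq > /dev/null 2> test.log
grep infer_isize test.log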

Thanks.
Old 07-10-2013, 11:10 AM   #19
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43

Quote:
Originally Posted by rskr
I have seen it happen when one read of a pair was quality-filtered but the other was not; the missing read just gets replaced by whatever was next in the file, so the pairs no longer match:

1.1 1.2
2.1 2.2
3.1 3.2
4.1 5.2 <--4.2 was omitted, they are no longer in parity.
5.1 6.2

I have a question about using the -A option in the case above. If the reads are out of sync, as between 4.1 and 5.2, bwa will not perform SW on the unmapped mate. What happens after that? Will 5.1 and 6.2 also be thrown away because they do not match, and so on? I guess what I am asking is: is it dangerous to use -A and force bwa to throw away unmatched pairs? Are we losing important data by doing this? And does the mismatch carry on to all the reads after it?
Old 07-10-2013, 01:21 PM   #20
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

-A should really only be used if you know that your files are lined up right, and you know that the insert sizes won't properly match what bwa is expecting.

Fix your fastqs. You can pull out the singletons, align them separately, then combine the bams.
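Roughly like this (a sketch, assuming you've already re-paired the files and pulled the singletons out into their own fastq; this uses the old samtools 0.1.x sort/merge syntax):

Code:
bwa sampe ref.fa r1.sai r2.sai reads_1.paired.fastq reads_2.paired.fastq | samtools view -bS - | samtools sort - paired
bwa samse ref.fa s.sai singletons.fastq | samtools view -bS - | samtools sort - singletons
samtools merge combined.bam paired.bam singletons.bam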