**dp05yk** · 05-27-2011, 10:05 AM

Unfortunately BWA will ignore your -a specification if it believes it can estimate a better insert size. Could you post some of the stderr output? My guess is there's a huge insert size estimate being used which will lead to a long lag during "align unmapped mates".

Try the -A (capital A) parameter. This disables Smith-Waterman mate rescuing and will likely speed up your sampe run.

**natpokah** · 05-27-2011, 10:19 AM

Hi there!
Thanks for a quick reply!
You are totally right, the insert size estimate is huge and then sampe gets stuck for hours trying to "align unmapped mate".
I will follow your advice and try the -A option.

Thanks!

**dp05yk** · 05-27-2011, 10:22 AM

Yeah, it's a weird feature. Using the -a parameter, it should be clear that you want to override the program's estimating process, but for some reason the -a parameter becomes a fallback value.

**natpokah** · 05-27-2011, 10:27 AM

Hi again!
So, -A worked wonders! Thank you!
Would the fact of sorting the fastq by ID name help speed up even more?
Also, since I filter reads that have too many "N" and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?
Thanks

**dp05yk** · 05-27-2011, 10:41 AM

> Would the fact of sorting the fastq by ID name help speed up even more?

As long as the nth reads in the two mate files are actually mates for all n, you're fine.

> Also, since I filter reads that have too many "N" and I trim the adaptors, one read might not have a corresponding mate in the 2nd fastq. Would removing those reads help?

This is absolutely your problem. If you filter one read that has too many N's, you _need_ to also remove its mate.

BWA assumes your two mate files have an equal number of reads and that read N in one file is the mate of read N in the other file. If the files are not set up this way, it's no wonder you're getting such huge insert size estimates.

I'd be wary of preprocessing your FASTQ files for the above reasons - additionally, BWA will not be affected by many Ns in your input reads since Ns are treated as mismatches and these reads will be quickly thrown out anyways.
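If you do filter anyway, the rule above (drop a read and its mate together) could be sketched roughly as follows. This is only an illustration, assuming plain four-line FASTQ records; `read_fastq`, `filter_pairs` and the `max_n` threshold are made-up names for this example and are not part of BWA.

```python
from itertools import islice, zip_longest


def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def filter_pairs(reads1, reads2, max_n=5):
    """Keep a pair only if BOTH mates have at most max_n N bases.

    max_n is an arbitrary illustrative threshold. The two inputs are
    walked in lockstep, so the outputs stay in the same order and have
    the same number of reads.
    """
    missing = object()
    for r1, r2 in zip_longest(reads1, reads2, fillvalue=missing):
        if r1 is missing or r2 is missing:
            raise ValueError("mate files have different numbers of reads")
        n1 = r1[1].upper().count("N")
        n2 = r2[1].upper().count("N")
        if n1 <= max_n and n2 <= max_n:
            yield r1, r2
```

With real files you would pass two open file handles to `read_fastq` and write each kept mate to its own output file, so both outputs end up with identical read counts.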

**natpokah** · 05-27-2011, 10:52 AM

Yeah, the only reason to remove the reads with too many N before aligning was for our statistics down the line. We wanted to have a meaningful % of unique matches. And since the fastq sometimes comes with a lot of junk, calculating the % based on the whole set of reads would artificially decrease the % of unique matches.
We basically wanted to start "fresh" with "mappable" reads.
This is going to be useful if we run samse on each end individually. But you are right, I will start from the raw fastq, remove my adapters and make sure I have the same number of reads in both fastq.
Thank you so much for your precious help!

**dp05yk** · 05-27-2011, 10:56 AM

Yes - for samse, it's not really a big deal, hack up your FASTQs! The problems with FASTQ modifications arise when you're pairing, because BWA takes reads from the two files in order and assumes that the nth read from file one is the mate of the nth read from file two.

**cbeck** · 09-26-2011, 09:22 AM

Heyo,
I guess I am getting the same issues as Nat was on our Illumina indexed runs -

[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (10801, 27183, 54186)
[infer_isize] low and high boundaries: 76 and 140956 for estimating avg and std
[infer_isize] inferred external isize from 1185 pairs: 34258.846 +/- 27441.256
[infer_isize] skewness: 0.654; kurtosis: -0.699; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 200827 (6.07 sigma)

Those are some huge inserts. So since I am not doing any filtering of reads at this point - do you think my reads might be getting out-of-sync because they are demultiplexed? The only thing I am doing to the qseq files is deindexing, running qseq2fastq.pl and then running bwa aln. I'm using -A as a test right now to see if it brings things down from the 5 days or so each indexed sample was taking.
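A rough way to flag an estimate like this automatically is to scan the sampe stderr for the `[infer_isize] inferred external isize` line. The sketch below assumes the log format is exactly as pasted above; `parse_isize`, `suspicious` and the `expected_max` cutoff of 1000 bp are hypothetical and should be adjusted to your library.

```python
import re

# Matches lines such as:
# [infer_isize] inferred external isize from 1185 pairs: 34258.846 +/- 27441.256
ISIZE_RE = re.compile(
    r"\[infer_isize\] inferred external isize from (\d+) pairs: "
    r"([\d.]+) \+/- ([\d.]+)"
)


def parse_isize(stderr_text):
    """Return a list of (n_pairs, mean, std) for every matching log line."""
    return [
        (int(n), float(mean), float(std))
        for n, mean, std in ISIZE_RE.findall(stderr_text)
    ]


def suspicious(estimates, expected_max=1000):
    """Return the estimates whose mean exceeds expected_max.

    expected_max is an assumed cutoff, not a BWA default.
    """
    return [e for e in estimates if e[1] > expected_max]
```

Feeding it the log above would report the 34258.846 mean as suspicious, which is a hint that the two files may be out of sync rather than that the library really has inserts that large.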

**rskr** · 09-26-2011, 09:43 AM

BWA isn't name aware, so if the reads are out of parity during bwa sampe, it will try to model two partners that aren't proper pairs. They most likely aren't in the same neighborhood or in the correct orientation, so it falls back to the Smith-Waterman local search, which is computationally very expensive.

**cbeck** · 09-26-2011, 09:49 AM

Hi rskr,
Can you think of a reason why the partners wouldn't be pairs?

**cbeck** · 09-26-2011, 09:57 AM

Err, rather, since they aren't and since it is Illumina's fault, would you recommend running Picard's FastqToSam to sort and then SamToFastq?

**rskr** · 09-26-2011, 09:59 AM

Originally posted by cbeck:

> Hi rskr,
> Can you think of a reason why the partners wouldn't be pairs?

I have seen it when one read of a pair was quality filtered but the other was not. The filtered read then gets replaced by whatever came next in the file, so the two no longer match.

1.1 1.2
2.1 2.2
3.1 3.2
4.1 5.2 <--4.2 was omitted, they are no longer in parity.
5.1 6.2
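Since BWA won't check names for you, one way to find where two files fall out of parity is to compare read names position by position. This is a minimal sketch, assuming mates share a name that differs only in a trailing `/1` or `/2` (other naming schemes would need a different `base_name`); both function names are made up for the example.

```python
from itertools import zip_longest


def base_name(header):
    """Strip the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    fields = header.lstrip("@").split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(headers1, headers2):
    """Return the 1-based position of the first pair whose names disagree.

    Returns None when every position matches. A missing read in either
    file (different lengths) also counts as a mismatch.
    """
    pairs = zip_longest(headers1, headers2)
    for i, (h1, h2) in enumerate(pairs, start=1):
        if h1 is None or h2 is None or base_name(h1) != base_name(h2):
            return i
    return None
```

For the example above, with headers `@1/1 ... @5/1` in the first file and `@1/2, @2/2, @3/2, @5/2, @6/2` in the second, `first_mismatch` points at position 4, where 4.2 went missing.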

**naxin** · 06-08-2012, 08:45 AM

Could you tell me how to use the -A option? It said: invalid option -- 'A'. Many thanks.

**swbarnes2** · 06-08-2012, 01:26 PM

The other possibility: double-check that your bwa aln command line was right. If you accidentally put a typo in one of your fastq names and one fastq doesn't actually get aligned, sampe proceeds anyway and returns crazy large insert sizes. So try running samse on each of your individual fastqs. You want to know that they are working, and you want to know if the two files are in sync.

bwa sampe very slow

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News