**jlmlj** · 02-05-2010, 12:55 PM

Hi all,

Although nobody has replied to my post yet, I'd like to share some test results from running BWA with different parameters. Maybe they will be helpful to somebody, or somebody can help me make sense of them.

The purpose of my testing is to allow more mismatches and see whether I can get more alignments (particularly alignments in repeats) against the human reference genome. I tried 6 different parameter combinations in BWA and, to my surprise, got very similar results each time: 49% unique alignments, ~4% multiple alignments, and about 47% of reads failing to align.

The combinations I used for the tests are below:
-M 1
-k 6
-k6 -l32 -m1
-n6 -l32 -m1
-l32 -k20 -m1 (for this test I wanted to push -k to an extreme to see what would happen; however, nothing changed)

I took a look at the unaligned reads. Some could be aligned by BLAT, although others could not. Some of the ones that BLAT could align have repeat markers. It seems I did lose some true alignments. I am wondering why I could not get these true alignments using BWA… Any help would be appreciated if you have a clue!

**lh3** · 02-05-2010, 01:13 PM

try

bwa aln -n 7 -l 1000000

This will be very slow.

**jlmlj** · 02-05-2010, 01:26 PM

Originally posted by lh3

try

bwa aln -n 7 -l 1000000

This will be very slow.

Thank you so much, I'm very excited to get feedback from the author of this beautiful software!

I am going to try it now.

I know -n is the max number of differences (mismatches + gaps) over the whole read length, and -l takes the first INT bases as the seed. However, why did you set INT for -l so large, like "1000000"? Thanks in advance for the explanation!

updates:
I have run your parameters for 20 minutes and progress seems very slow: it is still on the first step:
[bwa_aln_core] calculate SA coordinate... (I only have 1 line of progress output)
It has also used up all 30 nodes on our cluster, so I am wondering whether it is possible to decrease the value for -l a bit...
Thanks!

**lh3** · 02-05-2010, 07:36 PM

-l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.
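
A rough sketch of how the -n and -l settings discussed here interact, for readers following along. It assumes the behaviour described in the bwa manual: an integer -n is used directly as the maximum edit distance, a fractional -n (default 0.04) is converted into a per-read-length limit assuming a 2% uniform base error rate, and a seed length larger than the read disables seeding. Modelling that conversion as a Poisson tail is my assumption, and the function names `default_max_diff` and `effective_limits` are hypothetical helpers, not part of bwa.

```python
import math

def default_max_diff(read_len, err=0.02, thres=0.04):
    """Smallest k with P(Poisson(read_len * err) > k) < thres.

    Assumption: this mirrors how a fractional -n becomes an edit-distance cap.
    """
    lam = read_len * err
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for k in range(1, 1000):
        term *= lam / k        # P(X = k)
        cdf += term
        if 1.0 - cdf < thres:
            return k
    return 2

def effective_limits(read_len, n=0.04, l=32, k=2):
    """Summarise the limits implied by -n, -l and -k for one read length."""
    # Integer -n is taken as-is; fractional -n goes through the Poisson model.
    max_diff = n if isinstance(n, int) else default_max_diff(read_len, thres=n)
    # Assumption: a seed longer than the read means no seeding at all.
    seeding = l < read_len
    return {
        "max_diff": max_diff,
        "seeding": seeding,
        "seed_max_diff": k if seeding else None,
    }

print(effective_limits(75))                    # default settings
print(effective_limits(75, n=7, l=1000000))    # -n 7 -l 1000000
print(effective_limits(75, l=32, k=6))         # -k 6 with default -n
```

Under these assumptions a 75 bp read is capped at 4 differences with the default -n, regardless of -k, which may be part of why raising -k alone changed so little.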

**jlmlj** · 02-08-2010, 08:52 AM

Originally posted by lh3

-l 10000 effectively disables seeding. You may try "aln -n 5". But for reads with low quality, bwa may be very slow. Its algorithm is not designed for this case.

Hi lh3,

Thank you very much for the reply! So in this test, with seeding disabled, BWA allows 7 mismatches over the full 75-base read length, even in the low-quality bases. Am I correct?

The test has finished; it took ~49 hours on a 30-node cluster. However, the results are still very similar to what I had in the previous tests, meaning 48% of reads failed to align to anything in the human reference genome. (I counted "XT:A:U" as unique matches and "XT:A:R" as repeat matches in the output SAM files.)

The results confuse me a lot: we should have many more repeat matches in the human genome. I am trying to figure out what the unaligned reads are. Any suggestion would be very much appreciated!
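
The counting described above (XT:A:U as unique, XT:A:R as repeat) can be sketched in a few lines. This is a minimal illustration, assuming tab-separated SAM records from bwa with the XT tag in the optional fields and the 0x4 flag bit marking unmapped reads. The function names and the three example records are made up for the demonstration.

```python
from collections import Counter

def classify_sam_line(line):
    """Return 'unmapped', the XT value (e.g. 'U' or 'R'), or 'no_XT'."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:                      # SAM flag bit for an unmapped read
        return "unmapped"
    for tag in fields[11:]:             # optional fields start at column 12
        if tag.startswith("XT:A:"):
            return tag[5:]
    return "no_XT"

def summarize(lines):
    """Count alignment classes over SAM lines, skipping '@' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        counts[classify_sam_line(line)] += 1
    return counts

# Made-up records, for illustration only.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII\tXT:A:U\tNM:i:0",
    "r2\t16\tchr1\t200\t0\t5M\t*\t0\t0\tACGTA\tIIIII\tXT:A:R\tNM:i:1",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
print(summarize(example))
```

Tallying the unmapped class separately makes it easier to pull those reads out afterwards and test them individually.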

**davetang** · 08-24-2010, 03:01 AM

Dear jlmlj,

I used the parameters (bwa aln -n 7 -l 1000000) and I was able to align a read that had 5 mismatches to the reference. Running bwa with the default settings didn't report this alignment. So perhaps you can try taking one or two individual unaligned reads and running your tests again? Just a suggestion, if you haven't already done this.

As a more general note, I'm new to next-gen sequencing so I'd just like to point out something I found out. When I was looking at the SAM file for this alignment, the CIGAR string was 27M, and that looked like a mistake to me because I knew there were mismatches in the alignment. So I looked up the documentation and found out that "M" can be a sequence match or a mismatch. It wasn't intuitive to me, so I just thought I'd point it out.

Cheers,

Dave
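
To illustrate the point about "M" in the CIGAR string: mismatches are not visible in a CIGAR like 27M, but they can be recovered from the MD tag when the aligner writes one. The sketch below assumes an MD:Z value in the standard SAM format. The helper names and the MD string are invented for the example and are not taken from the alignment described above.

```python
import re

def cigar_ops(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def mismatches_from_md(md):
    """Count mismatched bases in an MD tag value, ignoring deletions."""
    # Deleted reference bases are written as ^ACGT; drop them first.
    without_deletions = re.sub(r"\^[A-Za-z]+", "", md)
    # Every remaining letter is a reference base at a mismatch position.
    return len(re.findall(r"[A-Za-z]", without_deletions))

# Hypothetical 27-base alignment: the CIGAR shows only "M" ...
print(cigar_ops("27M"))
# ... while the MD tag reveals five mismatching positions.
print(mismatches_from_md("5A3C2G4T6A2"))
```

The NM tag gives the total edit distance, so the two values agree only when the alignment has no insertions or deletions.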

How BWA handles mismatches?