**Chipper** · 08-02-2009, 12:16 AM

Hi,

Probably it is because you did not use the -a bwtsw option. According to the manual (bwa.1), it is needed for the human genome:

bwtsw Algorithm implemented in BWT-SW. This is the only method that works with the whole human genome. However, this module does not work with database smaller than 10MB and it is much slower than the other two. Bwtsw algorithm trades speed for memory.
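As a rough sketch of the rule quoted above, the choice of indexing algorithm can be keyed on the size of the reference. The helper name `bwa_index_command` and the use of file size in bytes as a stand-in for database size are assumptions made for illustration; the 10MB figure comes from the manual excerpt, and `-c` is the color-space flag used elsewhere in this thread.

```python
import os

# Threshold taken from the manual excerpt above: bwtsw does not work
# with databases smaller than 10MB. Treating file size in bytes as the
# database size is an assumption of this sketch.
BWTSW_MIN_BYTES = 10 * 1000 * 1000


def bwa_index_command(fasta_path, color_space=False):
    """Return an argument list for `bwa index` (hypothetical helper)."""
    cmd = ["bwa", "index"]
    if os.path.getsize(fasta_path) >= BWTSW_MIN_BYTES:
        # Use bwtsw whenever the reference is large enough for it;
        # per the manual excerpt it is required for the whole human genome.
        cmd += ["-a", "bwtsw"]
    # Otherwise leave -a unset so bwa falls back to its default algorithm.
    if color_space:
        cmd.append("-c")
    cmd.append(fasta_path)
    return cmd
```

The returned list could be passed to `subprocess.run(..., check=True)`, assuming `bwa` is on the PATH and the installed version still accepts `-c`.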

**totalnew** · 08-04-2009, 08:23 AM

Hello, Chipper

You are right, I am only able to index the whole human genome with bwtsw. I guess bwa might not be competitive for SOLiD color-space data.

Thanks

**Chipper** · 08-04-2009, 11:54 AM

Sorry, I don't follow. Why would it not be competitive for SOLiD data? It takes some time to build the index, but once you have the index it is really fast.

**totalnew** · 08-04-2009, 12:38 PM

Sorry, I think my reply was misleading. What I meant was that bwa couldn't index the whole human genome in color space, because bwtsw is the only way to do so. I can use -c for a smaller genome, like chr1, etc.

Since some pipelines are using SOLiD data, I was thinking of generating the human genome index in color space; as you mentioned, it is going to be fast once I have all those index files. So if I now want to align color-space data to human.fasta, would I have to pick another aligner?

thanks for your reply

**nilshomer** · 08-04-2009, 12:52 PM

I was able to index the entire human genome with BWA (bwtsw), so it is possible. I would certainly like to hear about your experiences with longer SOLiD read lengths (50 and 75bp) and BWA. I have not gotten it to run as fast as other methods, especially when I try to allow higher error tolerances (>10% color errors, and long indels).

**Chipper** · 08-04-2009, 01:30 PM

Nils,

I am testing it with a 50 bp dataset (23.6 M reads). As expected, aln without indels and with few mismatches is very fast. The 2 MM run was done after ~15 minutes with 4 threads. 4 MM is probably ~10x slower, but I like the option to allow more mismatches at the end (with a good seed), which should make it much faster. It would be nice to compare it to BFAST if I ever manage to build that index...

Any ideas on how to set up an ideal aligner comparison test for SOLiD data?

**nilshomer** · 08-04-2009, 01:38 PM

Creating some simulations is the best bet. I would create a dataset composed of sets of 10K reads, each with X SNPs, Y color errors, and a Z-base-long indel. You can then vary X, Y, and Z to see what power you really have to detect variants and how robust you are to errors (a 10% color error rate is not unheard of). This is what I did with BFAST, which has its own synthetic read generator, when I compared it to other aligners.

I have found it takes about 6 hours to build one BFAST index on a 32GB quad-core machine. Like BWA, this needs to be done only once per reference (save those indexes!). The BWA index I built did not take too long to build either.
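The simulation design described above could be sketched roughly as follows. This is not BFAST's read generator; the function names, the choice to model the indel only as a deletion placed mid-read, and the parameter names are all assumptions for illustration. The two-base encoding itself (each color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3) is the standard SOLiD scheme.

```python
import random

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}


def to_colors(seq):
    """Two-base encoding: each color is the XOR of adjacent 2-bit base codes."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def simulate_read(ref, start, length, n_snps, n_color_errors, deletion_len, rng):
    """Return (colors, snp_positions, error_positions) for one synthetic read.

    The read has `length` colors. When deletion_len > 0, that many reference
    bases are deleted in the middle of the read (insertions are not modeled).
    """
    n_bases = length + 1  # `length` colors need `length + 1` bases
    frag = list(ref[start:start + n_bases + deletion_len])
    if deletion_len:
        mid = n_bases // 2
        del frag[mid:mid + deletion_len]  # the Z-base deletion
    snp_pos = rng.sample(range(len(frag)), n_snps)
    for p in snp_pos:  # X SNPs, applied in base space
        frag[p] = rng.choice([b for b in BASES if b != frag[p]])
    colors = to_colors("".join(frag))
    err_pos = rng.sample(range(len(colors)), n_color_errors)
    for p in err_pos:  # Y color errors, applied in color space
        colors[p] = rng.choice([c for c in range(4) if c != colors[p]])
    return colors, sorted(snp_pos), sorted(err_pos)
```

Calling `simulate_read` 10,000 times per (X, Y, Z) combination with a seeded `random.Random` gives reproducible sets. Note that an interior SNP changes two adjacent colors while a color error changes only one, which is what lets color-space aligners tell the two apart.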

**Chipper** · 08-04-2009, 02:15 PM

Mapping with 6 MM with 2 in the seed (25 bp) is ~ 3x faster than with 4 MM in the full sequence. Will try with some recent datasets tomorrow.
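For reference, the two settings compared above might be expressed with `bwa aln` options as in the sketch below. Mapping "6 MM with 2 in the seed (25 bp)" onto `-n 6 -k 2 -l 25` is an interpretation of the post, not something stated in it; `-n` limits differences in general rather than mismatches only, and the helper name and file names are placeholders.

```python
def bwa_aln_command(index_prefix, fastq, max_diff, seed_len=None,
                    max_seed_diff=None, threads=1, color_space=True):
    """Build an argument list for `bwa aln` (hypothetical helper)."""
    cmd = ["bwa", "aln", "-n", str(max_diff), "-t", str(threads)]
    if seed_len is not None:
        cmd += ["-l", str(seed_len)]       # seed length
    if max_seed_diff is not None:
        cmd += ["-k", str(max_seed_diff)]  # differences allowed in the seed
    if color_space:
        cmd.append("-c")                   # reads are in color space
    return cmd + [index_prefix, fastq]


# Seeded run: up to 6 differences overall, 2 within a 25 bp seed.
seeded = bwa_aln_command("hg18.fa", "reads.fastq", 6,
                         seed_len=25, max_seed_diff=2, threads=4)
# Unseeded run: up to 4 differences across the full read.
full = bwa_aln_command("hg18.fa", "reads.fastq", 4, threads=4)
print(" ".join(seeded))
print(" ".join(full))
```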

**pliang** · 08-27-2009, 02:08 PM

BWA: getting sequences like "GNNNNNNNNNNNNNNNNNNNNNNNNN" from SOLiD color space reads

Hello there,

I am new to this forum, but glad to see so many great discussions going on.

In the past, I have mainly been using MAQ to analyze Solexa data. A few days ago, I started trying to use BWA to analyze SOLiD data, partly because of its claimed speed and partly because of some problems I ran into when using MAQ for SOLiD data. I am running into a problem, described below, and wonder if I can get some help from you experts. Thanks very much in advance! (Sorry for the long message as my first post, but I think it is necessary for you to understand the problem.)

#Problem:
#I used the following commands to try to map paired-end SOLiD data in fastq format, downloaded directly from the 1000 Genomes Project site:

bwa index -a bwtsw -c hg18.fa &

[bwt_gen] Finished constructing BWT in 314 iterations.
[bwa_index] 2054.17 seconds elapse.
#This seems to work fine

bwa aln -c hg18.fa SRR003188_1.fastq >SRR003188_1.sai
bwa aln -c hg18.fa SRR003188_2.fastq >SRR003188_2.sai

#these are the files generated including the original read files:
-rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:47 SRR003188_1.fastq
-rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:19 SRR003188_1.sai
-rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:49 SRR003188_2.fastq
-rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:20 SRR003188_2.sai

bwa sampe -a 2400 hg18.fa SRR003188_1.sai SRR003188_2.sai SRR003188_1.fastq SRR003188_2.fastq >SRR003188.sam
#message
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs
[bwa_sai2sam_pe_core] time elapses: 3.11 sec
[bwa_sai2sam_pe_core] change of coordinates in 0 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_sai2sam_pe_core] time elapses: 0.71 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 1.58 sec
[bwa_sai2sam_pe_core] print alignments... 0.43 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs

#when open the .sam file, it looks like this:
VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 77 * 0 0 * * 0 0 GNNNNNNNNNNNNNNNNNNNNNNNNN !611%%(-+%*.&*.,&2,,'%()31
VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 141 * 0 0 * * 0 0 TNNNNNNNNNNNNNNNNNNNNNNNNN !1:7%6);%.1/<%&717'/'7:
.....

#This was the same when I ran samse with a single input. It looks to me like the color space didn't get converted properly, so no match is found. Also, the time used by aln and sampe/samse seems too short to me.
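The FLAG values in the SAM lines above (77 and 141) can be decoded to confirm that neither mate was aligned. The bit meanings below are from the SAM specification; the short names and the helper are just for this sketch.

```python
# SAM FLAG bits (values from the SAM specification; names abbreviated here).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(77))   # ['paired', 'unmapped', 'mate_unmapped', 'first_in_pair']
print(decode_flag(141))  # ['paired', 'unmapped', 'mate_unmapped', 'second_in_pair']
```

Both records are therefore unmapped pairs, which is consistent with the `*` reference name and the "too few good pairs" message.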

**Chipper** · 08-27-2009, 10:38 PM

Hi,

bwa uses FASTQ files with colors represented as ACGT; perhaps the 1000 Genomes FASTQ files represent them as 0123? Also, bwa does not use the first color, so you may have to strip it or use the solid2fastq script.
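A quick way to check which representation a FASTQ file uses is to look at the sequence lines of the first few records. The sketch below assumes plain four-line FASTQ records; the function names and the returned labels are made up for illustration.

```python
import io
from itertools import islice


def read_encoding(seq):
    """Classify one FASTQ sequence line (hypothetical helper)."""
    if not seq:
        return "unknown"
    # Ignore a leading primer base such as the 'T' in 'T0031...'.
    body = seq[1:] if seq[0] in "ACGTacgt" else seq
    if body and set(body) <= set("0123."):
        return "numeric-colors"
    if set(seq.upper()) <= set("ACGTN"):
        return "letters"
    return "unknown"


def fastq_encoding(handle, n_records=1000):
    """Return the set of encodings seen in the first n_records records."""
    seen = set()
    for i, line in enumerate(islice(handle, 4 * n_records)):
        if i % 4 == 1:  # second line of each 4-line record is the sequence
            seen.add(read_encoding(line.strip()))
    return seen


sample = io.StringIO("@r1\nT0123.\n+\n!!!!!!\n")
print(fastq_encoding(sample))  # {'numeric-colors'}
```

A result of `numeric-colors` means the reads need converting to the letter representation before running `bwa aln -c`.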

**pliang** · 08-28-2009, 06:51 AM

Hi Chipper, thanks for your response. Yes, you are right about the fastq files from the 1000 Genomes Project. I didn't know that bwa uses only nucleotide letters. I will try what you suggested and see how it goes. Thanks again.

**KevinLam** · 12-21-2009, 12:28 AM

Originally posted by totalnew

I would like to build a color-space index with bwa. The input FASTA should be in nucleotide space, so I use the following command to index the whole human genome:

>bwa index -c human.fasta

But a segmentation fault occurs every time, like this:

[bwa_index] Pack nucleotide FASTA... 60.48 sec
[bwa_index] Convert nucleotide PAC to color PAC... 31.13 sec
[bwa_index] Reverse the packed sequence... 16.62 sec
[bwa_index] Construct BWT for the packed sequence...
Segmentation fault

Can anyone tell me why that happens?

thanks

Are there prebuilt indexes for BWA, as there are for bowtie?
ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/

**jnfass** · 02-03-2010, 03:03 PM

Though this is an old thread, it might be important to clarify: are you referring to another tool called 'bwtsw', separate from bwa? Chipper was referring to the bwtsw indexing option of the 'bwa index' command.

**bgulko** · 03-08-2010, 12:55 PM

Converting NCBI colorspace fastq to BWA Colorspace Fastq.

I have a project that involves aligning SOLiD data to hg18. The short reads (both paired-end and single-end) are provided in a FASTQ file that looks like this:

Code:

@SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50
T003.......0230..0.0.....220..2.010.301...321..111.
+SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50
!%9#!!!!!!!#-$1!!2!%!!!!!%)&!!(!*,#!$2'!!!)/+!!%2,!

Clearly this is colorspace data, and I'd like to use BWA to align it (I already have a suite of tools compatible with BWA, and this is near the end of the project, so I don't really want to switch).

The solid2fastq.pl script most often referenced in BWA literature seems to require color-space data in a different format (multiple files, separate quality and color data, perhaps different quality score scaling, etc.).

Can anyone provide some pointers on how to convert this colorspace FASTQ file into one that is compatible with BWA's colorspace aligner? (Presumably BWA's colorspace format represents colors using nucleotide letters, as opposed to converting the colorspace reads to actual nucleotides.)

I want to make sure that I have:

the correct colorspace encoding (0->A, 1->C, 2->G, 3->T, *->N)
the correct quality score mapping and representation
a format that allows paired reads to be treated correctly by BWA

Many thanks,
--Brad

bwa color-space index