![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
bwa:how to align color space reads | SOLiDance | Bioinformatics | 13 | 01-08-2013 06:27 AM |
bwa color space error | rcorbett | Bioinformatics | 6 | 06-24-2011 04:21 AM |
Solid formats translator(base space/color space/double encoded) | AronaldJ | SOLiD | 0 | 10-26-2010 01:10 AM |
bwa question: quality discrepancy between a color-space alignment and its csfastq | yenhuahuang1 | Bioinformatics | 4 | 03-15-2010 07:23 AM |
direct mapping of color-space data against color-space | begsch | SOLiD | 1 | 09-09-2009 10:25 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Canada Join Date: Apr 2009
Posts: 46
|
![]()
I like to build color-space indexing by bwa. The input fast should be in nucleotide space, so I use following command to index whole human genome:
>bwa index -c human.fasta But segmentation fault occurred everytime like this, [bwa_index] Pack nucleotide FASTA... 60.48 sec [bwa_index] Convert nucleotide PAC to color PAC... 31.13 sec [bwa_index] Reverse the packed sequence... 16.62 sec [bwa_index] Construct BWT for the packed sequence... Segmentation fault Can anyone tell me why that happen? thanks |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
|
![]()
Hi,
probably it is because you did not use the -a bwtsw option. According to the manual (bwaw.1) it is needed for human: bwtsw Algorithm implemented in BWT-SW. This is the only method that works with the whole human genome. However, this module does not work with database smaller than 10MB and it is much slower than the other two. Bwtsw algorithm trades speed for memory. |
![]() |
![]() |
![]() |
#3 |
Member
Location: Canada Join Date: Apr 2009
Posts: 46
|
![]()
Hello, Chipper
You are right, I am only able to implement whole human genome by bwtsw. Guess bwa might not be competitive for sequencing SOLiD color space data. Thanks |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
|
![]()
Sorry, I don't follow, why would it not be competetive for SOLiD data? It takes some time to build the index, but once you have the index it is really fast.
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Canada Join Date: Apr 2009
Posts: 46
|
![]()
Sorry, I think I mislead in my reply, what I mean was that bwa couldn't index whole human genome in color space because bwtsw is the only way to do so. I can use -c for smaller genome, like chr1, ...etc.
Since some pipeline are using SOLiD data, I was thinking to generate the human genome index in color space and as you mentioned it is going to be fast once I have all those index files. So if now I want to align the color space data to human.fasta, I would have to pick another aligner? thanks for your reply |
![]() |
![]() |
![]() |
#6 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
|
![]()
Nils,
I am testing it with a 50 bp dataset (23.6 M reads). As expected, aln without indels and few mismatches is very fast. 2 MM was done after ~ 15 minutes with 4 threads. 4 MM probably ~ 10 x slower but I like the option to allow more mismatches at the end (with a good seed) which should make it much faster. Would be nice to compare it to BFAST if I ever manage to build that index... Any ideas on how to set up an ideal aligner comparison test for SOLiD data? |
![]() |
![]() |
![]() |
#8 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
|
![]() Quote:
I have found it takes about 6 hours to build one BFAST index on a 32GB quad-core machine. Like BWA, this needs to done only once per reference (save those indexes!). The BWA index I builit did not take too long to build either. |
|
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
|
![]()
Mapping with 6 MM with 2 in the seed (25 bp) is ~ 3x faster than with 4 MM in the full sequence. Will try with some recent datasets tomorrow.
|
![]() |
![]() |
![]() |
#10 |
Junior Member
Location: Canada Join Date: Aug 2009
Posts: 9
|
![]()
Hello there,
I am new to this forum, but glad to see so many great discussions going on. In the past, I have been mainly using MAQ to analyze the Solexa data. A few days ago, I started trying to use BWA to analyze the SOLiD data, partly because of its claimed fast speed, partly because of some of the problems I ran into when using MAQ for SOLiD data. I am running to a problem as described below. Just wonder if I can get some help from you experts. Thanks very much in advance! (sorry for having a long message as my first post, but I think it is necessary for you to understand the problem) #Problem: #I used the following commands trying to map pair-end SOLiD data in fastq format directly downloaded from the 1000 genome project site: bwa index -a bwtsw -c hg18.fa & [bwt_gen] Finished constructing BWT in 314 iterations. [bwa_index] 2054.17 seconds elapse. #This seem to work fine bwa aln -c hg18.fa SRR003188_1.fastq >SRR003188_1.sai bwa aln -c hg18.fa SRR003188_2.fastq >SRR003188_2.sai #these are the files generated including the original read files: -rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:47 SRR003188_1.fastq -rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:19 SRR003188_1.sai -rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:49 SRR003188_2.fastq -rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:20 SRR003188_2.sai bwa sampe -a 2400 hg18.fa SRR003188_1.sai SRR003188_2.sai SRR003188_1.fastq SRR003188_2.fastq >SRR003188.sam #message [bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] fail to infer insert size: too few good pairs [bwa_sai2sam_pe_core] time elapses: 3.11 sec [bwa_sai2sam_pe_core] change of coordinates in 0 alignments. [bwa_sai2sam_pe_core] align unmapped mate... [bwa_sai2sam_pe_core] time elapses: 0.71 sec [bwa_sai2sam_pe_core] refine gapped alignments... 1.58 sec [bwa_sai2sam_pe_core] print alignments... 0.43 sec [bwa_sai2sam_pe_core] 262144 sequences have been processed. [bwa_sai2sam_pe_core] convert to sequence coordinate... [infer_isize] fail to infer insert size: too few good pairs #when open the .sam file, it looks like this: VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 77 * 0 0 * * 0 0 GNNNNNNNNNNNNNNNNNNNNNNNNN !611%%(-+%*.&*.,&2,,'%( )31 VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 141 * 0 0 * * 0 0 TNNNNNNNNNNNNNNNNNNNNNNNNN !1:7%6);%.1/<%&717'/'7: ..... #this was the same when run samse with single input. Looks like to me that the color space didn't get converted to properly, therefore not finding any match. Also, the time used for aln and sampe/samse seems to be too little to me. |
![]() |
![]() |
![]() |
#11 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
|
![]()
Hi,
bwa uses fastQ files with colors represented as ACGT, perhaps the 1000 genomes fastq files represents it as 0123? Also, bwa does not use the first color so you may have to strip it or use the solid2fastq script. |
![]() |
![]() |
![]() |
#12 |
Junior Member
Location: Canada Join Date: Aug 2009
Posts: 9
|
![]()
Hi Chipper, thanks for your response. Yes, you are right about the fastq files from the 1000 genome project. I didn't know the bwa uses only nucleotide sequence. So I will what you suggested and see how it goes. Thanks again.
|
![]() |
![]() |
![]() |
#13 | |
Senior Member
Location: SEA Join Date: Nov 2009
Posts: 203
|
![]() Quote:
ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/ Last edited by KevinLam; 12-21-2009 at 06:47 PM. |
|
![]() |
![]() |
![]() |
#14 |
Member
Location: Davis, CA Join Date: Aug 2008
Posts: 88
|
![]()
though this is an old thread, it might be important to clarify ... are you referring to another tool called 'bwtsw', separate from bwa? Chipper was referring to the bwtsw indexing option to the 'bwa index' command ...
|
![]() |
![]() |
![]() |
#15 |
Member
Location: New York Join Date: Nov 2009
Posts: 13
|
![]()
I have a project that involves aligning SoLID data to Hg18. The short reads (both pair and single ended) are provided in a fastq file that looks like this
Code:
@SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50 T003.......0230..0.0.....220..2.010.301...321..111. +SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50 !%9#!!!!!!!#-$1!!2!%!!!!!%)&!!(!*,#!$2'!!!)/+!!%2,! The solid2fastq.pl script most often refrenced in BWA literature seems to require color space data in some kind of another format (multiple files, seperate quality and color data, perhaps different quality score scaling, etc...) Can anyone provide some pointers as to how I can convert this colorspace FASTQ file to a colorspace FASTQ file that is compatible with BWA's colorspace aligner (presumably BWA's colorspace format represents colors using nucleotide letters... as opposed to converting the colorspace reads to actual nucleotides). I want to make sure that I have
Many thanks, --Brad |
![]() |
![]() |
![]() |
#16 |
Member
Location: Davis, CA Join Date: Aug 2008
Posts: 88
|
![]()
Hi Brad,
You're right about the file formats that solid2fastq is expecting ... there's a <prefix>_F3.csfasta and a <prefix>_F3_QV.qual file, and then a stats file (not needed by the script). But your fastq is still in numerical representation of colorspace, whereas the fastq that results from solid2fastq.pl is really still in colorspace, but the colors are represented (for convenience -- probably for the data structures in bwa?) as [ATCG]. So I'd recommend splitting/converting your fastq back into the csfasta and QV.qual files, and then using solid2fastq. For reference, I'll paste in the heads of some csfasta and qual files I have (below). However, I haven't yet dealt with PE SOLiD reads ... so I don't know how that's going to figure in your hacking. I don't see any "R3" references in your fastq example, which I thought was the reverse read nomenclature ... so I can't advise any further on that. head solid0180sequencer_20091218_AS75_1_AS75_F3.csfasta Code:
# Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq # Cwd: /home/pipeline # Title: solid0180sequencer_20091218_AS75_1_AS75 >2_14_29_F3 T22301223212203112133123302220112001120102111112031 >2_14_37_F3 T00031010222021220101011121221111121121112121222212 >2_14_79_F3 T01231023120123320111231011110221102001010122213211 >2_14_97_F3 Code:
# Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq # Cwd: /home/pipeline # Title: solid0180sequencer_20091218_AS75_1_AS75 >2_14_29_F3 4 24 8 12 24 4 26 4 11 4 4 6 4 4 4 4 4 4 4 4 4 8 4 4 4 4 4 0 0 4 4 4 0 0 4 4 4 0 4 0 4 4 0 0 4 4 4 0 0 4 >2_14_37_F3 6 13 6 7 6 7 20 4 4 4 13 18 4 4 4 12 9 4 4 4 4 20 4 4 4 5 11 0 4 4 4 5 4 4 4 4 5 0 0 4 4 4 0 0 4 4 4 0 0 0 >2_14_79_F3 8 25 7 6 9 8 30 6 4 5 4 21 4 4 4 7 20 4 4 4 4 21 4 0 4 4 17 4 0 4 4 17 4 0 4 4 11 0 0 4 0 15 0 0 4 4 6 0 0 4 >2_14_97_F3 ~Joe |
![]() |
![]() |
![]() |
#17 |
Member
Location: New York Join Date: Nov 2009
Posts: 13
|
![]()
I was able to obtain the original .csfasta and .qual files from the scientists that generated them and convert them using solid2fastq.pl from BWA v 0.5.7. Along the way I had to fix 3 minor bugs in the script. I'm surer there are better ways to do this (I am nto a perl programmer) but I have included the altered script in case it is of use to anyone else.
@Joe, Thanks for the examples and advice! -- Brad Code:
#!/usr/bin/perl -w # Author: lh3 # Note: Ideally, this script should be written in C. It is a bit slow at present. # Also note that this script is different from the one contained in MAQ. # Three bugs fixed: bgulko 2010-04-06 # 1-replace -1 with 0 BEFORE trimming first quality value per line # 2-ignore comment lines that start with # # 3-ignore comments in name fields, ie characters after the "," in XXXX_YYYY_ZZZZ_F3,...... use strict; use warnings; use Getopt::Std; my %opts; my $version = '0.1.3'; my $usage = qq{ Usage: solid2fastq.pl <in.title> <out.prefix> Note: <in.title> is the string showed in the `# Title:' line of a ".csfasta" read file. Then <in.title>F3.csfasta is read sequence file and <in.title>F3_QV.qual is the quality file. If <in.title>R3.csfasta is present, this script assumes reads are paired; otherwise reads will be regarded as single-end. The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3 tag and `2' for F3. Usually you may want to use short <out.prefix> to save diskspace. Long <out.prefix> also causes troubles to maq. }; getopts('', \%opts); die($usage) if (@ARGV != 2); my ($title, $pre) = @ARGV; my (@fhr, @fhw); my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual'); my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0; if ($is_paired) { # paired end for (0 .. 3) { my $fn = "$title$fn_suff[$_]"; $fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz"); open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n"); } open($fhw[0], "|gzip >$pre.read1.fastq.gz") || die; # this is NOT a typo open($fhw[1], "|gzip >$pre.read2.fastq.gz") || die; open($fhw[2], "|gzip >$pre.single.fastq.gz") || die; my (@df, @dr); @df = &read1(1); @dr = &read1(2); while (@df && @dr) { if ($df[0] eq $dr[0]) { # mate pair print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1]; @df = &read1(1); @dr = &read1(2); } else { if ($df[0] le $dr[0]) { print {$fhw[2]} $df[1]; @df = &read1(1); } else { print {$fhw[2]} $dr[1]; @dr = &read1(2); } } } if (@df) { print {$fhw[2]} $df[1]; while (@df = &read1(1, $fhr[0], $fhr[1])) { print {$fhw[2]} $df[1]; } } if (@dr) { print {$fhw[2]} $dr[1]; while (@dr = &read1(2, $fhr[2], $fhr[3])) { print {$fhw[2]} $dr[1]; } } close($fhr[$_]) for (0 .. $#fhr); close($fhw[$_]) for (0 .. $#fhw); } else { # single end for (0 .. 1) { my $fn = "$title$fn_suff[$_]"; $fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz"); open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n"); } open($fhw[2], "|gzip >$pre.single.fastq.gz") || die; my @df; while (@df = &read1(1, $fhr[0], $fhr[1])) { print {$fhw[2]} $df[1]; } close($fhr[$_]) for (0 .. $#fhr); close($fhw[2]); } sub read1 { my $i = shift(@_); my $j = ($i-1)<<1; my ($key, $seq); my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]); my ($a, $b, $c, $key2); while (<$fhs>) { while (substr( $_,0,1) eq '#') { <$fhs>; }; my $t = <$fhq>; while (substr( $t,0,1) eq '#') { $t = <$fhq>; } if (/^>(\d+)_(\d+)_(\d+)_[FR]3/) { $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines ( $a, $b, $c ) = split( /_/, substr($t,1) ); $key2 = sprintf("%.4d_%.4d_%.4d", $a, $b, $c); # this line could be improved on 64-bit machines die(qq/** unmatched read name: '$_' != '$t' , '$key', '$key2'\n/) unless ($key eq $key2); my $name = "$pre:$1_$2_$3/$i"; $_ = substr(<$fhs>, 2); tr/0123./ACGTN/; my $s = $_; $_ = <$fhq>; s/-1\b/0/eg; s/^(\d+)\s*//; s/(\d+)\s*/chr($1+33)/eg; $seq = qq/\@$name\n$s+\n$_\n/; last; } } return defined($seq)? ($key, $seq) : (); } |
![]() |
![]() |
![]() |
Thread Tools | |
|
|