SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
bwa:how to align color space reads SOLiDance Bioinformatics 13 01-08-2013 05:27 AM
bwa color space error rcorbett Bioinformatics 6 06-24-2011 03:21 AM
Solid formats translator(base space/color space/double encoded) AronaldJ SOLiD 0 10-26-2010 12:10 AM
bwa question: quality discrepancy between a color-space alignment and its csfastq yenhuahuang1 Bioinformatics 4 03-15-2010 06:23 AM
direct mapping of color-space data against color-space begsch SOLiD 1 09-09-2009 09:25 PM

Reply
 
Thread Tools
Old 07-31-2009, 11:22 AM   #1
totalnew
Member
 
Location: Canada

Join Date: Apr 2009
Posts: 46
Default bwa color-space index

I like to build color-space indexing by bwa. The input fast should be in nucleotide space, so I use following command to index whole human genome:

>bwa index -c human.fasta

But segmentation fault occurred everytime like this,

[bwa_index] Pack nucleotide FASTA... 60.48 sec
[bwa_index] Convert nucleotide PAC to color PAC... 31.13 sec
[bwa_index] Reverse the packed sequence... 16.62 sec
[bwa_index] Construct BWT for the packed sequence...
Segmentation fault

Can anyone tell me why that happen?

thanks
totalnew is offline   Reply With Quote
Old 08-02-2009, 12:16 AM   #2
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Hi,

probably it is because you did not use the -a bwtsw option. According to the manual (bwaw.1) it is needed for human:

bwtsw Algorithm implemented in BWT-SW. This is the only method that works with the whole human genome. However, this module does not work with database smaller than 10MB and it is much slower than the other two. Bwtsw algorithm trades speed for memory.
Chipper is offline   Reply With Quote
Old 08-04-2009, 08:23 AM   #3
totalnew
Member
 
Location: Canada

Join Date: Apr 2009
Posts: 46
Default

Hello, Chipper

You are right, I am only able to implement whole human genome by bwtsw. Guess bwa might not be competitive for sequencing SOLiD color space data.

Thanks
totalnew is offline   Reply With Quote
Old 08-04-2009, 11:54 AM   #4
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Sorry, I don't follow, why would it not be competetive for SOLiD data? It takes some time to build the index, but once you have the index it is really fast.
Chipper is offline   Reply With Quote
Old 08-04-2009, 12:38 PM   #5
totalnew
Member
 
Location: Canada

Join Date: Apr 2009
Posts: 46
Default

Sorry, I think I mislead in my reply, what I mean was that bwa couldn't index whole human genome in color space because bwtsw is the only way to do so. I can use -c for smaller genome, like chr1, ...etc.

Since some pipeline are using SOLiD data, I was thinking to generate the human genome index in color space and as you mentioned it is going to be fast once I have all those index files. So if now I want to align the color space data to human.fasta, I would have to pick another aligner?

thanks for your reply
totalnew is offline   Reply With Quote
Old 08-04-2009, 12:52 PM   #6
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by totalnew View Post
Sorry, I think I mislead in my reply, what I mean was that bwa couldn't index whole human genome in color space because bwtsw is the only way to do so. I can use -c for smaller genome, like chr1, ...etc.

Since some pipeline are using SOLiD data, I was thinking to generate the human genome index in color space and as you mentioned it is going to be fast once I have all those index files. So if now I want to align the color space data to human.fasta, I would have to pick another aligner?

thanks for your reply
I was able to index the entire human genome with BWA (bwtsw) so it is possible. I would like to certainly like to hear your experiences with longer read lengths with SOLiD data (50 and 75bp) and BWA. I have not gotten it to run as fast as other methods, especially when I try to have higher error tolerances (>10% color errors, and long indels).
nilshomer is offline   Reply With Quote
Old 08-04-2009, 01:30 PM   #7
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Nils,

I am testing it with a 50 bp dataset (23.6 M reads). As expected, aln without indels and few mismatches is very fast. 2 MM was done after ~ 15 minutes with 4 threads. 4 MM probably ~ 10 x slower but I like the option to allow more mismatches at the end (with a good seed) which should make it much faster. Would be nice to compare it to BFAST if I ever manage to build that index...

Any ideas on how to set up an ideal aligner comparison test for SOLiD data?
Chipper is offline   Reply With Quote
Old 08-04-2009, 01:38 PM   #8
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by Chipper View Post
Nils,

I am testing it with a 50 bp dataset (23.6 M reads). As expected, aln without indels and few mismatches is very fast. 2 MM was done after ~ 15 minutes with 4 threads. 4 MM probably ~ 10 x slower but I like the option to allow more mismatches at the end (with a good seed) which should make it much faster. Would be nice to compare it to BFAST if I ever manage to build that index...

Any ideas on how to set up an ideal aligner comparison test for SOLiD data?
Create some simulations is the best bet. I would create a dataset composing of sets of 10K reads, each with X SNPs, Y color errors, and a Z base long indel. You can then vary X, Y, and Z to see what power you really have to detect variants and to be robust to errors (10% color error rate is not unheard of). This is what I did with BFAST, which has its own synthetic read generator, when I compared it to other aligners.

I have found it takes about 6 hours to build one BFAST index on a 32GB quad-core machine. Like BWA, this needs to done only once per reference (save those indexes!). The BWA index I builit did not take too long to build either.
nilshomer is offline   Reply With Quote
Old 08-04-2009, 02:15 PM   #9
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Mapping with 6 MM with 2 in the seed (25 bp) is ~ 3x faster than with 4 MM in the full sequence. Will try with some recent datasets tomorrow.
Chipper is offline   Reply With Quote
Old 08-27-2009, 02:08 PM   #10
pliang
Junior Member
 
Location: Canada

Join Date: Aug 2009
Posts: 9
Default BWA: getting sequences like "GNNNNNNNNNNNNNNNNNNNNNNNNN" from SOLiD color space reads

Hello there,

I am new to this forum, but glad to see so many great discussions going on.

In the past, I have been mainly using MAQ to analyze the Solexa data. A few days ago, I started trying to use BWA to analyze the SOLiD data, partly because of its claimed fast speed, partly because of some of the problems I ran into when using MAQ for SOLiD data. I am running to a problem as described below. Just wonder if I can get some help from you experts. Thanks very much in advance! (sorry for having a long message as my first post, but I think it is necessary for you to understand the problem)

#Problem:
#I used the following commands trying to map pair-end SOLiD data in fastq format directly downloaded from the 1000 genome project site:

bwa index -a bwtsw -c hg18.fa &

[bwt_gen] Finished constructing BWT in 314 iterations.
[bwa_index] 2054.17 seconds elapse.
#This seem to work fine

bwa aln -c hg18.fa SRR003188_1.fastq >SRR003188_1.sai
bwa aln -c hg18.fa SRR003188_2.fastq >SRR003188_2.sai

#these are the files generated including the original read files:
-rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:47 SRR003188_1.fastq
-rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:19 SRR003188_1.sai
-rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:49 SRR003188_2.fastq
-rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:20 SRR003188_2.sai

bwa sampe -a 2400 hg18.fa SRR003188_1.sai SRR003188_2.sai SRR003188_1.fastq SRR003188_2.fastq >SRR003188.sam
#message
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs
[bwa_sai2sam_pe_core] time elapses: 3.11 sec
[bwa_sai2sam_pe_core] change of coordinates in 0 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_sai2sam_pe_core] time elapses: 0.71 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 1.58 sec
[bwa_sai2sam_pe_core] print alignments... 0.43 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs

#when open the .sam file, it looks like this:
VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 77 * 0 0 * * 0 0 GNNNNNNNNNNNNNNNNNNNNNNNNN !611%%(-+%*.&*.,&2,,'%(
)31
VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 141 * 0 0 * * 0 0 TNNNNNNNNNNNNNNNNNNNNNNNNN !1:7%6);%.1/<%&717'/'7:
.....

#this was the same when run samse with single input. Looks like to me that the color space didn't get converted to properly, therefore not finding any match. Also, the time used for aln and sampe/samse seems to be too little to me.
pliang is offline   Reply With Quote
Old 08-27-2009, 10:38 PM   #11
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Hi,

bwa uses fastQ files with colors represented as ACGT, perhaps the 1000 genomes fastq files represents it as 0123? Also, bwa does not use the first color so you may have to strip it or use the solid2fastq script.
Chipper is offline   Reply With Quote
Old 08-28-2009, 06:51 AM   #12
pliang
Junior Member
 
Location: Canada

Join Date: Aug 2009
Posts: 9
Default

Hi Chipper, thanks for your response. Yes, you are right about the fastq files from the 1000 genome project. I didn't know the bwa uses only nucleotide sequence. So I will what you suggested and see how it goes. Thanks again.
pliang is offline   Reply With Quote
Old 12-20-2009, 11:28 PM   #13
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 199
Default

Quote:
Originally Posted by totalnew View Post
I like to build color-space indexing by bwa. The input fast should be in nucleotide space, so I use following command to index whole human genome:

>bwa index -c human.fasta

But segmentation fault occurred everytime like this,

[bwa_index] Pack nucleotide FASTA... 60.48 sec
[bwa_index] Convert nucleotide PAC to color PAC... 31.13 sec
[bwa_index] Reverse the packed sequence... 16.62 sec
[bwa_index] Construct BWT for the packed sequence...
Segmentation fault

Can anyone tell me why that happen?

thanks
Are there pre built indexes for BWA as there are for bowtie?
ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/

Last edited by KevinLam; 12-21-2009 at 05:47 PM.
KevinLam is offline   Reply With Quote
Old 02-03-2010, 02:03 PM   #14
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 88
Default

though this is an old thread, it might be important to clarify ... are you referring to another tool called 'bwtsw', separate from bwa? Chipper was referring to the bwtsw indexing option to the 'bwa index' command ...
jnfass is offline   Reply With Quote
Old 03-08-2010, 11:55 AM   #15
bgulko
Member
 
Location: New York

Join Date: Nov 2009
Posts: 13
Default Converting NCBI colorspace fastq to BWA Colorspace Fastq.

I have a project that involves aligning SoLID data to Hg18. The short reads (both pair and single ended) are provided in a fastq file that looks like this

Code:
@SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50
T003.......0230..0.0.....220..2.010.301...321..111.
+SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50
!%9#!!!!!!!#-$1!!2!%!!!!!%)&!!(!*,#!$2'!!!)/+!!%2,!
Clearly this is colorspace data, and I'd like to use BWA to align it (I already have a suite of tools compatible with BWA, and this is near the end of the project, so I don't really want to switch).

The solid2fastq.pl script most often refrenced in BWA literature seems to require color space data in some kind of another format (multiple files, seperate quality and color data, perhaps different quality score scaling, etc...)

Can anyone provide some pointers as to how I can convert this colorspace FASTQ file to a colorspace FASTQ file that is compatible with BWA's colorspace aligner (presumably BWA's colorspace format represents colors using nucleotide letters... as opposed to converting the colorspace reads to actual nucleotides).

I want to make sure that I have
  • the correct colorspace name (0->A, 1->C, 2->G 3->T, *->N)
  • the correct quality score mapping and representation
  • allows paired reads to be correctly treated by BWA

Many thanks,
--Brad
bgulko is offline   Reply With Quote
Old 03-08-2010, 12:15 PM   #16
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 88
Default @Brad: hack back to color representiation to use solid2fastq.pl

Hi Brad,

You're right about the file formats that solid2fastq is expecting ... there's a <prefix>_F3.csfasta and a <prefix>_F3_QV.qual file, and then a stats file (not needed by the script). But your fastq is still in numerical representation of colorspace, whereas the fastq that results from solid2fastq.pl is really still in colorspace, but the colors are represented (for convenience -- probably for the data structures in bwa?) as [ATCG].

So I'd recommend splitting/converting your fastq back into the csfasta and QV.qual files, and then using solid2fastq. For reference, I'll paste in the heads of some csfasta and qual files I have (below). However, I haven't yet dealt with PE SOLiD reads ... so I don't know how that's going to figure in your hacking. I don't see any "R3" references in your fastq example, which I thought was the reverse read nomenclature ... so I can't advise any further on that.

head solid0180sequencer_20091218_AS75_1_AS75_F3.csfasta
Code:
# Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
# Cwd: /home/pipeline
# Title: solid0180sequencer_20091218_AS75_1_AS75
>2_14_29_F3
T22301223212203112133123302220112001120102111112031
>2_14_37_F3
T00031010222021220101011121221111121121112121222212
>2_14_79_F3
T01231023120123320111231011110221102001010122213211
>2_14_97_F3
head solid0180sequencer_20091218_AS75_1_AS75_F3_QV.qual
Code:
# Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
# Cwd: /home/pipeline
# Title: solid0180sequencer_20091218_AS75_1_AS75
>2_14_29_F3
4 24 8 12 24 4 26 4 11 4 4 6 4 4 4 4 4 4 4 4 4 8 4 4 4 4 4 0 0 4 4 4 0 0 4 4 4 0 4 0 4 4 0 0 4 4 4 0 0 4 
>2_14_37_F3
6 13 6 7 6 7 20 4 4 4 13 18 4 4 4 12 9 4 4 4 4 20 4 4 4 5 11 0 4 4 4 5 4 4 4 4 5 0 0 4 4 4 0 0 4 4 4 0 0 0 
>2_14_79_F3
8 25 7 6 9 8 30 6 4 5 4 21 4 4 4 7 20 4 4 4 4 21 4 0 4 4 17 4 0 4 4 17 4 0 4 4 11 0 0 4 0 15 0 0 4 4 6 0 0 4 
>2_14_97_F3
hope that helps ...
~Joe
jnfass is offline   Reply With Quote
Old 04-06-2010, 11:49 AM   #17
bgulko
Member
 
Location: New York

Join Date: Nov 2009
Posts: 13
Default Problem Solved - For Now

I was able to obtain the original .csfasta and .qual files from the scientists that generated them and convert them using solid2fastq.pl from BWA v 0.5.7. Along the way I had to fix 3 minor bugs in the script. I'm surer there are better ways to do this (I am nto a perl programmer) but I have included the altered script in case it is of use to anyone else.

@Joe, Thanks for the examples and advice!

-- Brad

Code:
#!/usr/bin/perl -w

# Author: lh3
# Note: Ideally, this script should be written in C. It is a bit slow at present.
# Also note that this script is different from the one contained in MAQ.
# Three bugs fixed: bgulko 2010-04-06
#   1-replace -1 with 0 BEFORE trimming first quality value per line
#   2-ignore comment lines that start with #
#   3-ignore comments in name fields, ie characters after the "," in XXXX_YYYY_ZZZZ_F3,......

use strict;
use warnings;
use Getopt::Std;

my %opts;
my $version = '0.1.3';
my $usage = qq{
Usage: solid2fastq.pl <in.title> <out.prefix>

Note: <in.title> is the string showed in the `# Title:' line of a
      ".csfasta" read file. Then <in.title>F3.csfasta is read sequence
      file and <in.title>F3_QV.qual is the quality file. If
      <in.title>R3.csfasta is present, this script assumes reads are
      paired; otherwise reads will be regarded as single-end.

      The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
      tag and `2' for F3. Usually you may want to use short <out.prefix>
      to save diskspace. Long <out.prefix> also causes troubles to maq.

};

getopts('', \%opts);
die($usage) if (@ARGV != 2);
my ($title, $pre) = @ARGV;
my (@fhr, @fhw);
my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual');
my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0;
if ($is_paired) { # paired end
  for (0 .. 3) {
	my $fn = "$title$fn_suff[$_]";
	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[0], "|gzip >$pre.read1.fastq.gz") || die; # this is NOT a typo
  open($fhw[1], "|gzip >$pre.read2.fastq.gz") || die;
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my (@df, @dr);
  @df = &read1(1); @dr = &read1(2);
  while (@df && @dr) {
	if ($df[0] eq $dr[0]) { # mate pair
	  print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1];
	  @df = &read1(1); @dr = &read1(2);
	} else {
	  if ($df[0] le $dr[0]) {
		print {$fhw[2]} $df[1];
		@df = &read1(1);
	  } else {
		print {$fhw[2]} $dr[1];
		@dr = &read1(2);
	  }
	}
  }
  if (@df) {
	print {$fhw[2]} $df[1];
	while (@df = &read1(1, $fhr[0], $fhr[1])) {
	  print {$fhw[2]} $df[1];
	}
  }
  if (@dr) {
	print {$fhw[2]} $dr[1];
	while (@dr = &read1(2, $fhr[2], $fhr[3])) {
	  print {$fhw[2]} $dr[1];
	}
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[$_]) for (0 .. $#fhw);
} else { # single end
  for (0 .. 1) {
	my $fn = "$title$fn_suff[$_]";
	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my @df;
  while (@df = &read1(1, $fhr[0], $fhr[1])) {
	print {$fhw[2]} $df[1];
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[2]);
}

sub read1 {
  my $i = shift(@_);
  my $j = ($i-1)<<1;
  my ($key, $seq);
  my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
  my ($a, $b, $c, $key2);
  while (<$fhs>) {
	while (substr( $_,0,1) eq '#') { <$fhs>; };
	my $t = <$fhq>;
	while (substr( $t,0,1) eq '#') { $t = <$fhq>; }
	if (/^>(\d+)_(\d+)_(\d+)_[FR]3/) {
	  $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
          ( $a, $b, $c ) = split( /_/, substr($t,1) );
	  $key2 = sprintf("%.4d_%.4d_%.4d", $a, $b, $c); # this line could be improved on 64-bit machines
	  die(qq/** unmatched read name: '$_' != '$t' , '$key', '$key2'\n/) unless ($key eq $key2);
	  my $name = "$pre:$1_$2_$3/$i";
	  $_ = substr(<$fhs>, 2);
	  tr/0123./ACGTN/;
	  my $s = $_;
	  $_ = <$fhq>;
	  s/-1\b/0/eg;
	  s/^(\d+)\s*//;
	  s/(\d+)\s*/chr($1+33)/eg;
	  $seq = qq/\@$name\n$s+\n$_\n/;
	  last;
	}
  }
  return defined($seq)? ($key, $seq) : ();
}
bgulko is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:45 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO