SEQanswers

Old 08-12-2009, 08:20 PM   #1
samt
Member
 
Location: NYC

Join Date: Aug 2009
Posts: 14
Default Convert SOLiD fastq to Illumina fastq

I want to do this in order to be able to use Bowtie software, but it doesn't support SOLiD data. Is there software to convert it?
Old 08-12-2009, 08:32 PM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by samt View Post
I want to do this in order to be able to use Bowtie software, but it doesn't support SOLiD data. Is there software to convert it?
If there was a conversion process, it would by definition support SOLiD data. It does not, so if you try to use it, even with double-encoded data, you will get poor results. Try BFAST or BWA for SOLiD data (you could also try SHRiMP).
Old 08-12-2009, 08:37 PM   #3
samt
Member
 
Location: NYC

Join Date: Aug 2009
Posts: 14
Default

Thanks. Will BFAST work on a machine with less than 3 GB of memory? (I am mapping to the whole mouse genome.)
Old 08-12-2009, 08:47 PM   #4
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by samt View Post
Thanks, will BFAST work on a machine with <3 GBs of memory? (Mapping to whole mouse genome)
The mouse genome is ~3 Gb (gigabases), right? Either way, you will have to split the reference to get it to fit on your machine. This is easy and supported with BFAST, although 3 GB is not very much RAM at all. How many reads do you have, what length are they, and what computational resources are available?
Old 08-12-2009, 11:00 PM   #5
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Quote:
Originally Posted by samt View Post
Thanks, will BFAST work on a machine with <3 GBs of memory? (Mapping to whole mouse genome)
BWA should work with less than 3 GB and is probably the best choice if resources are limited. Other programs that index the reads rather than the genome will also work but won't be as fast (MAQ, ZOOM!, SOCS?).
Old 08-13-2009, 01:48 AM   #6
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Quote:
Originally Posted by samt View Post
I want to do this in order to be able to use Bowtie software, but it doesn't support SOLiD data. Is there software to convert it?
It is very easy to do that. Maybe I can help you; you can reach me at: benm.fbx@gmail.com
Old 08-13-2009, 06:46 AM   #7
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Quote:
Originally Posted by BENM View Post
It is very easy to do that, maybe I can help you, you can reach me at: benm.fbx@gmail.com
It's easy to jump off a building, doesn't mean one should.
Old 08-13-2009, 04:28 PM   #8
samt
Member
 
Location: NYC

Join Date: Aug 2009
Posts: 14
Default

Quote:
Originally Posted by nilshomer View Post
The mouse genome is ~3 Gb (gigabases), right? Either way, you will have to split the reference to get it to fit on your machine. This is easy and supported with BFAST, although 3 GB is not very much RAM at all. How many reads do you have, what length are they, and what computational resources are available?
~100 million reads, 34 bp (SOLiD). I was hoping to use my own machine, but I also have access to a sufficiently powerful cluster.

It seems you are still actively developing BFAST and there isn't much documentation, so it's hard to tell whether it has all the options I will need for the data. I want to map, assemble, and align back to the genome.
Old 08-13-2009, 04:29 PM   #9
samt
Member
 
Location: NYC

Join Date: Aug 2009
Posts: 14
Default

Quote:
Originally Posted by ECO View Post
It's easy to jump off a building, doesn't mean one should.
I'm not worried about calling SNPs, just obtaining a consensus sequence mapped to the genome. Would converting from CS (color space) to NT (nucleotide space) still be a bad choice?
Old 08-13-2009, 04:34 PM   #10
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by samt View Post
~100 million reads, 34 bp (SOLiD). I was hoping to use my own machine, but I also have access to a sufficiently powerful cluster.

It seems you are still actively developing BFAST and there isn't much documentation, so it's hard to tell whether it has all the options I will need for the data. I want to map, assemble, and align back to the genome.
There is a very complete reference manual that comes with the distribution. Please see download instructions at http://genome.ucla.edu/bfast
Old 08-14-2009, 01:21 AM   #11
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Quote:
Originally Posted by samt View Post
I'm not worried about calling SNPs, just obtaining a consensus sequence mapped to the genome. Would converting from CS (color space) to NT (nucleotide space) still be a bad choice?
Converting will only work as long as you do not have any CS errors (a single CS error changes the rest of the sequence in NT space). The point of SOLiD is that, even with a relatively high CS error rate, your mappings will have a very low error rate in NT space thanks to the 2-base encoding. This is why you need specialised software for the mapping, such as BFAST or BWA; otherwise you will get very few alignments.
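
To make the error-propagation point concrete, here is a minimal Python sketch (not taken from any script in this thread). It assumes the standard SOLiD 2-base encoding, in which each color equals the XOR of the two adjacent base codes when A=0, C=1, G=2, T=3. The function name `decode_colors` and the simulated error position are illustrative assumptions.

```python
# Sketch: naive color-space -> base-space decoding, and how one color error propagates.
# Assumption: standard SOLiD 2-base encoding with A=0, C=1, G=2, T=3,
# where each color is the XOR of the codes of the two adjacent bases.
BASES = "ACGT"


def decode_colors(read: str) -> str:
    """Decode a read such as 'G2203...' (leading primer base followed by colors 0-3)."""
    prev = BASES.index(read[0])  # the primer base itself is not part of the output
    out = []
    for color in read[1:]:
        prev ^= int(color)       # next base code = previous base code XOR color
        out.append(BASES[prev])
    return "".join(out)


good = "G2203012023131303312303100"
# Simulate a single color-call error at the 7th color (2 -> 3).
bad = good[:7] + "3" + good[8:]

decoded_good = decode_colors(good)
decoded_bad = decode_colors(bad)
mismatches = sum(a != b for a, b in zip(decoded_good, decoded_bad))

print(decoded_good)
print(decoded_bad)
print(mismatches, "of", len(decoded_good), "bases differ")
```

Every base from the miscalled color onward comes out different, which is why an aligner that works directly in color space can tolerate color errors that a convert-then-align approach cannot.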
Old 08-28-2009, 12:52 AM   #12
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

If your SOLiD FASTQ looks like the example below, you can try this script to convert it to Solexa/Illumina FASTQ:
Code:
@BARB_20071114_2_YorubanMP-BC3_3_16_150_F3
T0220100010131232212020122
+BARB_20071114_2_YorubanMP-BC3_3_16_150_F3
15 21 27 26 24 5 23 18 26 21 11 25 25 19 8 4 25 8 24 7 4 15 18 19 15
Attached Files
File Type: pl SOLid2Solexa.pl (4.0 KB, 536 views)

Last edited by BENM; 08-30-2009 at 02:15 AM.
Old 08-28-2009, 07:38 AM   #13
pliang
Junior Member
 
Location: Canada

Join Date: Aug 2009
Posts: 9
Default

Hi BENM,

Thanks for providing the Perl script. I am using SOLiD files from the 1000 Genomes Project, and the data look like this:

@VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114
G2203012023131303312303100
+
!611%%(-+%*.&*.,&2,,'%()31

With your script, the quality line gets lost. I wonder whether, in this case, the original quality line can be kept unchanged apart from removing the first character. I am new to SOLiD data, so I want to double-check with you. It may be useful for others if you can modify your script to accommodate this format.

Thanks
Old 08-30-2009, 02:16 AM   #14
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Hi, pliang

samt's question was "Convert SOLiD fastq to Illumina fastq", and Illumina FASTQ differs from standard (Sanger) FASTQ in its quality format.

The syntax of Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q:

$Q = 10 * log(1 + 10 ** ((ord($sq) - 64) / 10.0)) / log(10);
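
The same formula can be sketched in Python as a quick sanity check. This is only an illustration of the formula above; the function name `solexa_char_to_phred` is an assumption, not part of any script attached to this thread.

```python
import math


def solexa_char_to_phred(sq: str) -> float:
    """Phred quality for one Solexa-encoded quality character (ASCII offset 64)."""
    solexa_q = ord(sq) - 64                       # Solexa quality, may be negative
    return 10 * math.log10(1 + 10 ** (solexa_q / 10.0))


# Print character, Solexa quality, and the corresponding Phred quality.
for ch in ";@Jh":
    print(ch, ord(ch) - 64, round(solexa_char_to_phred(ch), 2))
```

At high qualities the two scales nearly coincide; the difference only matters at the low end, where a Solexa quality of 0 corresponds to a Phred quality of about 3.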

The ASCII characters in Solexa FASTQ mean:
Code:
CHAR	DEC	QUALITY
A	65	1
B	66	2
C	67	3
D	68	4
E	69	5
F	70	6
G	71	7
H	72	8
I	73	9
J	74	10
K	75	11
L	76	12
M	77	13
N	78	14
O	79	15
P	80	16
Q	81	17
R	82	18
S	83	19
T	84	20
U	85	21
V	86	22
W	87	23
X	88	24
Y	89	25
Z	90	26
[	91	27
\	92	28
]	93	29
^	94	30
_	95	31
`	96	32
a	97	33
b	98	34
c	99	35
d	100	36
e	101	37
f	102	38
g	103	39
h	104	40
;	59	-5
<	60	-4
=	61	-3
>	62	-2
?	63	-1
@	64	0
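By contrast, standard (Sanger) FASTQ stores the Phred quality as the ASCII character with code quality + 33. The table below is the Solexa->Sanger lookup: QUALITY is the Solexa quality (-64 to 64), INDEX is that quality + 64, and CHAR is the Sanger character it converts to:
Code:
CHAR    INDEX   QUALITY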
!       0       -64
!       1       -63
!       2       -62
!       3       -61
!       4       -60
!       5       -59
!       6       -58
!       7       -57
!       8       -56
!       9       -55
!       10      -54
!       11      -53
!       12      -52
!       13      -51
!       14      -50
!       15      -49
!       16      -48
!       17      -47
!       18      -46
!       19      -45
!       20      -44
!       21      -43
!       22      -42
!       23      -41
!       24      -40
!       25      -39
!       26      -38
!       27      -37
!       28      -36
!       29      -35
!       30      -34
!       31      -33
!       32      -32
!       33      -31
!       34      -30
!       35      -29
!       36      -28
!       37      -27
!       38      -26
!       39      -25
!       40      -24
!       41      -23
!       42      -22
!       43      -21
!       44      -20
!       45      -19
!       46      -18
!       47      -17
!       48      -16
!       49      -15
!       50      -14
!       51      -13
!       52      -12
!       53      -11
!       54      -10
"       55      -9
"       56      -8
"       57      -7
"       58      -6
"       59      -5
"       60      -4
#       61      -3
#       62      -2
$       63      -1
$       64      0
%       65      1
%       66      2
&       67      3
&       68      4
'       69      5
(       70      6
)       71      7
*       72      8
+       73      9
+       74      10
,       75      11
-       76      12
.       77      13
/       78      14
0       79      15
1       80      16
2       81      17
3       82      18
4       83      19
5       84      20
6       85      21
7       86      22
8       87      23
9       88      24
:       89      25
;       90      26
<       91      27
=       92      28
>       93      29
?       94      30
@       95      31
A       96      32
B       97      33
C       98      34
D       99      35
E       100     36
F       101     37
G       102     38
H       103     39
I       104     40
J       105     41
K       106     42
L       107     43
M       108     44
N       109     45
O       110     46
P       111     47
Q       112     48
R       113     49
S       114     50
T       115     51
U       116     52
V       117     53
W       118     54
X       119     55
Y       120     56
Z       121     57
[       122     58
\       123     59
]       124     60
^       125     61
_       126     62
`       127     63
a       128     64
So it is easy to convert Solexa->Sanger quality: you just need to build a conversion table in a Perl script, like this:
Code:
# Solexa->Sanger quality conversion table
# Index = Solexa quality + 64; value = Sanger character (rounded Phred quality + 33)
my @conv_table;
for (-64..64) {
    $conv_table[$_+64] = chr(int(33 + 10*log(1+10**($_/10.0))/log(10)+.499));
}
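
For readers who do not use Perl, the lookup-table idea can be sketched in Python roughly as follows. It assumes Solexa characters use ASCII offset 64 and Sanger characters use offset 33, and it mirrors the `+ 0.499` rounding of the Perl line; the names `build_table` and `solexa_to_sanger` are illustrative only.

```python
import math

SOLEXA_OFFSET = 64   # Solexa/Illumina quality characters: chr(quality + 64)
SANGER_OFFSET = 33   # Sanger quality characters: chr(phred + 33)


def build_table() -> dict:
    """Map each Solexa quality character code to its Sanger quality character."""
    table = {}
    for solexa_q in range(-64, 65):
        phred = 10 * math.log10(1 + 10 ** (solexa_q / 10.0))
        # Same rounding as the Perl snippet: add 0.499, then truncate.
        table[solexa_q + SOLEXA_OFFSET] = chr(int(SANGER_OFFSET + phred + 0.499))
    return table


TABLE = build_table()


def solexa_to_sanger(qual: str) -> str:
    """Convert a whole Solexa quality string to Sanger encoding."""
    return qual.translate(TABLE)


print(solexa_to_sanger("hJ@;"))
```

Building the table once and translating whole strings avoids recomputing the logarithm for every base, which matters for files with many millions of reads.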

I am trying to write a universal script that converts between the Solexa/Illumina, SOLiD/ABI, 454/Roche and 3730/Sanger formats for different purposes, but I need to know your requirements first. After that, I will share it with you all.

I hope this answers your question.
BTW, I have attached SOLiD2Std.pl for your case; it is SOLid2Solexa.pl with a small change.
Attached Files
File Type: pl SOLiD2Std.pl (5.2 KB, 217 views)

Last edited by BENM; 03-26-2012 at 07:40 PM.
Old 08-30-2009, 08:19 PM   #15
pliang
Junior Member
 
Location: Canada

Join Date: Aug 2009
Posts: 9
Default

Hi BENM:

Thank you for your response and the new information. It happens that I need to convert the SOLiD color-space sequence in FASTQ to Solexa format, for both the sequence and the quality. I believe the quality scores are already ASCII-encoded (see the sequence entry copied in my first post), which is why I thought the quality line could be kept unchanged for my use. Am I right about this? In any case, I think a tool for converting between the data formats of different platforms would be useful for us. Thanks again.
Old 08-30-2009, 10:23 PM   #16
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Re pliang:

Why would you want to convert color space to sequence space before alignment? Basically, why do you want SOLiD color space data in Illumina format? Bowtie does not work with color space (yet) and no amount of "input hacking" will get it to work right now.
Old 10-05-2009, 06:54 AM   #17
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44
Default Editing SOLiD2Std.pl to include more colorspace->basespace

Hi BENM

I edited your SOLiD2Std.pl script to include some extra color-space mappings that were not considered originally.

I wanted to include the following base-space mapping:
Basically, any base (A, C, G, T) followed by a '4', '5' or '.' becomes 'N'. An 'N' to 'N' transition is also represented by several different color-space symbols (0, 1, 2, 3, 6, '.').

Code:
A4: N
A.: N
A5: N
C4: N
C.: N
C5: N
G4: N
G.: N
G5: N
T4: N
T.: N
T5: N
N5: A C T or G
N.: N
N6: N
N0: N
N1: N
N2: N
N3: N
I am new to Perl, but I tried to change your script to include this conversion.
I edited the following part:

Code:
# SOLiD color code
my @code = ([0,1,2,3,'.',4,5],[1,0,3,2,'.',4,5],[2,3,0,1,'.',4,5],[3,2,1,0,'.',4,5],[5,5,5,5,'.',6,0],[5,5,5,5,1,2,3],[5,5,5,5,1,2,3]);
my @bases = qw(A C G T N N N);
my %decode = ();
foreach my $i(0..7)
{
	foreach my $j(0..7)
	{
		$decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
	}
}
It works!
However, Perl prints warnings when I run the script, although they do not prevent it from working, which is good.
Perl gives me the following messages:

Code:
Use of uninitialized value in hash element at SOLid2Std.pl line 49.
Use of uninitialized value within @bases in hash element at SOLid2Std.pl line 49.
What am I doing wrong?
line 49 is:
Code:
$decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
Thank You

Last edited by inesdesantiago; 10-05-2009 at 06:57 AM. Reason: to mention line 49
Old 10-05-2009, 09:35 PM   #18
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Hi inesdesantiago

Thank you for your suggestions. In color space, if one color call cannot be recognized by the SOLiD™ System, all the bases after it become uncertain too. So, when converting color space to nucleotide bases, the read decodes to "N" instead of a real base from that point on. For example:

Code:
@example1
G2203012023131303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
@example2
G220301.023131303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
@example3
G2203012023141303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
@example4
G2203012023151303512303100
+
!611%%(-+%*.&*.,&2,,'%()31
There is a dot in the example2 read, a "4" in the example3 read and a "5" in the example4 read, so these positions are converted to "N" according to your rules:
Code:
A4: N
A.: N
A5: N
C4: N
C.: N
C5: N
G4: N
G.: N
G5: N
T4: N
T.: N
T5: N
N5: A C T or G
N.: N
N6: N
N0: N
N1: N
N2: N
N3: N
which gives:
Code:
@example1
AGGCCAGGATGCATTATGATTACCC
+
611%%(-+%*.&*.,&2,,'%()31
@example2
AGGCCANNNNNNNNNNNNNNNNNNN
+
611%%(-+%*.&*.,&2,,'%()31
@example3
AGGCCAGGATGNNNNNNNNNNNNNN
+
611%%(-+%*.&*.,&2,,'%()31
@example4
AGGCCAGGATGNNNNNGTCGGCAAA
+
611%%(-+%*.&*.,&2,,'%()31
So you don't need to change the "# SOLiD color code" part of the script. You just need to modify line 169:
Code:
	$current_base = $decode{$colors[$i]}->{$last_base};
change it to:
Code:
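# After an N, color 5 restarts decoding from a randomly chosen base.
# String comparison (eq) avoids a numeric warning when the color is '.'.
if (($last_base=~/N/i)&&($colors[$i] eq '5'))
{
	$current_base = $bases[int(rand(@bases))];
}
else
{
	# Any transition missing from %decode (e.g. '.', 4, 5) decodes to N.
	$current_base = (exists $decode{$colors[$i]}->{$last_base}) ? $decode{$colors[$i]}->{$last_base} : "N";
}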
This is simpler than your approach.
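
The decoding rules discussed in this post can be sketched in Python as below. This is a hedged illustration, not the attached SOLiD2Std.pl: the names `decode_with_n` and `to_base_space_record` are hypothetical, the 2-base encoding is assumed to be the standard XOR scheme (A=0, C=1, G=2, T=3), and the leading primer base is assumed to be one of A, C, G, T. As a side note, the "uninitialized value" warnings reported earlier in the thread are what one would expect from looping over indices 0..7 when @bases and each row of @code hold only seven entries (indices 0..6).

```python
import random

BASES = "ACGT"


def decode_with_n(read: str, rng=random) -> str:
    """Decode a color-space read; unreadable colors ('.', 4, 5, 6) yield N."""
    prev = read[0]                      # primer base, assumed to be A, C, G or T
    out = []
    for color in read[1:]:
        if prev == "N":
            # Rule from the post: N followed by color 5 restarts with a random base;
            # any other color after N stays N.
            prev = rng.choice(BASES) if color == "5" else "N"
        elif color in "0123":
            prev = BASES[BASES.index(prev) ^ int(color)]
        else:
            prev = "N"                  # '.', 4, 5 or 6 after a real base
        out.append(prev)
    return "".join(out)


def to_base_space_record(name: str, read: str, qual: str, rng=random) -> str:
    """Build a 4-line FASTQ record, dropping the quality of the primer base."""
    return f"@{name}\n{decode_with_n(read, rng)}\n+\n{qual[1:]}\n"


print(to_base_space_record("example2",
                           "G220301.023131303312303100",
                           "!611%%(-+%*.&*.,&2,,'%()31"), end="")
```

Passing an `rng` object makes the random restart reproducible in tests; in practice, as noted in the thread, many people simply discard reads containing such calls.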

BTW, because SOLiD reads are very short, most people discard reads that contain ".", "4", "5" or "6" in color space. I think this is acceptable given the SOLiD™ System's very high throughput; we don't need these uncertain or low-quality reads.
Attached Files
File Type: pl SOLiD2Std.pl (4.0 KB, 300 views)

Last edited by BENM; 10-06-2009 at 09:34 PM.
Old 12-24-2009, 03:09 AM   #19
lix
Member
 
Location: Beijing

Join Date: Sep 2009
Posts: 17
Default

Hi BENM

Thank you for your script. I tried your SOLiD2Std.pl on data like this:

@exa1
T1011122220100230032132.2111111002.1
+
!)+%.*%*+2'0%%%-%+%*5'%!%9+'%+<+0%!%
@exa2
T0101233211103200232333.2111211002.1
+
!,.+'+')'390%%%%%%%'%%%!-<++++<99%!%
@exa3
T0312202213101213131111.1110131102.1
+
!93<*/18+%:9%+075*%:;+6!3<26%/<%-%!%


and the result is like this:

@exa1
GGTGTCTCTTGGGATTTAGTAGNNNNNNNNNNNNN
+
)+%.*%*+2'0%%%-%+%*5'%!%9+'%+<+0%!%
@exa2
TGGTCGCTGTGGCTTTCGATATNNNNNNNNNNNNN
+
,.+'+')'390%%%%%%%'%%%!-<++++<99%!%
@exa3
TACTCCTCATGGTCATGCACACNNNNNNNNNNNNN
+
93<*/18+%:9%+075*%:;+6!3<26%/<%-%!%

It seems that every base from the first dot "." onward is converted to "N".
Is that correct?

Thank you.
Old 12-24-2009, 07:06 AM   #20
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Without aligning (i.e. knowing the decoded DNA sequence), a missing base does not allow for deterministic decoding (there are actually four possible sequences after a missing base).
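
A small Python sketch of the "four possible sequences" point, assuming the standard XOR form of the 2-base encoding (A=0, C=1, G=2, T=3). The function name `candidate_tails` is illustrative; the input is the run of colors that follows the first "." in the @exa1 read above.

```python
BASES = "ACGT"


def candidate_tails(colors_after_gap: str) -> list:
    """Decode the colors after a missing call once per possible base at the gap."""
    tails = []
    for start in BASES:                 # the base at the gap is unknown: try all four
        code = BASES.index(start)
        tail = [start]
        for color in colors_after_gap:
            code ^= int(color)          # next base code = previous base code XOR color
            tail.append(BASES[code])
        tails.append("".join(tail))
    return tails


# Colors between the first and second '.' of the @exa1 read.
for tail in candidate_tails("2111111002"):
    print(tail)
```

The four candidates differ at every position, so without a reference to align against there is no way to pick one of them, which is why the converter falls back to "N".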