#1 |
Member
Location: NYC Join Date: Aug 2009
Posts: 14
I want to convert SOLiD FASTQ to Illumina FASTQ so that I can use Bowtie, which doesn't support SOLiD data. Is there software that does this conversion?
#2 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
#3 |
Member
Location: NYC Join Date: Aug 2009
Posts: 14
Thanks. Will BFAST work on a machine with less than 3 GB of memory? (I am mapping to the whole mouse genome.)
#4 |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
The mouse genome is ~3 Gb, right? Either way, you will have to split the reference to get it to fit on your machine. This is easy and supported by BFAST, although 3 GB is not much RAM at all. How many reads do you have, what length are they, and what computational resources are available?
#5 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
BWA should work with less than 3 GB and is probably the best choice if resources are limited. Other programs that index the reads rather than the genome will also work but won't be as fast (MAQ, ZOOM!, SOCS?).
#6 | |
Member
Location: PRC Join Date: May 2009
Posts: 33
#7 | |
--Site Admin--
Location: SF Bay Area, CA, USA Join Date: Oct 2007
Posts: 1,358
#8 | |
Member
Location: NYC Join Date: Aug 2009
Posts: 14
It seems you are still actively developing BFAST and there isn't much documentation, so it's hard to tell whether it has all the options I will need for my data. I want to map, assemble, and align back to the genome.
#9 |
Member
Location: NYC Join Date: Aug 2009
Posts: 14
#10 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
#11 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
Converting will only work as long as you do not have any color-space (CS) errors: a single CS error changes the rest of the sequence in nucleotide (NT) space. The point of SOLiD is that even with a relatively high CS error rate, your mappings will have a very low error rate in NT space thanks to the 2-base encoding. This is why you need specialised software for the mappings, such as BFAST or BWA; otherwise you will get very few alignments.
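To make the error-propagation point concrete, here is a minimal Python sketch (not taken from any tool mentioned in this thread). It assumes the usual description of 2-base encoding, in which a color is the XOR of two adjacent base codes with A=0, C=1, G=2, T=3; the function names `encode` and `decode` are made up for illustration. It decodes one read from this thread, then flips a single color and counts how many decoded bases change.

```python
# Base <-> 2-bit code; a SOLiD color is taken to be the XOR of two adjacent base codes.
BASES = "ACGT"

def encode(seq):
    """Return first base + color string for a nucleotide sequence."""
    codes = [BASES.index(b) for b in seq]
    colors = [str(a ^ b) for a, b in zip(codes, codes[1:])]
    return seq[0] + "".join(colors)

def decode(cs):
    """Decode primer base + colors into nucleotides (primer base dropped)."""
    last = BASES.index(cs[0])
    out = []
    for c in cs[1:]:
        last ^= int(c)          # apply the color transition to the previous base
        out.append(BASES[last])
    return "".join(out)

read = "G2203012023131303312303100"    # color-space read quoted in this thread
good = decode(read)

# Simulate one color-call error: change the fifth color from 0 to 1.
bad_cs = read[:5] + "1" + read[6:]
bad = decode(bad_cs)

# Every base from the miscalled color onwards differs after naive conversion.
mismatches = sum(a != b for a, b in zip(good, bad))
print(good)
print(bad)
print(mismatches)
```

A color-space-aware aligner can treat that as one color mismatch, whereas an aligner fed the converted read would see a long run of base mismatches.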
#12 |
Member
Location: PRC Join Date: May 2009
Posts: 33
If your SOLiD FASTQ looks like the record below, you can try this script to convert it to Solexa/Illumina FASTQ.
Code:
@BARB_20071114_2_YorubanMP-BC3_3_16_150_F3
T0220100010131232212020122
+BARB_20071114_2_YorubanMP-BC3_3_16_150_F3
15 21 27 26 24 5 23 18 26 21 11 25 25 19 8 4 25 8 24 7 4 15 18 19 15
Last edited by BENM; 08-30-2009 at 03:15 AM.
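For readers without the attachment, the quality part of such a conversion might look like the Python sketch below. It is not the script from this post. It assumes each space-separated integer is written as a single character with ASCII offset 64 (the Solexa/Illumina convention described later in this thread) and that negative scores, if present, are clamped to 0. The function name `qual_to_illumina` is hypothetical.

```python
def qual_to_illumina(qual_line, offset=64):
    """Encode space-separated integer qualities as ASCII characters.

    Assumption: each score becomes chr(score + offset); negative scores
    (if any) are clamped to 0 before encoding.
    """
    scores = [max(0, int(q)) for q in qual_line.split()]
    return "".join(chr(q + offset) for q in scores)

# Quality line from the record above.
quals = "15 21 27 26 24 5 23 18 26 21 11 25 25 19 8 4 25 8 24 7 4 15 18 19 15"
encoded = qual_to_illumina(quals)
print(encoded)
```

Whether the numeric SOLiD values should be rescaled before encoding depends on the downstream tool, so treat the direct offset as a starting point only.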
#13 |
Junior Member
Location: Canada Join Date: Aug 2009
Posts: 9
Hi BENM,
Thanks for providing the Perl script. I am using the SOLiD files from the 1000 Genomes Project, and the data look like this:
@VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114
G2203012023131303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
With your script, the quality line gets lost. I wonder whether, in this case, the original quality line can be kept unchanged apart from removing the first character. I am new to SOLiD data, so I want to double-check with you. It may be useful for others if you can modify your script to accommodate this format.
Thanks
#14 | |
Member
Location: PRC Join Date: May 2009
Posts: 33
Quote:
samt's question was "Convert SOLiD fastq to Illumina fastq", and Illumina FASTQ differs from standard (Sanger) FASTQ in its quality format. The syntax of the Solexa/Illumina read format is almost identical to the FASTQ format, but the qualities are scaled differently. Given a character $sq, the following Perl code gives the Phred quality $Q:
$Q = 10 * log(1 + 10 ** ((ord($sq) - 64) / 10.0)) / log(10);
The ASCII characters in Solexa FASTQ mean:
Code:
CHAR  DEC  QUALITY
A     65   1
B     66   2
C     67   3
D     68   4
E     69   5
F     70   6
G     71   7
H     72   8
I     73   9
J     74   10
K     75   11
L     76   12
M     77   13
N     78   14
O     79   15
P     80   16
Q     81   17
R     82   18
S     83   19
T     84   20
U     85   21
V     86   22
W     87   23
X     88   24
Y     89   25
Z     90   26
[     91   27
\     92   28
]     93   29
^     94   30
_     95   31
`     96   32
a     97   33
b     98   34
c     99   35
d     100  36
e     101  37
f     102  38
g     103  39
h     104  40
;     59   -5
<     60   -4
=     61   -3
>     62   -2
?     63   -1
@     64   0
Code:
SANGER_CHAR  SOLEXA_DEC  SOLEXA_QUALITY
!  0    -64
!  1    -63
!  2    -62
!  3    -61
!  4    -60
!  5    -59
!  6    -58
!  7    -57
!  8    -56
!  9    -55
!  10   -54
!  11   -53
!  12   -52
!  13   -51
!  14   -50
!  15   -49
!  16   -48
!  17   -47
!  18   -46
!  19   -45
!  20   -44
!  21   -43
!  22   -42
!  23   -41
!  24   -40
!  25   -39
!  26   -38
!  27   -37
!  28   -36
!  29   -35
!  30   -34
!  31   -33
!  32   -32
!  33   -31
!  34   -30
!  35   -29
!  36   -28
!  37   -27
!  38   -26
!  39   -25
!  40   -24
!  41   -23
!  42   -22
!  43   -21
!  44   -20
!  45   -19
!  46   -18
!  47   -17
!  48   -16
!  49   -15
!  50   -14
!  51   -13
!  52   -12
!  53   -11
!  54   -10
"  55   -9
"  56   -8
"  57   -7
"  58   -6
"  59   -5
"  60   -4
#  61   -3
#  62   -2
$  63   -1
$  64   0
%  65   1
%  66   2
&  67   3
&  68   4
'  69   5
(  70   6
)  71   7
*  72   8
+  73   9
+  74   10
,  75   11
-  76   12
.  77   13
/  78   14
0  79   15
1  80   16
2  81   17
3  82   18
4  83   19
5  84   20
6  85   21
7  86   22
8  87   23
9  88   24
:  89   25
;  90   26
<  91   27
=  92   28
>  93   29
?  94   30
@  95   31
A  96   32
B  97   33
C  98   34
D  99   35
E  100  36
F  101  37
G  102  38
H  103  39
I  104  40
J  105  41
K  106  42
L  107  43
M  108  44
N  109  45
O  110  46
P  111  47
Q  112  48
R  113  49
S  114  50
T  115  51
U  116  52
V  117  53
W  118  54
X  119  55
Y  120  56
Z  121  57
[  122  58
\  123  59
]  124  60
^  125  61
_  126  62
`  127  63
a  128  64
Code:
# Solexa->Sanger quality conversion table
my @conv_table;
for (-64..64) {
    $conv_table[$_+64] = chr(int(33 + 10*log(1+10**($_/10.0))/log(10)+.499));
}
I am trying to write a universal script for converting between the Solexa/Illumina, SOLiD/ABI, 454/Roche and 3730/Sanger formats for different purposes, but I need to know your requirements first; after that I will share it with you all. I hope this answers your question.
BTW, I have attached SOLiD2std.pl for your case; it is SOLiD2Solexa.pl with a small change.
Last edited by BENM; 03-26-2012 at 08:40 PM.
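For anyone who prefers to check the numbers outside Perl, the same formula can be sketched in Python as below. It is a direct transcription of the Solexa-to-Phred expression and the conversion-table loop from this post, not a separate method. The function names `solexa_to_phred` and `solexa_char_to_sanger_char` are made up for illustration.

```python
import math

def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality to a Phred-scaled quality."""
    return 10 * math.log10(1 + 10 ** (q_solexa / 10.0))

def solexa_char_to_sanger_char(ch):
    """Map one Solexa FASTQ character (offset 64) to a Sanger one (offset 33)."""
    q_solexa = ord(ch) - 64
    return chr(int(33 + solexa_to_phred(q_solexa) + 0.499))

# Conversion table indexed by Solexa quality + 64, mirroring the Perl loop.
conv_table = [chr(int(33 + solexa_to_phred(q) + 0.499)) for q in range(-64, 65)]

print(solexa_char_to_sanger_char("@"), solexa_char_to_sanger_char("h"))
```

At high qualities the two scales almost coincide, so the conversion is close to a plain shift from offset 64 to offset 33; the difference only matters for low scores.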
#15 |
Junior Member
Location: Canada Join Date: Aug 2009
Posts: 9
Hi BENM,
Thank you for responding with the new information. As it happens, I need to convert SOLiD color-space sequences in FASTQ to Solexa format, for both the sequence and the quality encoding. I believe the quality scores are already ASCII-encoded (see the sequence entry I copied in my first post), which is why I thought the quality line could be kept unchanged for my use. Am I right about this? In any case, I think a tool for converting between the data formats of different platforms would be useful for us. Thanks again.
#16 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
Quote:
Why would you want to convert color space to sequence space before alignment? Basically, why do you want SOLiD color space data in Illumina format? Bowtie does not work with color space (yet) and no amount of "input hacking" will get it to work right now.
#17 |
Member
Location: LONDON, UNITED KINGDOM Join Date: Jan 2009
Posts: 44
Hi BENM,
I edited your SOLid2Std.pl script to include some extra color-space mapping code that was not considered originally. I wanted to include the following base-space mapping: basically, any base (A, T, C, G) followed by a '4', '5' or '.' becomes 'N'. The 'N' to 'N' transition is also represented by different color-space numbers (0, 1, 2, 3, 6, '.').
Code:
A4: N
A.: N
A5: N
C4: N
C.: N
C5: N
G4: N
G.: N
G5: N
T4: N
T.: N
T5: N
N5: A C T or G
N.: N
N6: N
N0: N
N1: N
N2: N
N3: N
I edited the following part:
Code:
# SOLiD color code
my @code = ([0,1,2,3,'.',4,5],[1,0,3,2,'.',4,5],[2,3,0,1,'.',4,5],[3,2,1,0,'.',4,5],[5,5,5,5,'.',6,0],[5,5,5,5,1,2,3],[5,5,5,5,1,2,3]);
my @bases = qw(A C G T N N N);
my %decode = ();
foreach my $i(0..7) {
    foreach my $j(0..7) {
        $decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
    }
}
However, there is an error message when I run the script, although the error does not prevent it from working, which is good. Perl gives me the following messages:
Code:
Use of uninitialized value in hash element at SOLid2Std.pl line 49.
Use of uninitialized value within @bases in hash element at SOLid2Std.pl line 49.
Line 49 is:
Code:
$decode{$code[$i]->[$j]} -> {$bases[$i]} = $bases[$j];
Last edited by inesdesantiago; 10-05-2009 at 07:57 AM. Reason: to mention line 49
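A note on those warnings: both @code and @bases hold seven elements (indices 0 to 6), while the loops run over 0..7, so index 7 is undefined in each. That appears consistent with the two "uninitialized value" messages. The Python sketch below rebuilds the same lookup by iterating over the actual rows and columns instead of a fixed range. It copies the matrix from this post as-is and makes no claim that the extended mapping itself is the right one.

```python
# Rows: previous base; columns: next base; entries: color (copied from the post).
code = [
    [0, 1, 2, 3, ".", 4, 5],
    [1, 0, 3, 2, ".", 4, 5],
    [2, 3, 0, 1, ".", 4, 5],
    [3, 2, 1, 0, ".", 4, 5],
    [5, 5, 5, 5, ".", 6, 0],
    [5, 5, 5, 5, 1, 2, 3],
    [5, 5, 5, 5, 1, 2, 3],
]
bases = ["A", "C", "G", "T", "N", "N", "N"]

# decode[color][previous_base] -> next base
decode = {}
for i, row in enumerate(code):           # 7 rows, so i runs 0..6
    for j, color in enumerate(row):      # 7 columns, so j runs 0..6
        decode.setdefault(str(color), {})[bases[i]] = bases[j]

print(decode["1"]["A"], decode["4"]["G"])
```

In the Perl version the equivalent change would be to bound the loops by the array lengths rather than by the literal 7.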
#18 | |
Member
Location: PRC Join Date: May 2009
Posts: 33
Quote:
Thank you for your opinions. In color space, if one color cannot be recognized by the SOLiD™ System, the bases after it become uncertain too. So when converting color space to nucleotide bases, the read decodes to "N" instead of a real base from that point on. For example:
Code:
@example1
G2203012023131303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
@example2
G220301.023131303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
@example3
G2203012023141303312303100
+
!611%%(-+%*.&*.,&2,,'%()31
@example4
G2203012023151303512303100
+
!611%%(-+%*.&*.,&2,,'%()31
Code:
A4: N
A.: N
A5: N
C4: N
C.: N
C5: N
G4: N
G.: N
G5: N
T4: N
T.: N
T5: N
N5: A C T or G
N.: N
N6: N
N0: N
N1: N
N2: N
N3: N
Code:
@example1
AGGCCAGGATGCATTATGATTACCC
+
611%%(-+%*.&*.,&2,,'%()31
@example2
AGGCCANNNNNNNNNNNNNNNNNNN
+
611%%(-+%*.&*.,&2,,'%()31
@example3
AGGCCAGGATGNNNNNNNNNNNNNN
+
611%%(-+%*.&*.,&2,,'%()31
@example4
AGGCCAGGATGNNNNNGTCGGCAAA
+
611%%(-+%*.&*.,&2,,'%()31
Code:
$current_base = $decode{$colors[$i]}->{$last_base};
Code:
if (($last_base=~/N/i)&&($colors[$i]==5)) {
    $current_base = $bases[int(rand(@bases))];
} else {
    $current_base = (exists $decode{$colors[$i]}->{$last_base}) ? $decode{$colors[$i]}->{$last_base} : "N";
}
BTW, because SOLiD reads are short, ultra short, most people will discard reads that contain ".", "4", "5" or "6" in color space. I think that is acceptable given the ultra-high throughput of the SOLiD™ System; we don't need these uncertain or low-quality reads.
Last edited by BENM; 10-06-2009 at 10:34 PM.
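The decoding rule in examples 1 to 3 can be sketched in Python as follows. This is not the attached Perl script: it uses only the four standard colors, turns any other symbol into "N", and keeps emitting "N" for the rest of the read. It deliberately leaves out the random "N5" restart used in example 4. The names `DECODE`, `cs_to_nt` and `convert_record` are made up for illustration.

```python
# Standard SOLiD transitions: DECODE[color][previous_base] -> next base.
DECODE = {
    "0": {"A": "A", "C": "C", "G": "G", "T": "T"},
    "1": {"A": "C", "C": "A", "G": "T", "T": "G"},
    "2": {"A": "G", "C": "T", "G": "A", "T": "C"},
    "3": {"A": "T", "C": "G", "G": "C", "T": "A"},
}

def cs_to_nt(cs_read):
    """Decode primer base + colors; unknown colors and everything after become N."""
    last = cs_read[0]
    out = []
    for color in cs_read[1:]:
        # Unknown color, or previous base already N: the lookup falls back to N.
        last = DECODE.get(color, {}).get(last, "N")
        out.append(last)
    return "".join(out)

def convert_record(name, cs_read, qual):
    """Return a 4-line FASTQ record, dropping the primer base and its quality."""
    return "\n".join(["@" + name, cs_to_nt(cs_read), "+", qual[1:]])

print(convert_record("example2", "G220301.023131303312303100",
                     "!611%%(-+%*.&*.,&2,,'%()31"))
```

Dropping the first quality character matches the input and output records shown in this post, where a 26-character quality line becomes 25 characters.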
#19 |
Member
Location: Beijing Join Date: Sep 2009
Posts: 17
Hi BENM,
Thank you for your script. I tried your SOLiD2Std.pl on data like this:
@exa1
T1011122220100230032132.2111111002.1
+
!)+%.*%*+2'0%%%-%+%*5'%!%9+'%+<+0%!%
@exa2
T0101233211103200232333.2111211002.1
+
!,.+'+')'390%%%%%%%'%%%!-<++++<99%!%
@exa3
T0312202213101213131111.1110131102.1
+
!93<*/18+%:9%+075*%:;+6!3<26%/<%-%!%
and the result looks like this:
@exa1
GGTGTCTCTTGGGATTTAGTAGNNNNNNNNNNNNN
+
)+%.*%*+2'0%%%-%+%*5'%!%9+'%+<+0%!%
@exa2
TGGTCGCTGTGGCTTTCGATATNNNNNNNNNNNNN
+
,.+'+')'390%%%%%%%'%%%!-<++++<99%!%
@exa3
TACTCCTCATGGTCATGCACACNNNNNNNNNNNNN
+
93<*/18+%:9%+075*%:;+6!3<26%/<%-%!%
It seems that every base after the first dot "." is converted to "N". Is that right? Thank you.
#20 | |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285