I whipped up a couple of 'stupid' scripts for converting between base space and colour space. They should work with a few common sequence formats, as long as each sequence is on a single line.
There are two scripts: cs2base.pl converts to base space, and base2cs.pl converts to colour-space.
So, I said these should work with most things that get thrown at them:
The output is a bit quirky, but might be sufficient as a substitute for doing things 'by hand'.
Note that these are slow due to heavy use of regular expression substitution, and a loop over the entire line for each substitution (in the worst case). The 'proper' way would be to iterate over the string from each line, and call a function that depends on the previous character and the current character.
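A minimal sketch of that single-pass idea, for comparison with the regex scripts. The sub names (`base2cs`, `cs2base`) and the lookup tables (`%code`, `@base`) are illustrative names, not part of the original scripts. The sketch assumes that the colour between two bases is the XOR of their 2-bit codes (A=0, C=1, G=2, T=3), which agrees with the substitution tables in the scripts. Anything that is not a base or a colour (N, X, '.') is copied through unchanged and breaks the chain, so the output should show the same quirks.

```perl
use warnings;
use strict;

# Assumed 2-bit encoding: the colour between two bases is the XOR of their codes.
my %code = (A => 0, C => 1, G => 2, T => 3);
my @base = qw(A C G T);

# Base-space to colour-space: each base that follows another base
# is replaced by the colour of that pair.
sub base2cs {
    my ($seq) = @_;
    my ($prev, $out) = ('', '');
    for my $ch (split //, $seq) {
        if (exists $code{$prev} && exists $code{$ch}) {
            $out .= $code{$prev} ^ $code{$ch};
        } else {
            $out .= $ch;    # first base, N, X, '.', etc. pass through
        }
        $prev = $ch;        # always track the original (unencoded) character
    }
    return $out;
}

# Colour-space to base-space: each colour that follows a known base
# is decoded into the next base, which then becomes the previous base.
sub cs2base {
    my ($seq) = @_;
    my ($prev, $out) = ('', '');
    for my $ch (split //, $seq) {
        if (exists $code{$prev} && $ch =~ /^[0123]$/) {
            $ch = $base[$code{$prev} ^ ($ch + 0)];
        }
        $out .= $ch;
        $prev = $ch;        # track the decoded character
    }
    return $out;
}

print base2cs("ACGT"), "\n";    # prints A131
print cs2base("A131"), "\n";    # prints ACGT
```

Each character is visited once, so the work per line grows linearly with its length instead of rescanning the line for every substitution. To use it as a filter, the subs could be called from the same `while(<>)` loop as the scripts, after removing the newline with `chomp` and with the same pass-through test for non-sequence lines.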
To base space:
Code:

    # cs2base.pl -- stupidly convert from colour-space to base-space
    # Author: David Eccles (gringer) 2011 <[email protected]>

    use warnings;
    use strict;

    while(<>){
      # pass through any line that is not purely sequence (e.g. FASTA headers)
      if(!(/^[ACGT0123NX\.]+$/)){
        print;
        next;
      }
      # each substitution decodes one base+colour pair per pass;
      # repeat until no base is followed by a colour
      while(/[ACGT][0123]/){
        s/A0/AA/;s/C0/CC/;s/G0/GG/;s/T0/TT/;
        s/A1/AC/;s/C1/CA/;s/G1/GT/;s/T1/TG/;
        s/A2/AG/;s/C2/CT/;s/G2/GA/;s/T2/TC/;
        s/A3/AT/;s/C3/CG/;s/G3/GC/;s/T3/TA/;
      }
      print;
    }
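To colour-space:
Code:

    # base2cs.pl -- stupidly convert from base-space to colour-space
    # Author: David Eccles (gringer) 2011 <[email protected]>

    use warnings;
    use strict;

    while(<>){
      # pass through any line that is not purely sequence (e.g. FASTA headers)
      if(!(/^[ACGT0123NX\.]+$/)){
        print;
        next;
      }
      # Encode from right to left: a pair of bases is only rewritten once the
      # character after it is not a base (the trailing newline, N, or a colour).
      # The loop test matches exactly what the substitutions can rewrite, so the
      # loop cannot spin forever on input that no substitution matches.
      while(/[ACGT][ACGT][^ACGT]/){
        s/AA([^ACGT])/A0$1/;s/CC([^ACGT])/C0$1/;s/GG([^ACGT])/G0$1/;s/TT([^ACGT])/T0$1/;
        s/AC([^ACGT])/A1$1/;s/AG([^ACGT])/A2$1/;s/AT([^ACGT])/A3$1/;
        s/CA([^ACGT])/C1$1/;s/CG([^ACGT])/C3$1/;s/CT([^ACGT])/C2$1/;
        s/GA([^ACGT])/G2$1/;s/GC([^ACGT])/G3$1/;s/GT([^ACGT])/G1$1/;
        s/TA([^ACGT])/T3$1/;s/TC([^ACGT])/T2$1/;s/TG([^ACGT])/T1$1/;
      }
      print;
    }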
Code:

    $ echo "AGCGAGCTCAGCATCAGGCATCGACTAGCATCAACACTAC" | ~/scripts/base2cs.pl
    A233223221231321203132321232313210111231
Code:

    $ head * | ~/scripts/base2cs.pl
    ==> 454_test.fasta <==
    >F6AJIXP02GO67R length=38 xy=2630_1077 region=2 run=R_2009_11_25_10_02_47_
    T1010213211122112101320323211013121220
    >F6AJIXP02GT62X length=48 xy=2687_0711 region=2 run=R_2009_11_25_10_02_47_
    T30131301202131101133221101010100332322301212NG0
    >F6AJIXP02FVOVL length=122 xy=2294_0543 region=2 run=R_2009_11_25_10_02_47_
    T30220331212110313033112201023322020111132131122321201311301
    T0101220220223232211130211010011010NC2322NA33NANA03301213332
    T0
    >F6AJIXP02JSYRL length=43 xy=3903_1391 region=2 run=R_2009_11_25_10_02_47_
    T322330321330113123020110202310220011220032

    ==> illumina_test.fasta <==
    >GAPC01_0005:7:1:1120:16293#0/1
    C03011NA133100113311333300NNNNNNNNNN
    >GAPC01_0005:7:1:1120:10747#0/1
    A02130003032213002030012021223310120
    >GAPC01_0005:7:1:1120:20021#0/1
    C00103010200003011103112111131201121
    >GAPC01_0005:7:1:1120:11773#0/1
    T33030102300300220300022022312021232

    ==> phix_genome.fasta <==
    >gi|9626372|ref|NC_001422.1| Enterobacteria phage phiX174, complete genome
    G221000332332020131213312202103011120023023330022123122123200003033220
    G233002312020303123121320110031320303003232021102121321033020003122200
    A302321023320201333123223222023222031200133210200233013210123013230221
    T210000121213331010231220222021103203033313201031131023121020212101003
    G233312212111300011021310132222302222011012113000300022233110230312332
    T122120323132110210101123033201302200321312212102103121201103203131310
    T201221033200103022233030232213021203202213031000102300301032022312300
    C323000221213221301100210010230132312121033222231132231233213310122032
    T133100331013133210212001110023310022332002021322021101221003301321303

    ==> test.fastq <==
    @H134:1:1201:1131:1970#0/1
    C21220312301200NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +
    babeeeeegggggghBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @H134:1:1201:1155:1987#0/1
    C10132022222200322132222211013023110220311101320312
    +
    b_beeeeegggggiiiighiiiiiiihihiiiihhfhghiiiiieegdghd
    @H134:1:1201:1086:1989#0/1
    C03203203022110330222NG003330333220331201211330NA01