#1 |
Member
Location: USA Join Date: Jan 2009
Posts: 96
|
I have a quick, trivial question: what's the fastest/easiest way to "decode" or convert csfasta to fasta? I'm only doing this for a handful of sequences at a time for code-checking. Thanks in advance.
|
#2 |
Member
Location: Cambridge, MA Join Date: Feb 2008
Posts: 82
|
Are you looking to benchmark methods, or do you just need to decode a set of sequences? Contact me directly if you would like a Python module for SOLiD sequence manipulation, including csfasta --> fasta.
|
#3 |
Member
Location: India Join Date: Oct 2008
Posts: 36
|
Do you mean converting color-space sequences to base-space sequences?
|
#4 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
The ABI 'corona lite' programs (which are free) include 'encodeFasta.py' which will encode and decode to/from color-space, base-space and that abomination 'double-encoded'-space.
|
#5 |
Member
Location: USA Join Date: Jan 2009
Posts: 96
|
#6 | |
Member
Location: US Join Date: Apr 2009
Posts: 52
|
Quote:
ImportError: No module named agapython.util.Dibase
Where do I get the module? I ran the code on both Linux (Ubuntu) and a Mac terminal, and neither works.
|
|
#7 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
The module should come with corona lite. I suspect that your corona lite environment is not set up properly. From the README:

    3) Configure your environment
       * For csh/tcsh:
           % setenv CORONAROOT <INSTALL_DIR>/corona_lite
           % source $CORONAROOT/etc/profile.d/corona.csh
         For sh/ksh/bash:
           % export CORONAROOT=<INSTALL_DIR>/corona_lite
           % source $CORONAROOT/etc/profile.d/corona.sh
       * Remember to update your shell's init script (.cshrc, .bashrc, etc.) for future sessions with Corona Lite.
|
#8 |
Junior Member
Location: Tübingen Join Date: Jun 2009
Posts: 2
|
When I tried to register at ABI to download the CORONA-lite program, I did not receive a confirmation. So I used the colour scheme given in www.iscb.org/uploaded/css/36/12104.pdf to write a Perl script that does the conversion. As far as I understand, the first base in the csfasta is part of the adaptor sequence and should therefore be omitted from the fasta. This is done by setting the shift parameter to 1 (0 would repeat the first base):

    ./csfasta2fasta.pl seqence.csfasta 1 > output.fasta

If anyone could tell me whether this does approximately the same as the CORONA-lite conversion script, I would be happy.
|
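For comparison, a minimal Python sketch of this kind of conversion. It is neither the attached Perl script nor the CORONA-lite code. It assumes the standard SOLiD two-base colour code (with A, C, G, T numbered 0 to 3, the colour is the XOR of adjacent bases) and treats anything other than colours 0-3 as unresolvable. The name `decode_csfasta_read` and the exact meaning of its `shift` argument are assumptions made for illustration: here `shift=1` drops the leading base and `shift=0` keeps it once.

```python
# Hypothetical helper; not taken from csfasta2fasta.pl or CORONA-lite.
BASES = "ACGT"

def decode_csfasta_read(cs_read, shift=1):
    """Decode one colour-space read (e.g. 'T0123') into base space.

    shift=1 drops the leading primer/adaptor base; shift=0 keeps it.
    """
    prev = cs_read[0].upper()
    decoded = [prev]
    for color in cs_read[1:]:
        if prev in BASES and color in "0123":
            # With A,C,G,T numbered 0..3, next base = previous base XOR colour.
            prev = BASES[BASES.index(prev) ^ int(color)]
        else:
            # '.', 4, 5, 6 or an unknown previous base cannot be resolved.
            prev = "N"
        decoded.append(prev)
    return "".join(decoded)[shift:]

print(decode_csfasta_read("G1131103113113", shift=0))  # GTGCACCGTGCACG
print(decode_csfasta_read("G1131103113113", shift=1))  # TGCACCGTGCACG
```

Once a colour cannot be resolved, every later base in this sketch becomes N, because each base depends on the one before it.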
#9 | |
Junior Member
Location: the People's Republic of China Join Date: Apr 2009
Posts: 1
|
Quote:
P.S. Translating color space to base space loses the independence of adjacent color calls. For example, a single miscalled color in the middle of a read will corrupt every base that follows it.
|
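The propagation effect can be seen in a small Python sketch. It assumes the standard two-base colour code and a read containing only colours 0-3; the `decode` helper and the example reads are made up for illustration.

```python
BASES = "ACGT"

def decode(cs_read):
    """Decode a colour-space read that contains only colours 0-3."""
    bases = [cs_read[0]]
    for color in cs_read[1:]:
        # Next base = previous base XOR colour, with A,C,G,T numbered 0..3.
        bases.append(BASES[BASES.index(bases[-1]) ^ int(color)])
    return "".join(bases)

good = "G1131103113113"
bad = "G1131113113113"   # the sixth colour call is miscalled (0 -> 1)

good_bs, bad_bs = decode(good), decode(bad)
differing = [i for i, (a, b) in enumerate(zip(good_bs, bad_bs)) if a != b]
print(good_bs)    # GTGCACCGTGCACG
print(bad_bs)     # GTGCACATGTACAT
print(differing)  # [6, 7, 8, 9, 10, 11, 12, 13]
```

A single wrong colour changes every base from that point to the end of the read, which is why mapping is usually done in colour space rather than after conversion.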
|
#10 |
Junior Member
Location: CA Join Date: Jul 2009
Posts: 8
|
Thank you for that tool.
What the hell is double-encoded fasta?
|
#11 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
'Double-encoded' is where a color-space file is encoded as ACGT. That ACGT is not base space but a way to encode the 0123 of color-space into something that non-color-space-aware programs can use.
As an example, given the base-space sequence:

    GTGCACCGTGCACG

This encodes into color-space:

    G1131103113113

And can be double-encoded into:

    GCCTCCATCCTCCT

Double-encoding is simple: 0 goes to 'A', 1 to 'C', etc. As I mentioned, it is simply a way to make color-space into ACGT. I call it an abomination since it means nothing biologically useful yet looks like a biological sequence. It can lead to all sorts of false results if one does not realize what one is dealing with.
|
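The example above can be reproduced with a short Python sketch. It assumes the input contains only A, C, G and T, and that the "etc." continues as 2 to 'G' and 3 to 'T'; the function names are hypothetical and this is not the encodeFasta.py implementation.

```python
BASES = "ACGT"

def encode_colorspace(bs_seq):
    """Base space -> colour space, keeping the first base as the anchor."""
    colors = [str(BASES.index(a) ^ BASES.index(b))
              for a, b in zip(bs_seq, bs_seq[1:])]
    return bs_seq[0] + "".join(colors)

def double_encode(cs_read):
    """Rewrite colours 0,1,2,3 as A,C,G,T; the leading base is left alone."""
    return cs_read[0] + cs_read[1:].translate(str.maketrans("0123", "ACGT"))

cs = encode_colorspace("GTGCACCGTGCACG")
print(cs)                 # G1131103113113
print(double_encode(cs))  # GCCTCCATCCTCCT
```

The double-encoded string has the same length and alphabet as a real sequence, which is exactly why it is easy to mistake for one.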
#12 |
Junior Member
Location: CA Join Date: Jul 2009
Posts: 8
|
Thanks.
Yes, I can confirm that it leads to biological confusion.
|
#13 |
Junior Member
Location: CA Join Date: Jul 2009
Posts: 8
|
I modified the conversion to avoid building that huge hash.
I was hitting memory limits the old way.
|
#14 |
Junior Member
Location: Indiana Join Date: Sep 2009
Posts: 5
|
The included colorspace -> basespace mapping is missing a few entries. Basically, anything that includes a '4' or '.' is an N.
(Python format)

    # Key = previous base + colour; value = decoded next base.
    __colorspace = {
        'A0': 'A', 'A1': 'C', 'A2': 'G', 'A3': 'T', 'A4': 'N', 'A.': 'N',
        'C0': 'C', 'C1': 'A', 'C2': 'T', 'C3': 'G', 'C4': 'N', 'C.': 'N',
        'G0': 'G', 'G1': 'T', 'G2': 'A', 'G3': 'C', 'G4': 'N', 'G.': 'N',
        'T0': 'T', 'T1': 'G', 'T2': 'C', 'T3': 'A', 'T4': 'N', 'T.': 'N',
        'N0': 'N', 'N1': 'N', 'N2': 'N', 'N3': 'N', 'N.': 'N',
    }
|
#15 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
Actually, you are also missing '5' and '6'. And what about base-space symbols that are not N (e.g., R, Y, etc.)? Using a table like the above -- which is what the ABI-provided encodeFasta.py program uses -- is a poor way of handling the conversion, IMHO, unless you want to force every non-0,1,2,3 color to a 4 and every non-A,C,G,T base to an N.
|
#16 |
Junior Member
Location: Indiana Join Date: Sep 2009
Posts: 5
|
I actually just pulled the table from ABI. The attached Perl scripts didn't handle Ns at all.
I'm very new to this and haven't run across 5 or 6 in our data. What do they stand for? (Ambiguity codes?)
|
#17 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
There are 3 transitions: N to N, N to known (ACGT), and known to N. These transitions can be represented by 3 different color-space numbers, in this case '4', '5', and '6'. Off the top of my head I do not remember which is which. Also, only some of the ABI programs actually work with this concept. encodeFasta.py, which the ABI SNP-calling manual says to use, does not handle any of the cases. It makes me wonder at times if ABI even uses their own programs on any real-life data. :-(
|
#18 |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
I was sitting here avoiding work -- I have an intractable problem, ugh! -- wondering where I had seen that 4,5,6 color-space encoding. So I looked it up. The 'dna_subroutines.pm' (Perl library, obviously) has the following, which is used by the 'convert_to_dibase' subroutine. At least of the 26 programs in the 'bin' directory use the 'dna_subroutines.pm' module, although I am not certain if any use the convert_to_dibase routine. None seem to do so directly. None of the Python routines use the 4,5,6 color-space encoding.
So ... using a '4' is probably good enough.

    $color{AN} = 4;
    $color{CN} = 4;
    $color{GN} = 4;
    $color{TN} = 4;
    $color{NA} = 5;
    $color{NC} = 5;
    $color{NG} = 5;
    $color{NT} = 5;
    $color{NN} = 6;
|
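A rough Python equivalent of that table, as a sketch only: it is not a port of `convert_to_dibase`. It assumes upper-case input and treats every non-A,C,G,T symbol as N; the names `color_for` and `encode_with_n` are hypothetical.

```python
BASES = "ACGT"

def color_for(a, b):
    """Colour for the transition a -> b, with 4/5/6 for transitions involving N."""
    a_known, b_known = a in BASES, b in BASES
    if a_known and b_known:
        # Standard two-base code: XOR of the bases numbered A,C,G,T = 0..3.
        return str(BASES.index(a) ^ BASES.index(b))
    if a_known:
        return "4"   # known -> N, as in $color{AN} = 4
    if b_known:
        return "5"   # N -> known, as in $color{NA} = 5
    return "6"       # N -> N, as in $color{NN} = 6

def encode_with_n(bs_seq):
    """Base space -> colour space, keeping the first base as the anchor."""
    return bs_seq[0] + "".join(color_for(a, b) for a, b in zip(bs_seq, bs_seq[1:]))

print(encode_with_n("GTNAC"))   # G1451
print(encode_with_n("ACNNGT"))  # A14651
```

Collapsing 5 and 6 to 4 in `color_for` gives the simpler "just use 4" scheme.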
#19 |
Member
Location: LONDON, UNITED KINGDOM Join Date: Jan 2009
Posts: 44
|
How come NA, NC, NT, NG all have the same colorspace code '5'?
This means that once you have an N for a given base, you never know what the next base is? You don't know if it is A, G, C or T ... Right?
Ines
Last edited by inesdesantiago; 10-05-2009 at 07:58 AM. Reason: typo
|
#20 | |
Rick Westerman
Location: Purdue University, Indiana, USA Join Date: Jun 2008
Posts: 1,104
|
Quote:
[Note: CS reads off of the sequencer will have a simple period (.) when there is an unknown and 0 through 3 for known ... 4,5,6s are only used when computationally processing BS->CS->BS translations]
Let's go for an example. Say we have a (poor) reference sequence that in BS is:

    TCACGNGTCAAC

Translating this into CS so that it can be mapped:

    T21134412101

Computationally, if we tried to convert this CS back to BS we would get:

    TCACGNNNNNNN

On the other hand, if we had an actual CS read from the sequencer such as:

    T21130012101

we can certainly map that actual read to our reference, allowing for mismatches. If we had enough reads coming off the sequencer that were all the same as the above (or, better, had slightly different start points and also overlapped the region in question), then we could say with confidence that while our reference sequence indicated an 'N' in the position, our actual sequenced organism has a 'G'.
Note that you can get into trouble with the above if your reads could potentially map to other parts of your organism and those parts are not part of your reference. This is a major reason for wanting different start sites and long reads. So tread with care.
|
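The round trip in this example can be checked with a small Python sketch. It assumes the simplified scheme in which any transition touching a non-A,C,G,T base is written as 4, which is what the example uses; `encode` and `decode` are illustrative names, not ABI functions.

```python
BASES = "ACGT"

def encode(bs_seq):
    """Base space -> colour space; any transition touching a non-ACGT base is 4."""
    out = [bs_seq[0]]
    for a, b in zip(bs_seq, bs_seq[1:]):
        if a in BASES and b in BASES:
            out.append(str(BASES.index(a) ^ BASES.index(b)))
        else:
            out.append("4")
    return "".join(out)

def decode(cs_read):
    """Colour space -> base space; everything after an unresolvable colour is N."""
    prev = cs_read[0]
    out = [prev]
    for color in cs_read[1:]:
        if prev in BASES and color in "0123":
            prev = BASES[BASES.index(prev) ^ int(color)]
        else:
            prev = "N"
        out.append(prev)
    return "".join(out)

reference_cs = encode("TCACGNGTCAAC")
print(reference_cs)            # T21134412101
print(decode(reference_cs))    # TCACGNNNNNNN
print(decode("T21130012101"))  # TCACGGGTCAAC
```

The last line shows the sequenced read resolving the reference 'N' to a 'G', as described above.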
|