SEQanswers

Old 03-26-2009, 11:29 AM   #1
doxologist
Member
 
Location: USA

Join Date: Jan 2009
Posts: 96
Default csfasta --> fasta conversion

I have a quick, trivial question:
what's the fastest/easiest way to "decode" or convert csfasta to fasta? I'm just doing this for a handful of sequences at a time for code-checking.

thanks in advance.
Old 03-26-2009, 07:01 PM   #2
lgoff
Member
 
Location: Cambridge, MA

Join Date: Feb 2008
Posts: 82
Default Comparing?

Are you looking to benchmark methods or just need to decode a set of sequences? Contact me directly if you would like a python module for SOLiD sequence manipulation(s) including csfasta --> fasta.
Old 03-26-2009, 09:01 PM   #3
Rao
Member
 
Location: India

Join Date: Oct 2008
Posts: 36
Default

You mean converting color-space sequences to base-space sequences?
Old 03-27-2009, 05:20 AM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

The ABI 'corona lite' programs (which are free) include 'encodeFasta.py' which will encode and decode to/from color-space, base-space and that abomination 'double-encoded'-space.
Old 03-27-2009, 07:57 AM   #5
doxologist
Member
 
Location: USA

Join Date: Jan 2009
Posts: 96
Default

Quote:
Originally Posted by lgoff View Post
Are you looking to benchmark methods or just need to decode a set of sequences? Contact me directly if you would like a python module for SOLiD sequence manipulation(s) including csfasta --> fasta.
just for trivial conversion... decode
Old 05-06-2009, 12:27 PM   #6
jsun529
Member
 
Location: US

Join Date: Apr 2009
Posts: 52
Cool

Quote:
Originally Posted by westerman View Post
The ABI 'corona lite' programs (which are free) include 'encodeFasta.py' which will encode and decode to/from color-space, base-space and that abomination 'double-encoded'-space.
I get an error message when I run that code:
ImportError: No module named agapython.util.Dibase

Where do I get the module? I ran the code on both Linux (Ubuntu) and a Mac terminal, and neither works.
Old 05-07-2009, 04:55 AM   #7
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

The module should come with corona lite. I suspect that your corona lite environment is not set up properly. From the README:

3) Configure your environment *

For csh/tcsh:
% setenv CORONAROOT <INSTALL_DIR>/corona_lite
% source $CORONAROOT/etc/profile.d/corona.csh

For sh/ksh/bash:
% export CORONAROOT=<INSTALL_DIR>/corona_lite
% source $CORONAROOT/etc/profile.d/corona.sh

* Remember to update your shell's init script (.cshrc, .bashrc,
etc.) for future sessions with Corona Lite.
Old 06-23-2009, 04:15 AM   #8
roedel
Junior Member
 
Location: Tübingen

Join Date: Jun 2009
Posts: 2
Default csfasta -> fasta

When I tried to register at ABI to download the CORONA-lite program, I did not receive a confirmation. Then I used the colour scheme given in

www.iscb.org/uploaded/css/36/12104.pdf

to write a perl script that does the conversion. As far as I understood, the first base in the csfasta is part of the adaptor sequence and should therefore be omitted in the fasta. This can be triggered by setting the shift parameter to 1 (0 would repeat the first base).

./csfasta2fasta.pl seqence.csfasta 1 > output.fasta

If anyone could tell me if this does approximately the same as the CORONA-lite conversion script, I would be happy.
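
For readers without the attachment, the conversion described above can be sketched in a few lines of Python. This is not the attached Perl script or the Corona Lite code, only a minimal illustration. The names `NEXT_BASE`, `decode_colors` and `csfasta_to_fasta` are made up here, the `shift` argument only mimics the behaviour described in this post (1 drops the leading adaptor base, 0 keeps it), and skipping lines that start with `#` assumes the usual csfasta comment header.

```python
# NEXT_BASE[previous_base][color] gives the next base (standard SOLiD dibase code).
NEXT_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colors(cs_read, shift=1):
    """Decode one color-space read such as 'T0123' into base space."""
    prev = cs_read[0].upper()
    bases = [prev]
    for color in cs_read[1:]:
        if prev in NEXT_BASE and color in "0123":
            prev = NEXT_BASE[prev][int(color)]
        else:
            # Unknown color ('.', '4', ...) or unknown previous base.
            prev = "N"
        bases.append(prev)
    # shift=1 drops the leading adaptor base, shift=0 keeps it.
    return "".join(bases)[shift:]


def csfasta_to_fasta(lines, shift=1):
    """Yield FASTA lines for an iterable of csfasta lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed comment/header lines
        if line.startswith(">"):
            yield line
        else:
            yield decode_colors(line, shift)


if __name__ == "__main__":
    demo = ["# hypothetical header", ">read_1", "G1131103113113"]
    print("\n".join(csfasta_to_fasta(demo, shift=1)))
```

Once a color or base is unknown, every later base in that read comes out as N, because each base depends on the one before it.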
Attached Files
File Type: pl csfasta2fasta.pl (1.9 KB, 503 views)
Old 06-23-2009, 05:59 AM   #9
chiuchengliu
Junior Member
 
Location: the People's Republic of China

Join Date: Apr 2009
Posts: 1
Default

Quote:
Originally Posted by roedel View Post
When I tried to register at ABI to download the CORONA-lite program, I did not receive a confirmation. Then I used the colour scheme given in

www.iscb.org/uploaded/css/36/12104.pdf

to write a perl script that does the conversion. As far as I understood, the first base in the csfasta is part of the adaptor sequence and should therefore be omitted in the fasta. This can be triggered by setting the shift parameter to 1 (0 would repeat the first base).

./csfasta2fasta.pl seqence.csfasta 1 > output.fasta

If anyone could tell me if this does approximately the same as the CORONA-lite conversion script, I would be happy.
Your script works well except for an extra ">\n" in the output file.

P.S.: Translating color space to base space loses the independence of adjacent color calls. For example, one miscalled color in the middle of a read will corrupt every base after it.
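
The error propagation is easy to see with a small Python sketch. The read below is only an illustrative example and the helper names (`NEXT_BASE`, `decode`) are made up for this sketch; the second read differs from the first by a single color call.

```python
# NEXT_BASE[previous_base][color] gives the next base (standard SOLiD dibase code).
NEXT_BASE = {"A": "ACGT", "C": "CATG", "G": "GTAC", "T": "TGCA"}


def decode(cs_read):
    """Decode a clean color-space read (colors 0-3 only) into bases."""
    prev = cs_read[0]
    out = [prev]
    for color in cs_read[1:]:
        prev = NEXT_BASE[prev][int(color)]
        out.append(prev)
    return "".join(out)


good = "T21130012101"
bad = "T21131012101"  # same read with one miscalled color (0 -> 1) at index 5

good_bases = decode(good)
bad_bases = decode(bad)

# Positions where the two decoded sequences disagree.
changed = [i for i, (g, b) in enumerate(zip(good_bases, bad_bases)) if g != b]

print(good_bases)
print(bad_bases)
print(changed)
```

A single wrong color changes every base from that position to the end of the read, which is the usual argument for mapping in color space and translating only afterwards.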
Old 07-13-2009, 10:51 AM   #10
yoyoq
Junior Member
 
Location: CA

Join Date: Jul 2009
Posts: 8
Default

thank you for that tool,

what the hell is double encoded fasta?
Old 07-13-2009, 11:13 AM   #11
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

'Double-encoded' is where a color-space file is encoded as ACGT. Said ACGT is not base space but a way to encode the 0123 of color-space into something that non color-space aware programs can use.

As an example, given the base-space sequence:

GTGCACCGTGCACG

This encodes into color-space:

G1131103113113

And can be double-encoded into:

GCCTCCATCCTCCT

Double-encoding is simple. 0 goes to 'A', 1 to 'C', etc. As I mentioned, it is simply a way to make color-space into ACGT. I call it an abomination since it means nothing biologically useful yet looks like a biological sequence. It can lead to all sorts of false results if one does not realize what one is dealing with.
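
As a minimal Python sketch of the mapping just described (the names `DOUBLE_ENCODE` and `double_encode` are made up for illustration, and keeping the leading base unchanged follows the example above):

```python
# Colors 0, 1, 2, 3 are rewritten as the letters A, C, G, T.
DOUBLE_ENCODE = str.maketrans("0123", "ACGT")


def double_encode(cs_read):
    """Keep the leading base and rewrite the colors as letters."""
    return cs_read[0] + cs_read[1:].translate(DOUBLE_ENCODE)


print(double_encode("G1131103113113"))  # GCCTCCATCCTCCT
```

The output is only a relabelling of the colors. It is not the base-space sequence GTGCACCGTGCACG.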
Old 07-13-2009, 04:18 PM   #12
yoyoq
Junior Member
 
Location: CA

Join Date: Jul 2009
Posts: 8
Default thanks

thanks,
yes i can confirm that it leads to biological confusion.
Old 08-14-2009, 05:02 PM   #13
yoyoq
Junior Member
 
Location: CA

Join Date: Jul 2009
Posts: 8
Default slight mod to conversion perl script

modified the conversion to avoid making that huge hash.
i was hitting memory limits the old way.
Attached Files
File Type: pl mycsfasta2fasta.pl (1.3 KB, 453 views)
Old 09-23-2009, 07:04 AM   #14
mbreese
Junior Member
 
Location: Indiana

Join Date: Sep 2009
Posts: 5
Default

The included colorspace -> basespace mapping is missing a few entries. Basically anything that includes a '4' or '.' is an N.

(Python format)
__colorspace = {
    'A0': 'A',
    'A1': 'C',
    'A2': 'G',
    'A3': 'T',
    'A4': 'N',
    'A.': 'N',
    'C0': 'C',
    'C1': 'A',
    'C2': 'T',
    'C3': 'G',
    'C4': 'N',
    'C.': 'N',
    'G0': 'G',
    'G1': 'T',
    'G2': 'A',
    'G3': 'C',
    'G4': 'N',
    'G.': 'N',
    'T0': 'T',
    'T1': 'G',
    'T2': 'C',
    'T3': 'A',
    'T4': 'N',
    'T.': 'N',
    'N0': 'N',
    'N1': 'N',
    'N2': 'N',
    'N3': 'N',
    'N4': 'N',
    'N.': 'N',
}
Old 09-23-2009, 07:28 AM   #15
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Actually you are also missing '5' and '6'. Also, what about base-space codes other than N (e.g., R, Y, etc.)? Using a table like the above -- which is what the ABI-provided encodeFasta.py program uses -- is a poor way of handling the conversion IMHO, unless you want to force every non-0,1,2,3 color to a 4 and every non-A,C,G,T base to an N.
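
One table-free alternative, sketched in Python under the assumption that anything outside colors 0-3 and bases A, C, G, T should simply become N: with A, C, G, T numbered 0 to 3, the next base is the previous base XOR the color. This is not how encodeFasta.py works, and the function names are made up for this sketch.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of the base


def next_base(prev, color):
    """Return the base that follows `prev` for a given color, or 'N'."""
    if prev in BASES and color in "0123":
        return BASES[BASES.index(prev) ^ int(color)]
    # '.', '4', '5', '6', ambiguity codes such as R or Y, and so on.
    return "N"


def decode(cs_read):
    """Decode a color-space read, keeping the leading base."""
    prev = cs_read[0].upper()
    if prev not in BASES:
        prev = "N"
    out = [prev]
    for color in cs_read[1:]:
        prev = next_base(prev, color)
        out.append(prev)
    return "".join(out)


print(decode("G1131103113113"))
print(decode("T2113.012"))
```

Because unknown symbols fall through to a single default, no combination has to be listed explicitly, so entries cannot go missing from a table.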
Old 09-23-2009, 07:38 AM   #16
mbreese
Junior Member
 
Location: Indiana

Join Date: Sep 2009
Posts: 5
Default

I actually just pulled the table from ABI. The attached perl scripts didn't handle N's at all.

I'm very new to this, and haven't run across 5 or 6 in our data. What do they stand for? (Ambiguity codes?)
Old 09-23-2009, 07:51 AM   #17
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

There are 3 transitions: N to N, N to known (ACGT), and known to N. These transitions can be represented by 3 different color-space numbers, in this case '4', '5', and '6'. Off the top of my head I do not remember which is which. Also, only some of the ABI programs actually work with this concept. encodeFasta.py, which the ABI SNP-calling manual says to use, does not handle any of the cases. It makes me wonder at times if ABI even uses their own programs on any real-life data. :-(
Old 09-23-2009, 09:47 AM   #18
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

I was sitting here avoiding work -- I have an intractable problem, ugh! -- wondering where I had seen that 4,5,6 color-space encoding. So I looked it up. The 'dna_subroutines.pm' (perl library, obviously) has the following, which is used by the 'convert_to_dibase' subroutine. At least of the 26 programs in the 'bin' directory use the 'dna_subroutines.pm' module, although I am not certain if any use the convert_to_dibase routine. None seem to do so directly. None of the Python routines use the 4,5,6 color-space encoding.

So ... using a '4' is probably good enough.

$color{AN} = 4;
$color{CN} = 4;
$color{GN} = 4;
$color{TN} = 4;
$color{NA} = 5;
$color{NC} = 5;
$color{NG} = 5;
$color{NT} = 5;
$color{NN} = 6;
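
For illustration, the same assignments can be written as a small base-space to color-space encoder in Python. This is only a sketch modelled on the table above, not a port of convert_to_dibase. The names `color` and `encode` are made up, and treating every non-ACGT symbol as N is an assumption.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of the base


def color(a, b):
    """Return the color for the base pair (a, b), using 4/5/6 around unknowns."""
    known_a, known_b = a in BASES, b in BASES
    if known_a and known_b:
        return str(BASES.index(a) ^ BASES.index(b))
    if known_a:
        return "4"  # known -> N
    if known_b:
        return "5"  # N -> known
    return "6"      # N -> N


def encode(seq):
    """Encode a base-space sequence, keeping its first base as the anchor."""
    seq = seq.upper()
    return seq[0] + "".join(color(a, b) for a, b in zip(seq, seq[1:]))


print(encode("GTGCACCGTGCACG"))  # G1131103113113
print(encode("ANNA"))            # A465
```

Replacing the '5' and '6' branches with '4' gives the simpler "a '4' is probably good enough" behaviour.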
Old 10-05-2009, 06:38 AM   #19
inesdesantiago
Member
 
Location: LONDON, UNITED KINGDOM

Join Date: Jan 2009
Posts: 44
Default "N" in basespace

How come NA, NC, NT, NG all have the same colorspace code '5'?
This means that once you have N for a given base you never know what the next base is? You don't know if it is A, G, C, or T ...
Right?
Ines

Last edited by inesdesantiago; 10-05-2009 at 06:58 AM. Reason: typo
Old 10-05-2009, 07:26 AM   #20
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by inesdesantiago View Post
How come is NA, NC, NT, NG all have the same code '5'.
This means that once you have N for a given base you never know what is the next base? You don't know if it is A,G,C,T ...
Right?
Ines
That is only partially correct, but to a first approximation it is right. You certainly cannot properly decode from colorspace (CS) into basespace (BS) if there are 4s, 5s, or 6s in the CS. However, this does not keep you from using the information in matching.

[Note: CS reads off of the sequencer will have a simple period (.) when there is an unknown and 0 through 3 for known ... 4,5,6s are only used when computationally processing BS->CS->BS translations]

Let's go for an example.

Say we have a (poor) reference sequence that in BS is:

TCACGNGTCAAC

Translating this into CS so that it can be mapped:

T21134412101

Computationally if we tried to convert this CS back to BS we would get:

TCACGNNNNNNN

On the other hand, if we had an actual CS read from the sequencer such as:

T21130012101

We can certainly map, allowing for mismatches, that actual read to our reference. If we had enough reads coming off the sequencer that were all the same as the above (or, better, had slightly different start points and also overlapped the region in question), then we could say with confidence that while our reference sequence indicated an 'N' in the position, our actual sequenced organism has a 'G'.
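
The comparison in this example can be sketched in Python as a plain position-by-position check of two aligned color strings. Real mappers do far more than this; `color_mismatches` is a made-up helper name and the equal-length requirement is a simplification of this sketch.

```python
def color_mismatches(read_cs, ref_cs):
    """Return the positions where two aligned color-space strings differ."""
    if len(read_cs) != len(ref_cs):
        raise ValueError("this sketch only compares equal-length strings")
    return [i for i, (r, f) in enumerate(zip(read_cs, ref_cs)) if r != f]


reference = "T21134412101"  # CS form of the reference TCACGNGTCAAC
read = "T21130012101"       # CS read from the sequencer

diffs = color_mismatches(read, reference)
print(diffs)  # [5, 6]
```

Both mismatches fall on the two colors that touch the reference N, so a mapper that tolerates two color mismatches would still place the read there.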

Note that you can get into trouble with the above if your reads could potentially map to other parts of your organism and those parts are not part of your reference. This is a major reason for wanting different start sites and long reads. So tread with care.
All times are GMT -8.