**snetmcom** · 01-21-2009, 08:47 PM

http://www.softgenetics.com/NextGENe.html

There are some others, but I don't think they are using color space (CS) correctly.

**doxologist** · 01-21-2009, 11:41 PM

Thanks for your response. Have you had experience with NextGENe? It doesn't seem that many people are using it. I'm trying it as well, but it seems to translate CS data to fasta before matching. Doesn't this lose the ability to align reads that contain mismatches (since a CS mismatch changes all the subsequent nucleotides)?

**Roald** · 01-29-2009, 01:28 AM

Disclaimer: I work at CLC bio

We have just included native color space assembly in our NGS cell software

http://clcbio.com/index.php?id=1331

You can grab a white paper with benchmarks at http://clcbio.com/index.php?id=1368

Cheers

Roald

**doxologist** · 02-12-2009, 12:34 PM

I am now trying NextGENe and it seems that it translates color-space data to fasta first, before the analysis. Is this correct? If so, it seems that there is much potential for error (0-2) and that this is not a method recommended by ABI. Does this seem right?

**doxologist** · 02-12-2009, 12:35 PM

Thanks, Roald. I'll take a look at CLC bio.

**westerman** · 02-12-2009, 01:09 PM

Originally posted by doxologist:

> I am now trying NextGENe and it seems that it translates color-space data to fasta first, before the analysis. Is this correct? If so, it seems that there is much potential for error (0-2) and that this is not a method recommended by ABI. Does this seem right?

I am not familiar with NextGENe, but if they are indeed translating to base space instead of doing their work within color space then, yes, there is great potential for error. Unlike traditional sequencing technologies, where a single miscall affects only that particular base, on the SOLiD a single color miscall affects all downstream bases once the read is translated. Also, by not working in color space one misses a major strength of the SOLiD: great SNP calling.
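To make the downstream effect concrete, here is a minimal Python sketch. It assumes the standard SOLiD two-base color code (with A=0, C=1, G=2, T=3, the color between two adjacent bases is the XOR of their indices). The function name `decode` and both reads are made up for illustration; they are not taken from any tool discussed in this thread.

```python
# Assumed color code: color = index(previous base) XOR index(next base),
# with A=0, C=1, G=2, T=3. This reproduces 0=same base, 1=A<->C/G<->T,
# 2=A<->G/C<->T, 3=A<->T/C<->G.
BASES = "ACGT"

def decode(cs_read):
    """Translate a color-space read (primer base + color digits) into bases."""
    prev = BASES.index(cs_read[0])   # the known primer base
    out = []
    for color in cs_read[1:]:
        prev ^= int(color)           # next base = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)

good = "T1203312"   # hypothetical read
bad = "T1213312"    # same read with the third color miscalled (0 read as 1)

print(decode(good))  # GAATACT
print(decode(bad))   # GACGCAG
```

The two color strings differ at one position, yet after translation every base from the miscall onward is different. An aligner working on the translated reads sees five mismatches instead of one.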

**Roald** · 02-12-2009, 01:40 PM

You are both absolutely right that a huge amount of information is lost by aligning SOLiD data in sequence space, rather than in color space.
The benchmarks we have made (see http://clcbio.com/index.php?id=1368 ) showed that the number of aligned reads increases by over 80% when reads are aligned in color space rather than in sequence space. This example is for reads of length 35, and the tendency will only increase as reads get longer.

**doxologist** · 02-12-2009, 01:53 PM

Hmm... great.. thanks for the info. Perhaps it is already addressed... how does CLC Bio compare with Zoom and BFAST?

**Mr Mutundes** · 02-18-2009, 04:44 PM

Allow me to ask what may be a dumb question...

If I "double encode" (to use the ABI term) both my reads and my reference sequence (so that colors are represented by ACGTs), then why can't I use bowtie, blat, blastall or whatever alignment program I like and expect success? Sure there would be some post-alignment work involved in distinguishing biological variants from sequencing errors but I don't see why the alignment itself wouldn't be valid and useful.

Thanks

**ECO** · 02-18-2009, 09:35 PM

Originally posted by Mr Mutundes:

> Allow me to ask what may be a dumb question...
>
> If I "double encode" (to use the ABI term) both my reads and my reference sequence (so that colors are represented by ACGTs), then why can't I use bowtie, blat, blastall or whatever alignment program I like and expect success? Sure there would be some post-alignment work involved in distinguishing biological variants from sequencing errors but I don't see why the alignment itself wouldn't be valid and useful.
>
> Thanks

Hey! Your answer is in Post #7 above!

**Mr Mutundes** · 02-19-2009, 01:17 AM

no no no! "Double encoding" doesn't put you in base space!

Let me put the question again: a sequence of colors is often represented by digits, but can just as easily be represented by the characters ACGT (somewhere in the AB Corona Lite stuff this is referred to as "double encoding"). If I have a query sequence and a target sequence both encoded this way, then because they both "look like" nucleotide sequences they are acceptable as input to standard nucleotide alignment programs. But what is being aligned are two color sequences, not two base sequences. So if there is a color sequencing error, the alignment will NOT be perturbed as it would be in an alignment done in "base space". (I think...) So, why can't we use traditional alignment programs?

Happy to be corrected!

**westerman** · 02-19-2009, 09:45 AM

There are at least three problems, Mr Mutundes, with using double-encoded sequences with traditional alignment programs.

(1) As I mentioned above, a single color (or double-encoded) change at the start of the sequence will decode to an entirely different base sequence.

(2) Related to the above, opposite strands do not match. Thus you have to tell your traditional program to align to one strand at a time.

(3) Traditional programs expect a SNP to be a single base change. Sequencing errors are also a single base. However, in color space (and thus double-encoded space) a SNP shows up as two adjacent color changes, while a sequencing error is a single color change.

In summary the problem is not double-encoding per se -- as you point out it should not matter if the alphabet 0, 1, 2, 3 or the alphabet A, C, G, T is used. Rather the problem is that traditional programs do not know how to cope with the power and weakness of color-space.
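Point (3) can be checked with a short Python sketch. It assumes the standard SOLiD color code (color = XOR of the two base indices, with A=0, C=1, G=2, T=3). The helper names `encode` and `color_diffs`, the SNP, and the miscalled read are all made up for illustration.

```python
BASES = "ACGT"

def encode(bases, primer="T"):
    """Encode a base sequence as a color-space read that starts with a primer base."""
    prev = BASES.index(primer)
    colors = []
    for base in bases:
        cur = BASES.index(base)
        colors.append(str(prev ^ cur))   # color between previous and current base
        prev = cur
    return primer + "".join(colors)

def color_diffs(a, b):
    """Positions (0 = primer) at which two color-space reads differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

ref = "GATCCGA"        # hypothetical reference stretch
snp = "GATTCGA"        # hypothetical SNP: fourth base C -> T

ref_cs = encode(ref)   # T1232032
snp_cs = encode(snp)   # T1230232
err_cs = "T1231032"    # ref_cs with a single miscalled color

print(color_diffs(ref_cs, snp_cs))  # [4, 5]  two adjacent colors change
print(color_diffs(ref_cs, err_cs))  # [4]     only one color changes
```

A color-space-aware aligner can use this difference to separate real variants from sequencing errors. A traditional aligner fed double-encoded reads just sees two mismatches in one case and one in the other.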

Sitting down in front of a chalkboard with another person does a lot for the 'ah-ha!' discovery moment. Since I can not do that with you I will instead use my next couple of messages as a way to convey the above ideas. I assume that you know how color-space encoding is done by the sequencer. Also for ease of typing I will use runs of 7 bases instead of the normal 25 or 35 or (eventually) more.

**westerman** · 02-19-2009, 10:02 AM

Single change causes big problems.

If I have two reads in color space

(1 CS) T3232032
(2 CS) T1232032

Which are the actual bases in base space

(1 BS) AGCTTAG
(2 BS) GATCCGA

And in double-encoded space without primer trimming:

(1 DEN) TTGTGATG
(2 DEN) TCGTGATG

Or in the more proper primer-trimmed double-encoding (since the primer means something different than the double-encoding; e.g., the 'T' primer is actually a 'T' and not a substitute for the number '3'):

(1 DET) GTGATG
(2 DET) GTGATG

So now you take the double-encoded trimmed (DET) reads and put them into a traditional assembler. Congratulations, you have now assembled AGCTTAG and GATCCGA together!

Even if you take the double-encoded non-trimmed reads and put them through a traditional assembler then you end up with the same incorrect assembly since 7 of the 8 double-encoded bases align. Note that this percentage is even more against you if you are using 25- or 35-base reads. If you insist that your assembler make exact matches (8 of 8 in this case) then you never get adjacent overlaps and thus no contigs.
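The example above can be reproduced with a few lines of Python. This is only a sketch: it assumes the standard SOLiD color code (color = XOR of the base indices, with A=0, C=1, G=2, T=3) and the 0/1/2/3 to A/C/G/T double-encoding used in this post, and it takes "trimmed" to mean dropping the primer base and the first color. The names `decode` and `double_encode` are hypothetical helpers, not functions from any SOLiD tool.

```python
BASES = "ACGT"

def decode(cs_read):
    """Translate a color-space read (primer base + color digits) into bases."""
    prev = BASES.index(cs_read[0])
    out = []
    for color in cs_read[1:]:
        prev ^= int(color)
        out.append(BASES[prev])
    return "".join(out)

def double_encode(cs_read, trim=True):
    """Rewrite the color digits 0-3 as the letters A, C, G, T.

    With trim=True the primer base and the first color are dropped;
    otherwise the primer base is kept as the leading character.
    """
    letters = "".join(BASES[int(c)] for c in cs_read[1:])
    return letters[1:] if trim else cs_read[0] + letters

read1, read2 = "T3232032", "T1232032"

print(double_encode(read1, trim=False))  # TTGTGATG
print(double_encode(read2, trim=False))  # TCGTGATG
print(double_encode(read1))              # GTGATG
print(double_encode(read2))              # GTGATG

# The trimmed double-encoded reads are identical, yet the decoded
# base sequences disagree at every single position.
same_bases = sum(a == b for a, b in zip(decode(read1), decode(read2)))
print(same_bases)                        # 0
```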

**westerman** · 02-19-2009, 10:10 AM

Opposite strand reads do not align

I am using a repetitive sequence here but the same idea is true for non-repeat areas.

In color space there are two reads:

(CS 1) T0000000
(CS 2) T3000000

These represent in base space:

(BS 1) TTTTTTT
(BS 2) AAAAAAA

If these are reads on opposite strands then they should align. So let's convert them into double-encoding and put them through a traditional alignment program.

(DET 1) AAAAAA
(DET 2) AAAAAA

Ooops! It is going to be hard to find any alignment that way!
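One way to see the strand problem on a non-repetitive sequence is the Python sketch below. It assumes the standard SOLiD color code (color = XOR of the base indices, with A=0, C=1, G=2, T=3). Under that code, the colors of the opposite strand are the colors of the forward strand read backwards, with no complementing. A traditional aligner knows nothing about this and reverse-complements the double-encoded letters instead. The sequence and the helper names `colors`, `revcomp`, and `double_encode` are made up for illustration.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def colors(bases):
    """Colors between adjacent bases (no primer), as a string of digits."""
    idx = [BASES.index(b) for b in bases]
    return "".join(str(a ^ b) for a, b in zip(idx, idx[1:]))

def revcomp(seq):
    """Ordinary base-space reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]

def double_encode(color_digits):
    """Rewrite color digits 0-3 as the letters A, C, G, T."""
    return "".join(BASES[int(c)] for c in color_digits)

fwd = "GATCCGA"                 # hypothetical forward-strand sequence
rev = revcomp(fwd)              # TCGGATC, the opposite strand

fwd_de = double_encode(colors(fwd))   # GTGATG
rev_de = double_encode(colors(rev))   # GTAGTG

# In color space the opposite strand is simply the reversed color string.
print(colors(rev) == colors(fwd)[::-1])   # True

# A traditional aligner would look for the reverse complement of the
# double-encoded read, which is not what the opposite strand looks like.
print(revcomp(fwd_de))                    # CATCAC
print(revcomp(fwd_de) == rev_de)          # False
```

This is why a traditional program has to be told to search one strand at a time, with the reversal handled by hand.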

Third Party Software for Colorspace data?
