  • fabio25
    Member
    • Aug 2008
    • 13

    SOLiD vs Solexa

    Dear Everybody,
    I would like to ask if someone can help me understand the difference between the SOLiD machine and the Solexa one. Which tools are used to analyze each type of data, and how does the wet-lab work differ?
    Thanks a lot
  • joa_ds
    Member
    • Dec 2008
    • 52

    #2
    good question.

    First of all, everybody says their machine is the best choice (of course, they bought it...). No real benchmarking has been done in the past.

    They all have their own good points and weak points...

    But a Solexa and a SOLiD are waaaay different. Just the chemistry is totally different: nucleotide space vs color space. I can imagine the SOLiD pipeline will be totally different. Maybe only the image processing can be somewhat the same, but even then...
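To make the chemistry difference concrete, here is a minimal sketch of SOLiD-style two-base colour encoding, assuming the commonly described mapping in which each colour is the XOR of the 2-bit base codes A=0, C=1, G=2, T=3, with a known primer base anchoring the decode (the sequences here are made up):

```python
# Sketch of SOLiD-style two-base "colour space" encoding, assuming the
# standard mapping: colour = XOR of the 2-bit base codes A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = {v: k for k, v in CODE.items()}

def encode(seq, primer="T"):
    """Base sequence -> list of colours, anchored on the known primer base."""
    full = primer + seq
    return [CODE[a] ^ CODE[b] for a, b in zip(full, full[1:])]

def decode(colours, primer="T"):
    """Colours -> base sequence, decoded relative to the primer base."""
    bases, prev = [], CODE[primer]
    for c in colours:
        prev ^= c                 # each colour encodes a base-to-base transition
        bases.append(BASE[prev])
    return "".join(bases)

print(encode("ACGT"))             # [3, 1, 3, 1]
print(decode(encode("ACGT")))     # ACGT
```

In a Solexa/Illumina read each cycle reports a base directly; in a SOLiD read each call reports a transition between two adjacent bases, which is why the downstream pipelines diverge so much.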



    happy reading


    • new300
      Member
      • Mar 2008
      • 50

      #3
      Originally posted by joa_ds:
      good question.

      First of all, everybody says their machine is the best choice (of course, they bought it...). No real benchmarking has been done in the past.
      I think a bunch of benchmarking has been done, by genome centres for example. But I don't think much of it has been published.

      From what I've heard SNP call error rate on a good SOLiD run is about the same as an Illumina. I don't know what the run failure rate is like.

      From my brief look, the single colour change error rate on the SOLiD is somewhere around 7%; that was in the E. coli data release they did: http://www.genographia.org/portal/to...rimer.pdf/view The Illumina error rate is around 1 or 2% on a good run (including contamination in both cases, doing a brute-force alignment). The other issue is that a single error in a SOLiD read effectively corrupts the rest of the read, unless you have a reference. So for anything de novo, you're a bit stuck.
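A small sketch of why a single colour error corrupts the remainder of a decoded read when there is no reference to correct against (assuming the XOR colour mapping A=0, C=1, G=2, T=3; the read values are made up):

```python
# Sketch: why one colour-call error garbles the rest of a decoded SOLiD read.
# Assumes the standard XOR colour mapping (A=0, C=1, G=2, T=3).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def decode(colours, primer="T"):
    bases, prev = [], CODE[primer]
    for c in colours:
        prev ^= c                 # every base depends on the previous one
        bases.append(BASE[prev])
    return "".join(bases)

good = [3, 1, 3, 1, 0, 2, 1]      # encodes ACGTTCA from primer T
bad = list(good)
bad[2] ^= 1                       # one miscalled colour mid-read
print(decode(good))               # ACGTTCA
print(decode(bad))                # ACTGGAC: first two bases survive, rest shift
```

Because each base is decoded relative to the previous one, everything downstream of the bad colour lands in the wrong "phase" until something external (a reference, or extra redundancy) resets it.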

      My feeling is that the market is showing that right now the GA is a more versatile platform, with longer reads and a lower base error rate. If you look at the number of GA publications against the number of SOLiD publications, that gives you a good idea of how useful people are finding the data. There's a neat graph here: www.mrgc.com.my


      • Chipper
        Senior Member
        • Mar 2008
        • 323

        #4
        Originally posted by new300:
        I think a bunch of benchmarking has been done, by genome centres for example. But I don't think much of it has been published.

        From what I've heard SNP call error rate on a good SOLiD run is about the same as an Illumina. I don't know what the run failure rate is like.

        From my brief look, the single colour change error rate on the SOLiD is somewhere around 7%; that was in the E. coli data release they did: http://www.genographia.org/portal/to...rimer.pdf/view The Illumina error rate is around 1 or 2% on a good run (including contamination in both cases, doing a brute-force alignment). The other issue is that a single error in a SOLiD read effectively corrupts the rest of the read, unless you have a reference. So for anything de novo, you're a bit stuck.

        My feeling is that the market is showing that right now the GA is a more versatile platform, with longer reads and a lower base error rate. If you look at the number of GA publications against the number of SOLiD publications that gives you a good idea of how useful people are finding that data. There's a neat graph here: www.mrgc.com.my
        1. Why do you compare the single color change to the base-call error rate on Illumina? 2. SOLiD de novo assembly can be done with error correction, at least with Velvet. 3. The publication numbers reflect more the ratio of Ill:SOLiD in use than anything else, I guess. With the coming updates it will probably be competitive with Ill. in terms of handling as well.


        • new300
          Member
          • Mar 2008
          • 50

          #5
          Originally posted by Chipper:
          1. Why do you compare the single color change to the base call error rate on Illumina?
          Because that's what you'll have to deal with if you're doing de novo stuff. The 2 colour change makes it comparable for SNP calling but not de novo...

          Originally posted by Chipper:
          2. SOLiD de novo assembly can be done with error correction at least with Velvet. 3.
          Error correction only buys you so much, and the quality of your assembly will be a function of the base error rate and read length. GA reads have a lower error rate... so I think for this application they are probably better.

          I've not seen any de novo SOLiD assemblies so if anybody has this I'd be interested in taking a look.

          Originally posted by Chipper:
          The publication numbers reflect more the ratio of Ill:SOLiD in use than anything else I guess.
          I think the fact that there are more GAs in use reflects the fact that people prefer them and get more data out of them...

          Originally posted by Chipper:
          With the coming updates it will probably be competitive with Ill. in terms of handling as well.
          Yep, we'll have to wait and see; the market's always changing.


          • jkbonfield
            Senior Member
            • Jul 2008
            • 146

            #6
            Originally posted by Chipper:
            1. Why do you compare the single color change to the base-call error rate on Illumina? 2. SOLiD de novo assembly can be done with error correction, at least with Velvet. 3. The publication numbers reflect more the ratio of Ill:SOLiD in use than anything else, I guess. With the coming updates it will probably be competitive with Ill. in terms of handling as well.
            When doing de novo sequence assembly you're essentially aligning in colour space, just treating the 0, 1, 2 and 3 as four characters to align (e.g. rename them, albeit misleadingly, A, C, G, T if it makes programs work). In this respect it's incorrect to claim that a single error makes the rest of the read unusable from that point on. However, it's also incorrect to assume you need two adjacent errors for a problem to arise.
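The relabelling trick can be sketched in a couple of lines (assuming a csfasta-style read, i.e. a primer base followed by colour digits; the letter assignment is arbitrary, and the read here is made up):

```python
# Sketch: relabel colour calls 0-3 as A/C/G/T so a nucleotide-space aligner
# will accept them. The letters are arbitrary stand-ins, NOT real bases.
# Assumes a csfasta-style read: one primer base followed by colour digits.
COLOUR_TO_FAKE_BASE = str.maketrans("0123", "ACGT")

def colours_as_letters(csfasta_read):
    primer, colours = csfasta_read[0], csfasta_read[1:]
    return colours.translate(COLOUR_TO_FAKE_BASE)   # drop the primer base

print(colours_as_letters("T3201130"))   # TGACCTA
```

Alignment then works character-for-character in colour space; the caveat is exactly the one raised here, that "matches" between these fake bases say nothing directly about base-space identity.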

            For what it's worth, even when doing mapping experiments the single error rate DOES still matter somewhat. It directly relates to your mapping confidence values. Have too many errors and you'll find it both hard to map and also have a significant chance of placing things in the wrong location.


            • new300
              Member
              • Mar 2008
              • 50

              #7
              Originally posted by jkbonfield:
              When doing de novo sequence assembly you're essentially aligning in colour space, just treating the 0, 1, 2 and 3 as four characters to align (e.g. rename them, albeit misleadingly, A, C, G, T if it makes programs work). In this respect it's incorrect to claim that a single error makes the rest of the read unusable from that point on. However, it's also incorrect to assume you need two adjacent errors for a problem to arise.
              I probably didn't explain myself very well. If you align in colour space your "colour assembly" will probably be OK. However, you need to get back into base space at some point. The two options I can see are:

              1. You translate the assembled colour space contigs into base space. This will be bad because a single error will corrupt the rest of the contig.

              2. You align in colour space, then translate the individual reads back into base space. In this case you limit corruption to the remaining part of that single read. So... the read was useful for building contigs, but not for base calling unless you were able to correct the error.

              In practice I'd expect this to cause significant issues for de novo assembly, but there might be ways round these issues I've not considered.
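A toy illustration of how these routes play out (assuming perfectly aligned reads and the XOR colour mapping A=0, C=1, G=2, T=3; the reads are made up): decoding an erroneous read by itself garbles its tail, while a majority-vote consensus taken in colour space outvotes the isolated error before any decoding happens.

```python
# Sketch: a colour-space majority consensus removes an isolated colour error
# before decoding, so only the bad read's own base translation is garbled.
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def decode(colours, primer="T"):
    bases, prev = [], CODE[primer]
    for c in colours:
        prev ^= c
        bases.append(BASE[prev])
    return "".join(bases)

reads = [
    [3, 1, 3, 1, 0, 2, 1],   # clean
    [3, 1, 3, 1, 0, 2, 1],   # clean
    [3, 1, 2, 1, 0, 2, 1],   # one colour error at position 2
]
# column-wise majority vote in colour space
consensus = [Counter(col).most_common(1)[0][0] for col in zip(*reads)]
print(decode(consensus))     # ACGTTCA: the error was outvoted
print(decode(reads[2]))      # ACTGGAC: garbled from position 2 onward
```

This is why translating reads individually (option 2) limits the damage: the corrupted tail affects one read's base calls, and the other reads at that position can outvote it in the consensus.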


              • jkbonfield
                Senior Member
                • Jul 2008
                • 146

                #8
                Originally posted by new300:
                1. You translate the assembled colour space contigs into base space. This will be bad because a single error will corrupt the rest of the contig.
                You sort of want to avoid doing this until as late as possible - but fundamentally there'll come a time when it needs to be done before the assembly has been finished, e.g. to merge with other data or to start sequence analysis on an unfinished genome.

                Originally posted by new300:
                2. You align in colour space, then translate the individual reads back into base space. In this case you limit corruption to the remaining part of that single read. So... the read was useful for building contigs, but not for base calling unless you were able to correct the error.
                Well that individual read's contribution was poor for the consensus generation, but the rest are hopefully enough to compensate.

                I think there's a third route too, which is a combination of 1 and 2 above. You can compute the consensus from all reads in colour space, like option 1, before converting to DNA space for use in other tools. However, using the known last base of the primer for each read, we can verify whether the sequence matches the consensus. If it doesn't, then it implies that in the last few bases a consensus colour call was incorrect and our colour-to-DNA conversion became out of sync.

                Essentially this is using the last primer base as an auto-correction system to ensure that we always know which of the four "phases" the colour-to-base conversion system should be in. If we have sufficient depth then we'll get the resolution quite high, possibly to the base level (say 25-fold and above). It's not as robust as comparison against a reference sequence and SNP correction, as we only have one correcting factor per read rather than per base, but there's still sufficient information to use. This of course assumes that the assembly is correct; misassemblies would still cause problems.

                Rather messy, and personally not appealing unless we could see some tangible gain for having to go through the extra hoops.
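A sketch of that primer-base phase check (made-up offsets and reads; assumes the XOR colour mapping A=0, C=1, G=2, T=3): decode the colour consensus, then compare each read's known first base against the decoded consensus base at the read's start position; a mismatch flags that the conversion slipped phase somewhere upstream.

```python
# Sketch of the primer-base phase check: each read's known first base acts
# as an anchor that the decoded consensus must agree with at that offset.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def decode(colours, primer="T"):
    bases, prev = [], CODE[primer]
    for c in colours:
        prev ^= c
        bases.append(BASE[prev])
    return "".join(bases)

consensus_bases = decode([3, 1, 3, 1, 0, 2, 1])   # "ACGTTCA" (clean)
slipped = decode([3, 1, 2, 1, 0, 2, 1])           # same colours, one error

# (offset into the contig, known first base of the read starting there)
read_anchors = [(0, "A"), (2, "G"), (4, "T")]
for offset, first_base in read_anchors:
    clean_ok = consensus_bases[offset] == first_base
    slip_ok = slipped[offset] == first_base
    print(offset,
          "clean:", "in phase" if clean_ok else "OUT OF PHASE",
          "| slipped:", "in phase" if slip_ok else "OUT OF PHASE")
```

In the slipped decode, every anchor downstream of the error disagrees, which localises the phase break to somewhere between the last agreeing anchor and the first disagreeing one - hence the point about needing coverage comparable to read length for base-level resolution.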


                • lh3
                  Senior Member
                  • Feb 2008
                  • 686

                  #9
                  I second James' third idea. Both AB and maq implement "reference-based translation" from color space to nucleotide space. Such translation is very robust to color errors: if the read mapping is right, we can confidently correct most color errors. I do not know how AB achieves this; maq does so with a simple O(4*4*L)-time dynamic programming (the DP part is just 50 lines of C code). This DP can also realize James' idea: we take the color contig as a "read" and take the sequence of first nucleotides of the real individual reads as the "reference"; some holes in the "reference" should not matter too much. Translating a color contig in this way is also very robust to color errors.
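A minimal sketch of such a reference-based translation as dynamic programming (my own toy penalties and example, not maq's actual scoring or code): keep a best score for each of the four candidate bases at every position, charging a penalty when the implied colour disagrees with the observed colour and another when the base disagrees with the reference, then backtrack the best path.

```python
# Sketch of reference-based colour->base translation as an O(4*4*L) DP.
# Penalties are illustrative, not maq's; XOR colour mapping A=0,C=1,G=2,T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"
COLOUR_MISMATCH, REF_MISMATCH = 3, 2   # toy penalty weights

def translate(colours, reference, primer="T"):
    """Pick the base sequence best explaining both the observed colours and
    the reference, via a 4-state-per-position Viterbi-style DP."""
    n = len(colours)
    INF = float("inf")
    score = [[INF] * 4 for _ in range(n)]
    back = [[0] * 4 for _ in range(n)]
    p0 = CODE[primer]
    for b in range(4):  # initialise position 0 from the primer base
        score[0][b] = ((colours[0] != (p0 ^ b)) * COLOUR_MISMATCH
                       + (BASE[b] != reference[0]) * REF_MISMATCH)
    for i in range(1, n):
        for b in range(4):          # candidate base at position i
            for p in range(4):      # candidate base at position i-1
                s = (score[i - 1][p]
                     + (colours[i] != (p ^ b)) * COLOUR_MISMATCH
                     + (BASE[b] != reference[i]) * REF_MISMATCH)
                if s < score[i][b]:
                    score[i][b], back[i][b] = s, p
    b = min(range(4), key=lambda x: score[n - 1][x])
    out = [b]
    for i in range(n - 1, 0, -1):   # backtrack the best path
        b = back[i][b]
        out.append(b)
    return "".join(BASE[c] for c in reversed(out))

# one colour error at position 2; the reference pulls the calls back on track
print(translate([3, 1, 2, 1, 0, 2, 1], "ACGTTCA"))   # ACGTTCA
```

The isolated colour error costs one colour-mismatch penalty on the reference-consistent path, which is cheaper than the cascade of reference mismatches incurred by decoding the colours literally, so the DP recovers the correct bases.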


                  • new300
                    Member
                    • Mar 2008
                    • 50

                    #10
                    Originally posted by jkbonfield:
                    I think there's a third route too, which is a combination of 1 and 2 above. You can compute the consensus from all reads in colour space, like option 1, before converting to DNA space for use in other tools. However, using the known last base of the primer for each read, we can verify whether the sequence matches the consensus. If it doesn't, then it implies that in the last few bases a consensus colour call was incorrect and our colour-to-DNA conversion became out of sync.

                    Essentially this is using the last primer base as an auto-correction system to ensure that we always know which of the four "phases" the colour-to-base conversion system should be in.
                    OK, yes. If I've understood correctly:

                    You build a colour space consensus, align colour space reads to it, then convert the consensus and the first bases of the reads to base space (on the fly). If you compare these and get a mismatch, then you know your consensus has gone out of phase (or there was some really horrible error in the read).

                    So... when you do detect an error, the bases between the last known good initial base and the next known good one are in doubt. To get this down to base resolution you'd need coverage equal to read length.

                    I think that's a neat trick that could help out a fair bit, though like you say it's not as robust as the SNP trick.

                    I think there is some probability of error in that first base. It's a single colour change error, so based on the data I've seen I'd guess around 2%, whereas for a normal base it's about 6-8%... So you could end up marking a bad base good, or a good base bad. The resolution would need to be high to avoid this affecting a large number of bases.

                    Originally posted by jkbonfield:
                    If we have sufficient depth then we'll get the resolution quite high, possibly to the base level (say 25 fold and above). It's not as robust as comparison against a reference sequence and SNP correction as we only have one correcting factor per read rather than per base, but there's sufficient information to use still. This of course assumes that the assembly is correct. Misassemblies would cause problems still.

                    Rather messy and personally not appealing unless we could see some tangible gain for having to go through the extra hoops.
                    It sounds like a technique like this would be the right kind of strategy for SOLiD data. I think it's a workaround for the problems caused by two-colour changes in de novo assembly, though; it's not buying you error correction (with respect to base space) like the SNP stuff does.

                    So when you're doing your actual assembly you're still left with the single colour change error rate when overlapping. I think with an error rate this high, and short reads, you'd be lucky to produce a consensus good enough to work from... maybe if you could filter out a lot of the errors...

                    Right now I can't see that the SOLiDs are likely to be competitive for de novo, not compared with the GAs. Read lengths would need to be longer and the single colour change error rate lower. Either that, or they'd need a throughput advantage of at least an order of magnitude.


                    • Chipper
                      Senior Member
                      • Mar 2008
                      • 323

                      #11
                      Is an error rate of 6-8% (single color change) really normal? How is this value calculated, and what is the corresponding error rate for the Illumina? Are the numbers affected by the lack of filtering of empty or mixed beads on the SOLiD, and if so, would it be better to apply quality filtering to SOLiD data before doing de novo assembly?


                      • new300
                        Member
                        • Mar 2008
                        • 50

                        #12
                        Originally posted by Chipper:
                        Is an error rate of 6-8% (single color change) really normal? How is this value calculated, and what is the corresponding error rate for the Illumina? Are the numbers affected by the lack of filtering of empty or mixed beads on the SOLiD, and if so, would it be better to apply quality filtering to SOLiD data before doing de novo assembly?
                        It's what I saw when I did a brute-force alignment of the E. coli data release (http://www.genographia.org/portal/to...rimer.pdf/view). My understanding is that empty beads are filtered early on. There were also no reads with more than 8 errors, which makes me think some filtering had been applied. Some MSc students I was working with also saw similar error rates in the Yoruban dataset.

                        You should also be able to calculate the single colour change error rate from the SNP miscall rate. I've seen this quoted as 0.036, which I think should be roughly equivalent to a single colour change error rate of 6%. Those are the only numbers I have to go on; a comprehensive review would be useful.

                        I think additional filtering would help; it's a trade-off between that and throughput.

                        As delivered by the device/quoted in throughput numbers, the Illumina error rate is around 1%. They apply relatively harsh filtering to remove mixed clusters during primary data analysis. It'd be interesting to see a SOLiD dataset where filtering had been applied to get the single colour change error rate down to 1%; that would make for a useful comparison.


                        • bioinfosm
                          Senior Member
                          • Jan 2008
                          • 483

                          #13
                          I am still curious as to how SOLiD and Solexa compare apples to apples. Both produce short reads, but there's still not much on how similar or complementary they are!

                          Met a few people at AGBT and still could not find the answers...
                          --
                          bioinfosm


                          • westerman
                            Rick Westerman
                            • Jun 2008
                            • 1104

                            #14
                             Wasn't there a paper within the last several months which compared all three platforms and basically came up with the conclusion that all three were equally good -- at least on bacteria? The SOLiD may have come out ahead on SNP calling.

                            I believe the problem is not apples-to-apples but rather the other considerations:

                            (1) Ease of lab prep.
                            (2) Cost of running.
                            (3) Length of reads.
                            (4) Number of reads.
                            (5) Which machines my organization will pony up the money for. :-)

                             My organization has two sequencers -- a 454 and a SOLiD. As a computer guy, which do I like better? It depends on the project. Would I like a Solexa? Sure. Supposedly easier chemistry than the SOLiD, with longer reads, but more expensive to run, with fewer reads and not as good SNP calling as the SOLiD. But heck, if the powers that be want to buy us a Solexa and pay for the service contract... well, I suspect we'd find room in our already over-crowded lab for it.

                            What I really want to see is a paper comparing the sequencing of repetitive eukaryotic organisms (not human!) when given a project with "X" dollars to spend and "Y" weeks to complete it.


                            • new300
                              Member
                              • Mar 2008
                              • 50

                              #15
                               Originally posted by westerman:
                              Wasn't there a paper within the last several months which compared all three platforms and basically came up with the conclusion that all three platforms were equally good -- at least on bacteria. The SOLiD may have come out ahead on SNP calling.
                              I remember seeing this but not having the time to read it, do you have the citation?

                               Originally posted by westerman:
                              I believe the problem is not apples-to-apples but rather the other considerations:

                              (1) Ease of lab prep.
                              (2) Cost of running.
                              (3) Length of reads.
                              (4) Number of reads.
                              (5) Which machines my organization will pony up the money for. :-)
                               Agreed, and it's not always a question of purchasing; freebies also get handed out to promote a product. It all muddies the water somewhat.

                               Originally posted by westerman:
                              My organization has two sequencers -- a 454 and a SOLiD. As a computer guy which do I like better? It depends on the project. Would I like a Solexa? Sure. Supposedly easier chemistry than the SOLiD with longer reads but more expensive to run with fewer reads and not as good SNP calling as the SOLiD. But heck, if the powers that be want to buy us a Solexa and pay for the service contract ... well, I suspect that we find room in our already over-crowded lab for it.
                              How many raw and aligned reads per run do you get out of your Solid?

                               Originally posted by westerman:
                              What I really want to see is a paper comparing the sequencing of repetitive eukaryotic organisms (not human!) when given a project with "X" dollars to spend and "Y" weeks to complete it.
                               I guess what you really want is to look at a variety of sequence structures for a variety of applications (SNP calling, de novo assembly, CNV, structural variant stuff, etc.). That would be very interesting.

                               Most of the genome centers seem to be gearing up with Illuminas at the moment. Sanger have 40-odd, WashU 35... I've not seen much hard evidence to back up the SOLiDs, but then I've mostly worked with Solexa data.

