SEQanswers

Old 12-17-2008, 02:51 AM   #1
fabio25
Member
 
Location: italy

Join Date: Aug 2008
Posts: 13
Default Solid VS Solexa

Dear Everybody,
I would like to ask if someone can help me understand the difference between the SOLiD machine and the Solexa one. Which tools are used to analyse each type of data, and how does the wet-lab work differ?
Thanks a lot
fabio25 is offline   Reply With Quote
Old 12-17-2008, 07:47 AM   #2
joa_ds
Member
 
Location: belgium

Join Date: Dec 2008
Posts: 52
Cool

good question.

First of all, everybody says their machine is the best choice (of course, they bought it...). No real benchmarking has been done in the past.

They all have their own good points and weak points...

But a Solexa and a SOLiD are waaaay different. Just the chemistry is totally different: nucleotide space vs. colour space. I can imagine the SOLiD pipeline will be totally different. Maybe only the image processing can be somewhat the same, but even then...

http://www.in-sequence.com/issues/2_.../144249-1.html

happy reading
joa_ds is offline   Reply With Quote
Old 12-19-2008, 02:37 AM   #3
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by joa_ds View Post
good question.

First of all, everybody says their machine is the best choice (of course, they bought it...). No real benchmarking has been done in the past.
I think a bunch of benchmarking has been done, by genome centres for example. But I don't think much of it has been published.

From what I've heard SNP call error rate on a good SOLiD run is about the same as an Illumina. I don't know what the run failure rate is like.

From my brief look, the single colour change error rate on the SOLiD is somewhere around 7%; that was in the E. coli data release they did: http://www.genographia.org/portal/to...rimer.pdf/view The Illumina error rate is around 1 or 2% on a good run (including contamination in both cases, doing a brute-force alignment). The other issue is that a single error in a SOLiD read effectively corrupts the rest of the read, unless you have a reference. So for anything de novo, you're a bit stuck.

My feeling is that the market is showing that right now the GA is a more versatile platform, with longer reads and a lower base error rate. If you look at the number of GA publications against the number of SOLiD publications that gives you a good idea of how useful people are finding that data. There's a neat graph here: www.mrgc.com.my
new300 is offline   Reply With Quote
Old 12-19-2008, 03:36 AM   #4
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Quote:
Originally Posted by new300 View Post
I think a bunch of benchmarking has been done, by genome centres for example. But I don't think much of it has been published.

From what I've heard SNP call error rate on a good SOLiD run is about the same as an Illumina. I don't know what the run failure rate is like.

From my brief look, the single colour change error rate on the SOLiD is somewhere around 7%; that was in the E. coli data release they did: http://www.genographia.org/portal/to...rimer.pdf/view The Illumina error rate is around 1 or 2% on a good run (including contamination in both cases, doing a brute-force alignment). The other issue is that a single error in a SOLiD read effectively corrupts the rest of the read, unless you have a reference. So for anything de novo, you're a bit stuck.

My feeling is that the market is showing that right now the GA is a more versatile platform, with longer reads and a lower base error rate. If you look at the number of GA publications against the number of SOLiD publications that gives you a good idea of how useful people are finding that data. There's a neat graph here: www.mrgc.com.my
1. Why do you compare the single color change to the base call error rate on Illumina? 2. SOLiD de novo assembly can be done with error correction, at least with Velvet. 3. The publication numbers reflect the ratio of Illumina:SOLiD instruments in use more than anything else, I guess. With the coming updates it will probably be competitive with Illumina in terms of handling as well.
Chipper is offline   Reply With Quote
Old 12-19-2008, 04:25 AM   #5
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by Chipper View Post
1. Why do you compare the single color change to the base call error rate on Illumina?
Because that's what you'll have to deal with if you're doing de novo stuff. The 2 colour change makes it comparable for SNP calling but not de novo...

Quote:
Originally Posted by Chipper View Post
2. SOLiD de novo assembly can be done with error correction, at least with Velvet.
Error correction only buys you so much, and the quality of your assembly will be a function of the base error rate and read length. GA reads have a lower error rate... so I think for this application they are probably better.

I've not seen any de novo SOLiD assemblies so if anybody has this I'd be interested in taking a look.

Quote:
Originally Posted by Chipper View Post
The publication numbers reflect more the ratio of Ill:SOLiD in use than anything else I guess.
I think the fact that there are more GAs in use reflects the fact that people prefer them and get more data out of them...

Quote:
Originally Posted by Chipper View Post
With the coming updates it will probably be competitive with Ill. in terms of handling as well.
Yep, we'll have to wait and see; the market's always changing.
new300 is offline   Reply With Quote
Old 01-06-2009, 01:35 AM   #6
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by Chipper View Post
1. Why do you compare the single color change to the base call error rate on Illumina? 2. SOLiD de novo assembly can be done with error correction, at least with Velvet. 3. The publication numbers reflect the ratio of Illumina:SOLiD instruments in use more than anything else, I guess. With the coming updates it will probably be competitive with Illumina in terms of handling as well.
When doing de novo sequence assembly you're essentially aligning in colour space, just treating the 0, 1, 2 and 3 as four characters to align (e.g. rename them, albeit misleadingly, ACGT if it makes programs work). In this respect it's incorrect to claim that a single error makes the rest of the read unusable from that point on. However, it's also incorrect to assume you need two adjacent errors for a problem to arise.

For what it's worth even when doing mapping experiments the single error rate DOES still somewhat matter. It directly relates to your mapping confidence values. Have too many errors and you'll find it both hard to map and also have a significant chance of placing things in the wrong location.
jkbonfield is offline   Reply With Quote
Old 01-06-2009, 06:43 AM   #7
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by jkbonfield View Post
When doing de novo sequence assembly you're essentially aligning in colour space, just treating the 0, 1, 2 and 3 as four characters to align (e.g. rename them, albeit misleadingly, ACGT if it makes programs work). In this respect it's incorrect to claim that a single error makes the rest of the read unusable from that point on. However, it's also incorrect to assume you need two adjacent errors for a problem to arise.
I probably didn't explain myself very well. If you align in colour space your "colour assembly" will probably be ok. However, you need to get back into base space at some point. The two options I can see are:

1. You translate the assembled colour space contigs into base space. This will be bad because a single error will corrupt the rest of the contig.

2. You align in colour space, then translate the individual reads back into base space. In this case you limit corruption to the remaining part of that single read. So the read was useful for building contigs, but unless you were able to correct the error, not for base calling.

In practice I'd expect this to cause significant issues for de novo assembly, but there might be ways round these issues I've not considered.
new300 is offline   Reply With Quote
Old 01-06-2009, 07:07 AM   #8
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146
Default

Quote:
Originally Posted by new300 View Post
1. You translate the assembled colour space contigs into base space. This will be bad because a single error will corrupt the rest of the contig.
You sort of want to avoid doing this until as late as possible - but fundamentally there'll come a time when it needs to be done before the assembly has been finished, e.g. to merge with other data or to start sequence analysis on an unfinished genome.

Quote:
Originally Posted by new300 View Post
2. You align in colour space, then translate the individual reads back into base space. In this case you limit corruption to the remaining part of that single read. So the read was useful for building contigs, but unless you were able to correct the error, not for base calling.
Well that individual read's contribution was poor for the consensus generation, but the rest are hopefully enough to compensate.

I think there's a 3rd route too, which is a combination of 1 and 2 above. You can compute the consensus from all reads in colour space, like option 1, before converting to DNA space for use in other tools. However, using the known last base of the primer for each read, we can verify whether the sequence matches the consensus. If it doesn't, then it implies that in the last few bases a consensus colour call was incorrect and our colour-to-DNA conversion became out of sync.

Essentially this is using the last primer base as an auto-correction system to ensure that we always know which of the 4 "phases" the colour to base conversion system should be in. If we have sufficient depth then we'll get the resolution quite high, possibly to the base level (say 25 fold and above). It's not as robust as comparison against a reference sequence and SNP correction as we only have one correcting factor per read rather than per base, but there's sufficient information to use still. This of course assumes that the assembly is correct. Misassemblies would cause problems still.

Rather messy and personally not appealing unless we could see some tangible gain for having to go through the extra hoops.
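A toy sketch of how those per-read anchor bases could bracket a phase error (my own illustration with made-up data, not an implementation of any real pipeline):

```python
# Sketch: decode a colour-space consensus naively, then use the known
# primer-adjacent first base of each aligned read as an anchor. Anchors that
# disagree with the decode bracket the region where a colour call went wrong.
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def decode_from(first, cols):
    """Naive colour-to-base decode seeded with one known base (0-3)."""
    bases = [first]
    for c in cols:
        bases.append(bases[-1] ^ c)
    return bases

def out_of_phase(consensus_cols, first_base, anchors):
    """anchors: (position, base) pairs, i.e. the known first base of each
    read and where it sits on the consensus. Returns the anchor positions
    where the decode disagrees."""
    decoded = decode_from(ENC[first_base], consensus_cols)
    return [pos for pos, base in anchors if decoded[pos] != ENC[base]]

cols = [1, 3, 1, 3, 1, 3, 1, 3, 1]   # colours of ACGTACGTAC
cols[4] ^= 2                          # one miscalled consensus colour
print(out_of_phase(cols, 'A', [(0, 'A'), (3, 'T'), (6, 'G'), (9, 'C')]))
# [6, 9]: the decode is out of phase somewhere between anchors 3 and 6
```

As the discussion says, the resolution depends on anchor density: with one anchor per read you can only localise the bad call to the gap between the last agreeing anchor and the first disagreeing one.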
jkbonfield is offline   Reply With Quote
Old 01-07-2009, 01:42 AM   #9
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I second James' third idea. Both AB and maq implement "reference-based translation" from color space to nucleotide space. Such translation is very robust to color errors: if the read mapping is right, we can confidently correct most color errors. I do not know how AB achieves this; maq does so with a simple O(4*4*L)-time dynamic programming (the DP part is just 50 lines of C code). This DP can also realize James' idea: we take the color contig as a "read" and take the sequence of the first nucleotide on each individual read as the "reference"; some holes in the "reference" should not matter too much. Translating a color contig in this way is also very robust to color errors.
lh3 is offline   Reply With Quote
Old 01-07-2009, 03:14 AM   #10
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by jkbonfield View Post
I think there's a 3rd route too which is a combination of 1 and 2 above. You can compute the consensus from all reads in colour space, like option 1, before converting to DNA space for use in other tools. However using the known last base of the primer for each read we can verify whether the sequence matches a consensus. If it doesn't then it implies in the last few bases a consensus colour call was incorrect and our colour to dna conversion became out of sync.

Essentially this is using the last primer base as an auto-correction system to ensure that we always know which of the 4 "phases" the colour to base conversion system should be in.
ok yes. If I've understood correctly:

You build a colour space consensus. Align colour space reads to it. Then convert the consensus and the first bases of the read to base space (on the fly). If you compare these and you get a mismatch, then you know your consensus has gone out of phase (or there was some really horrible error in the read).

So... when you do detect an error, the bases between the last known good initial base and the next known good one are in doubt. To get this down to base resolution you'd need coverage == read length.

I think that's a neat trick that could help out a fair bit, though like you say it's not as robust as the SNP trick.

I think there is some probability of error in that first base. It's a single colour change error, so based on the data I've seen I'd guess around 2%, as for a normal base it's about 6->8%... So you could end up marking a bad base good, or a good base bad. The resolution would need to be high to avoid this affecting a large number of bases.

Quote:
Originally Posted by jkbonfield View Post
If we have sufficient depth then we'll get the resolution quite high, possibly to the base level (say 25 fold and above). It's not as robust as comparison against a reference sequence and SNP correction as we only have one correcting factor per read rather than per base, but there's sufficient information to use still. This of course assumes that the assembly is correct. Misassemblies would cause problems still.

Rather messy and personally not appealing unless we could see some tangible gain for having to go through the extra hoops.
It sounds like a technique like this would be the right kind of strategy for SOLiD data. I think it's a workaround for the problems caused by 2-colour changes in de novo assembly though; it's not buying you error correction (with respect to base space) like with the SNP stuff.

So when you're doing your actual assembly you're still left with the single colour change error rate when overlapping. I think with an error rate this high, and short reads, you'd be lucky to produce a consensus good enough to work from... maybe if you could filter out a lot of the errors...

Right now I can't see that the Solids are likely to be competitive for de novo, not compared with the GAs. Read lengths would need to be longer and the single colour change error rate lower. Either that or they'd need a throughput advantage of at least an order of magnitude.
new300 is offline   Reply With Quote
Old 01-07-2009, 11:05 AM   #11
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Is an error rate of 6-8% (single colour change) really normal? How is this value calculated, and what is the corresponding error rate for the Illumina? Are the numbers affected by the lack of filtering of empty or mixed beads on the SOLiD, and if so, would it be better to apply quality filtering on SOLiD data before doing de novo assembly?
Chipper is offline   Reply With Quote
Old 01-07-2009, 12:17 PM   #12
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by Chipper View Post
Is an error rate of 6-8% (single colour change) really normal? How is this value calculated, and what is the corresponding error rate for the Illumina? Are the numbers affected by the lack of filtering of empty or mixed beads on the SOLiD, and if so, would it be better to apply quality filtering on SOLiD data before doing de novo assembly?
It's what I saw when I did a brute-force alignment of the E. coli data release (http://www.genographia.org/portal/to...rimer.pdf/view). My understanding is that empty beads are filtered early on. There were also no reads with more than 8 errors, which makes me think some filtering had been applied. Some MSc students I was working with also saw similar error rates in the Yoruban dataset.

You should also be able to calculate the single colour change error rate from the SNP miscall rate. I've seen this quoted as 0.036, which I think should be roughly equivalent to a single colour change error rate of 6%. Those are the only numbers I have to go on; a comprehensive review would be useful.

I think additional filtering would help; it's a trade-off between that and throughput.

As delivered by the device and quoted in throughput numbers, the Illumina error rate is around 1%. They apply relatively harsh filtering to remove mixed clusters during primary data analysis. I'd be interested to see a SOLiD dataset where filtering had been applied to get the single colour change error rate down to 1%; that would make for a useful comparison.
new300 is offline   Reply With Quote
Old 02-18-2009, 09:52 AM   #13
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

I am still curious as to how SOLiD and Solexa compare apples-to-apples. Both produce short reads, but still not much about how similar or complementary they are!

Met a few at AGBT and still could not find the answers..
bioinfosm is offline   Reply With Quote
Old 02-19-2009, 11:04 AM   #14
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Wasn't there a paper within the last several months which compared all three platforms and basically came up with the conclusion that all three platforms were equally good -- at least on bacteria. The SOLiD may have come out ahead on SNP calling.

I believe the problem is not apples-to-apples but rather the other considerations:

(1) Ease of lab prep.
(2) Cost of running.
(3) Length of reads.
(4) Number of reads.
(5) Which machines my organization will pony up the money for. :-)

My organization has two sequencers -- a 454 and a SOLiD. As a computer guy, which do I like better? It depends on the project. Would I like a Solexa? Sure. Supposedly easier chemistry than the SOLiD, with longer reads, but more expensive to run, with fewer reads and not as good SNP calling as the SOLiD. But heck, if the powers that be want to buy us a Solexa and pay for the service contract... well, I suspect we'd find room in our already over-crowded lab for it.

What I really want to see is a paper comparing the sequencing of repetitive eukaryotic organisms (not human!) when given a project with "X" dollars to spend and "Y" weeks to complete it.
westerman is offline   Reply With Quote
Old 02-19-2009, 12:19 PM   #15
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by westerman View Post
Wasn't there a paper within the last several months which compared all three platforms and basically came up with the conclusion that all three platforms were equally good -- at least on bacteria. The SOLiD may have come out ahead on SNP calling.
I remember seeing this but not having the time to read it, do you have the citation?

Quote:
Originally Posted by westerman View Post
I believe the problem is not apples-to-apples but rather the other considerations:

(1) Ease of lab prep.
(2) Cost of running.
(3) Length of reads.
(4) Number of reads.
(5) Which machines my organization will pony up the money for. :-)
Agreed and it's not always a question of purchasing but freebies also get handed out to promote a product. It all muddies the water somewhat.

Quote:
Originally Posted by westerman View Post
My organization has two sequencers -- a 454 and a SOLiD. As a computer guy which do I like better? It depends on the project. Would I like a Solexa? Sure. Supposedly easier chemistry than the SOLiD with longer reads but more expensive to run with fewer reads and not as good SNP calling as the SOLiD. But heck, if the powers that be want to buy us a Solexa and pay for the service contract ... well, I suspect that we find room in our already over-crowded lab for it.
How many raw and aligned reads per run do you get out of your Solid?

Quote:
Originally Posted by westerman View Post
What I really want to see is a paper comparing the sequencing of repetitive eukaryotic organisms (not human!) when given a project with "X" dollars to spend and "Y" weeks to complete it.
I guess what you really want is to look at a variety of sequence structures for a variety of applications (SNP calling, de novo assembly, CNV, structural variant stuff etc.). It'd be very interesting.

Most of the genome centers seem to be gearing up with Illuminas at the moment. Sanger have 40-odd, WashU 35... I've not seen much hard evidence to back up SOLiDs, but then I've mostly worked with Solexa data.
new300 is offline   Reply With Quote
Old 02-19-2009, 01:31 PM   #16
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

Quote:
Originally Posted by bioinfosm View Post
I am still curious as to how SOLiD and Solexa compare apples-to-apples. Both produce short reads, but still not much about how similar or complementary they are!

Met a few at AGBT and still could not find the answers..
It's not easy to compare since throughput changes so fast on both instruments - for example, the latest Genome Biology RNA-seq paper used 38 lanes to get 138 M aligned reads, which is a number you can get from one SOLiD slide (1/2 run) today. What the current numbers are for the GA-II I do not know. What sort of apples are you interested in comparing?...
Chipper is offline   Reply With Quote
Old 02-19-2009, 02:09 PM   #17
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

I am interested in the quality of data. Using, say, 6 million 35 bp reads on the same sample, which instrument should one prefer, say for SNP calling? From a C. elegans comparison paper, it looks like SOLiD has a slight advantage in calling rare SNPs. Does its 2-base encoding really give more accurate results?
bioinfosm is offline   Reply With Quote
Old 02-20-2009, 08:36 AM   #18
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by new300 View Post
How many raw and aligned reads per run do you get out of your Solid?
From a project that I have been working on this week since the data come off the sequencer Monday evening. This is one run. Mate-paired 25-base to a non-human eukaryotic organism. One region/plate.

Raw reads: ~142M

Mapped R3 reads: ~114M for unique & random at 3 mismatches
Mapped F3 reads: ~118M (ditto)

Mapped R3 reads: ~77M for uniquely placed reads at 3 mismatches
Mapped F3 reads: ~75M (ditto)

Paired F3-R3 reads: ~78M

So approximately 3,900 Mbases (78M reads times 50 bases).

SNP analysis is currently in progress on the paired reads. From my work with the mapped but not-paired reads we should obtain quite a few SNPs.
westerman is offline   Reply With Quote
Old 02-20-2009, 09:43 AM   #19
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by bioinfosm View Post
I am interested in the quality of data. Using, say, 6 million 35 bp reads on the same sample, which instrument should one prefer, say for SNP calling? From a C. elegans comparison paper, it looks like SOLiD has a slight advantage in calling rare SNPs. Does its 2-base encoding really give more accurate results?
In theory color-space should give more accurate results for SNP calling. The concept is that it takes two adjacent color-space mismatches to indicate a SNP. If you see a single color-space mismatch then you can flag that read as a sequencer error. Compare this to traditional base-space where, when you see a single mismatch, you have no idea if it arises from a sequencer error or a SNP. Depth of coverage can help resolve the problem, but there are limits to that, especially for rare SNPs.
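The rule can be illustrated with a toy color-space comparison (my own sketch, not any vendor's SNP caller; the colors are the usual XOR encoding of adjacent bases):

```python
# Toy illustration of the adjacent-mismatch rule: compare a read's colors to
# the reference's; an isolated color mismatch is most likely a sequencing
# error, while two adjacent mismatches are consistent with a real SNP.
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def colours(seq):
    return [ENC[a] ^ ENC[b] for a, b in zip(seq, seq[1:])]

def classify(read_cols, ref_cols):
    mism = [i for i, (r, f) in enumerate(zip(read_cols, ref_cols)) if r != f]
    if not mism:
        return 'match'
    if len(mism) == 2 and mism[1] == mism[0] + 1:
        return 'candidate SNP'            # substitution at base mism[0] + 1
    return 'likely sequencing error(s)'

ref = 'ACGTACGT'
print(classify(colours('ACGAACGT'), colours(ref)))   # candidate SNP (T->A)
bad = colours(ref)
bad[2] ^= 1                                          # single color miscall
print(classify(bad, colours(ref)))                   # likely sequencing error(s)
```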

In practice the rate of sequencer error could play a major role. Obviously if there is too much sequencer error then too much data will be thrown away and nothing will be found. The SOLiD's error rate may be higher than the Solexa's. I do not have firm numbers on this, however.

Let's do a couple of thought experiments. Say that there is a common SNP that occurs in 50% of the population. Furthermore say that the SOLiD has a 0.5% error rate per base while the Solexa is 1/5 that - 0.1% per base [note that I am just making up those numbers -- the actual rates are probably much different]. If we pool 100 individuals together in a run of 25 mers then -- very roughly since I am doing simple probability here --

The SOLiD run will -- for sequencer errors -- generate 12 - 13 runs with a single mismatch and 0 - 1 runs with adjacent mismatches.

Co-mingled with the above will be 50 runs with 2 adjacent mismatches that represent the SNPs.

So overall there will be about:

44 runs without mismatches -- the non-SNPs
44 runs with adjacent mismatches - the SNPs plus *maybe* 1 error run
12 runs with non-adjacent mismatch(es) -- errors for both non-SNPs and SNPs

When we look at the data we would toss out the non-adjacent mismatch reads as errors. We would then pick up 44 adjacent mismatch runs representing the same SNP and maybe 1 run representing a different (and erroneous) SNP.

For the Solexa there would be:
52 runs with a mismatch(es) -- 50 real SNPs and 2 or maybe 3 runs with errors.
48 runs without mismatches.

Once again it is easy to pick up the true SNP since 50 of the runs all have a mismatch in the same location and the 2 or 3 runs that indicate SNPs are simply errors and could be tossed.

Now ... for the rare variant that occurs in 2% of the population.

The SOLiD has
84 runs with no mismatches
12 runs with non-adjacent mismatch(es)
2 runs with adjacent mismatches and *maybe* 1 adjacent mismatch error run

Those two adjacent mismatches are the real SNP. The errors are simply tossed.

The Solexa has
96 runs with no mismatches
4 (maybe 5) runs with mismatches.

2 of the mismatches are the real SNP while 2 or 3 are errors.

In neither case does the platform pick up the real SNP unambiguously -- it is hard to do when sequencers generate errors -- but the SOLiD (and color space) does work, in theory, better with the rare variants. It works even better if we assume that the sequencer error rate is the same as the Solexa's.

Next up: color space and indels. Once my head stops hurting.
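For what it's worth, the expected counts in the thought experiment can be sanity-checked with a one-liner per platform (using the same made-up per-base error rates as above):

```python
# Expected number of 25-mers containing at least one sequencing error,
# out of 100 reads, at the made-up per-base error rates used above.
n_reads, read_len = 100, 25
for platform, err in [('SOLiD', 0.005), ('Solexa', 0.001)]:
    p_any = 1 - (1 - err) ** read_len      # P(read has >= 1 error)
    print(platform, round(n_reads * p_any, 1))
# SOLiD ~11.8 reads with an error, Solexa ~2.5: close to the 12-13 and
# 2-3 figures used in the examples above.
```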
westerman is offline   Reply With Quote
Old 03-02-2009, 02:58 PM   #20
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

Quote:
Originally Posted by westerman View Post
So approximately 3,900 Mbases (78M reads times 50 bases).
So, I can't really see the throughput advantage of the Solid there. GA1 runs I've seen are around 4Gb. If you look at the short read archive GA2 runs are around 7Gb+ with 35bp reads. For PhiX around 95% of Illumina reads align within 2 errors. For human I think you tend to see about 80%. Those are 35bp reads I believe. There are 50bp reads in the SRA which appear to go up to 14Gb.
new300 is offline   Reply With Quote