SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > SOLiD



Similar Threads
Thread Thread Starter Forum Replies Last Post
De novo transcriptome quality metrics? LizBent De novo discovery 7 06-25-2015 04:56 AM
PubMed: Mauve assembly metrics. Newsbot! Literature Watch 0 01-04-2012 03:10 AM
How to get assembly metrics from (AMOS) minimus and minimo assemblies? Estefania De novo discovery 1 12-17-2011 07:11 AM
Problem with QC metrics froggins Bioinformatics 2 07-27-2011 10:50 AM
Error message from QC metrics froggins The Pipeline 1 01-05-2011 07:53 AM

Reply
 
Thread Tools
Old 08-27-2008, 02:39 PM   #1
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default Metrics for usability of a SOLiD dataset

The following derives largely from an email I sent to some "fellow travelers" before I heard of this web site.

Consider 3 SOLiD runs we did, all genomic DNA (fragment, 35mers, single flowcell, 1 region): Arabidopsis (Ath), Wheat and Honeybee (Bee37). From various measures (satays, etc.) they look good. But what does that mean?

The standard AB metric appears to be "mappable reads"--defined as number of reads that align with 3 or less mismatches with the reference sequence. That doesn't work very well as a measure of the usability of any of our data sets[1].

One metric we use for Sanger sequencing is "how good does the data set think it is?", or more often "how good does the program phred think the data set is?" as recorded in quality values for each base. While not perfect, it gives you a vague idea of where you stand. That is, the quality values for each base call in the quality file gives an estimate of the probability that the base call is wrong. Add them up in the right way for a read and you have an estimate of the number of errors for that read.

That sounds like a useful metric to me. I wrote a simple perl script to take a quality file and print an error/read histogram. That is, number of reads in each category. The category is 0-1 errors/read, 1-2 errors/read, etc:

http://www.genomics.purdue.edu/~pmig...ist_solid.perl

It is very slow on 150 million+ reads. So I just ran it on the first million reads from each data set.

What may be a useful metric is number of beads/reads/bases at some given number of expected errors/read or less. I chose 3 and 6 errors per read for the table below. This would correspond to a perfect reference sequence against reads with 3 or 6 mismatches set. More about why I chose "3" and "6" below.
http://www.genomics.purdue.edu/~pmig...uns_table.html



Here note that the runs have quite different characteristics. First they have different numbers of usable beads--from the standard 150 million up to 275 million for Ath. Also their error frequencies per read are different.


Why "3" and "6" errors (miscalls)? "3" is in the standard definition of "mappable". Since originally it was designed for 25mers, I wanted to get to the same chance of mis-mapping a 35mer read. A simplistic measure of this would be aligning a random 35mer against a 35 base segment of reference sequence. Presuming equal A,C,G and T composition (which is roughly true in E. coli, but not in Wheat, Honeybee nor Arabidopsis) I think the formula for calculating this would be (don't trust me on this though...)


p= sum from 0 to m of ( (1/4)^n * (3/4)^(n-m) * (n! / (m! * (n-m)!)) )

where "m" is number of allowed mismatches, "n" is the read length and "p" is the probability for a single comparison for a random (spurious) match.

The number I get for 3 mismatches at 25 bases is 5.8E-11. The number I get for 7 mismatches at 35 bases is 1.3E-11. But given that genome compositions I am likely to see will perform worse than these ideal situations, I drop down to 6 mismatches.

A word about quality here. A trimmed Sanger read with 6 errors per 35 bases is terrible quality. However SOLiD reads are dual base encoded, so they should perform better for generating accurate consensuses than single base encoded sequence reads. On the other hand, the quality values I'm dealing with here are generated by the SOLiD base caller. Do they accurately reflect the real chance of miscall?

I don't know. (I would be very interested to hear from anyone who does know.) It would require quite a bit of work to test. I presume Applied Biosystems has tuned their base caller to give fairly accurate quality values. But I haven't verified this.

In the absence of an accurate reference sequence for any given project, I think this gives some information at least on how useful for a given purpose a data set will be. Or at least a general sense of the overall quality of a data set.

Finally, I mentioned to some of you that I suspected that the error rate for the last 10 bases of a 35mer might be much (like 5x) higher than the rest of the read. If these quality values are to be believed, errors are more likely in the last 10 bases than the first 10 bases, but not drastically so. For the Arabidopsis data set the mean number of projected errors is 1.4 base out of 10 whereas it is 2.3 bases out of 10 for the last 10 bases. That includes all the really bad reads, in the data set, though.

Phillip


[1]The problem is getting a valid reference sequence. Arabidopsis-Col-0 is sequenced, most of our Arabidopsis sequence derives from Arabidopsis-Ler (not sequenced). Wheat is not sequenced. Honeybee has a draft sequence, but honeybee is very heterozygous and our queen bee would have large stretches of her genome from an African source. (The Honeybee reference is European.)
pmiguel is offline   Reply With Quote
Old 09-29-2008, 12:57 PM   #2
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

As far as I understand it...your script calculates single colorspace errors, right? Rather than "miscalls" in true basespace?
ECO is offline   Reply With Quote
Old 09-30-2008, 07:23 AM   #3
new300
Member
 
Location: northern hemisphere

Join Date: Mar 2008
Posts: 50
Default

I guess it's application dependent. What are you intended to do with SOLiD reads without a reference? I would have thought that with short color space reads there's little you can do but SNP calling against a reference, but I could be wrong.

If you're aligning to a reference, any reference I would have thought it would make sense to calculate the error rate against this.
new300 is offline   Reply With Quote
Old 10-02-2008, 03:29 PM   #4
snetmcom
Senior Member
 
Location: USA

Join Date: Oct 2008
Posts: 158
Default

You may find this useful, but it's falling into many of the pitfalls of non Solid informatics people.
snetmcom is offline   Reply With Quote
Old 10-04-2008, 09:33 AM   #5
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,318
Default

Quote:
Originally Posted by ECO View Post
As far as I understand it...your script calculates single colorspace errors, right? Rather than "miscalls" in true basespace?
Yes. Except it estimates the number of errors per read based on the quality values assigned by the SOLiD base(color) caller.

For example, if a read is 35 bases and each base had a quality value of 10, then that is a 10% chance of error per base. So the estimated number of miscalls would be 3.5 =(0.1*35). But if each base had a quality value of 20, the estimated number of miscalls for that read would be 0.3 =(0.01*35).

Of course normal reads will have different quality values for each base. To estimate the number of miscalls, the script just adds up the estimated chance of a miscall for each base.

The major pitfall here is that I have no idea whether the SOLiD base caller accurately predicts its own error rate. I gather that the SOLiD base caller is tuned on mappable reads (those with 3 errors or less). Should be possible to check how it does on reads mapped with up to 6 errors against a reference sequence without a lot of redundant/low complexity segments. But I have not done this.

--
Phillip
pmiguel is offline   Reply With Quote
Old 10-13-2008, 11:41 PM   #6
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Quote:
Originally Posted by snetmcom View Post
You may find this useful, but it's falling into many of the pitfalls of non Solid informatics people.
I'd love to hear more on that line of thinking....
ECO is offline   Reply With Quote
Reply

Tags
abi, quality values, solid

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:12 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO