I am interested in the quality of data. Using, say, 6 million 35 bp reads on the same sample, which instrument should one prefer, say for SNP calling? From a C. elegans comparison paper, it looks like SOLiD has a slight advantage in calling rare SNPs. Does its 2-base encoding really give more accurate results?
--
bioinfosm
-
Originally posted by new300: "How many raw and aligned reads per run do you get out of your Solid?"
Raw reads: ~142M
Mapped R3 reads: ~114M for unique & random at 3 mismatches
Mapped F3 reads: ~118M (ditto)
Mapped R3 reads: ~77M for uniquely placed reads at 3 mismatches
Mapped F3 reads: ~75M (ditto)
Paired F3-R3 reads: ~78M
So approximately 3900 Mbases (78M times 50 bases).
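For anyone who wants to redo the arithmetic on these figures, here is a minimal Python sketch. It assumes the ~142M raw reads figure is the right denominator for each of the mapped counts, which the post does not actually state, so read the percentages as rough. The function names are made up for illustration.

```python
# Figures quoted in the post, in millions of reads.
RAW_READS_M = 142
MAPPED_M = {
    "R3, unique + random": 114,
    "F3, unique + random": 118,
    "R3, unique only": 77,
    "F3, unique only": 75,
}
PAIRED_M = 78
BASES_PER_PAIR = 50  # the "78M times 50 bases" factor from the post


def mapped_fraction(mapped_m, raw_m=RAW_READS_M):
    """Fraction of raw reads that mapped (assumes raw_m is the right denominator)."""
    return mapped_m / raw_m


def yield_mbases(pairs_m, bases_per_pair=BASES_PER_PAIR):
    """Sequence yield in Mbases: millions of pairs times bases per pair."""
    return pairs_m * bases_per_pair


for label, mapped in MAPPED_M.items():
    print(f"{label}: {mapped_fraction(mapped):.1%} of raw reads")
print(f"Paired yield: {yield_mbases(PAIRED_M)} Mbases")
```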
SNP analysis is currently in progress on the paired reads. From my work with the mapped but not-paired reads we should obtain quite a few SNPs.
-
Originally posted by bioinfosm: "I am interested in the quality of data. Using, say, 6 million 35 bp reads on the same sample, which instrument should one prefer, say for SNP calling? From a C. elegans comparison paper, it looks like SOLiD has a slight advantage in calling rare SNPs. Does its 2-base encoding really give more accurate results?"
In practice the rate of sequencer error could play a major role. Obviously if there is too much sequencer error then too much data will be thrown away and nothing will be found. The SOLiD's error rate may be higher than the Solexa's. I do not have firm numbers on this, however.
Let's do a couple of thought experiments. Say that there is a common SNP that occurs in 50% of the population. Furthermore, say that the SOLiD has a 0.5% error rate per base while the Solexa has one fifth of that, 0.1% per base [note that I am just making up those numbers -- the actual rates are probably much different]. If we pool 100 individuals together in a run of 25-mers and look at 100 reads covering the SNP then -- very roughly, since I am doing simple probability here --
From sequencer errors alone, the SOLiD run will generate 12 to 13 reads with a single mismatch and 0 to 1 reads with adjacent mismatches.
Co-mingled with the above will be 50 reads with 2 adjacent mismatches that represent the SNP.
So overall there will be about:
44 reads without mismatches -- the non-SNPs
44 reads with adjacent mismatches -- the SNPs plus *maybe* 1 error read
12 reads with non-adjacent mismatch(es) -- errors for both non-SNPs and SNPs
When we look at the data we would toss out the non-adjacent mismatch reads as errors. We would then pick up 44 adjacent-mismatch reads representing the same SNP and maybe 1 read representing a different (and erroneous) SNP.
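To see why a real SNP shows up as two adjacent color mismatches while a single miscalled color does not, here is a small Python sketch of 2-base encoding. It uses the usual trick of giving each base a 2-bit code and taking the XOR of neighbouring bases as the color. The sequences are hypothetical, and the check only covers a base change in the interior of the read; a change at the very first or last base touches only one color in this simplified model.

```python
# 2-bit codes for the bases; the color of a dinucleotide is the XOR of its codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as colors, one per pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def color_mismatches(ref_colors, read_colors):
    """Positions at which two equal-length color strings differ."""
    return [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]


def consistent_with_snp(ref_colors, read_colors):
    """True if the read differs from the reference at exactly two adjacent
    colors whose XOR is unchanged, as a single interior base change requires."""
    pos = color_mismatches(ref_colors, read_colors)
    if len(pos) != 2 or pos[1] != pos[0] + 1:
        return False
    i, j = pos
    return ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]


reference = "ACGTACGT"  # hypothetical reference fragment
snp_read = "ACGAACGT"   # same fragment with one interior base changed (T -> A)

ref_colors = to_colors(reference)
snp_colors = to_colors(snp_read)

# A single miscalled color, simulated by altering one color of the reference.
error_colors = list(ref_colors)
error_colors[4] ^= 1

print(color_mismatches(ref_colors, snp_colors))       # two adjacent positions
print(consistent_with_snp(ref_colors, snp_colors))    # True
print(consistent_with_snp(ref_colors, error_colors))  # False
```

The XOR test works because the two colors flanking a base x are a^x and x^b, and their XOR is a^b whatever x is. That is the property that lets isolated color errors be tossed.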
For the Solexa there would be:
52 reads with mismatch(es) -- 50 real SNPs and 2 or maybe 3 reads with errors.
48 reads without mismatches.
Once again it is easy to pick up the true SNP, since 50 of the reads all have a mismatch in the same location; the 2 or 3 reads that point to SNPs elsewhere are simply errors and can be tossed.
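The rough counts for the common-SNP case can be reproduced with a few lines of Python. This is only a sketch of the "simple probability" used above: it assumes errors are independent at a fixed per-position rate, uses the made-up 0.5% and 0.1% rates from the post, and the function names are mine.

```python
def p_read_has_error(per_base_error, read_length):
    """Probability that at least one position in a read is miscalled,
    assuming independent errors at a fixed per-position rate."""
    return 1.0 - (1.0 - per_base_error) ** read_length


def expected_reads(n_reads, snp_fraction, per_base_error, read_length):
    """Expected read counts split by SNP status and by whether the read has an error."""
    p_err = p_read_has_error(per_base_error, read_length)
    snp = n_reads * snp_fraction
    non_snp = n_reads - snp
    return {
        "clean_non_snp": non_snp * (1 - p_err),
        "clean_snp": snp * (1 - p_err),
        "with_error": n_reads * p_err,
    }


# Made-up error rates from the post: 100 reads of 25 positions, SNP in 50% of them.
for name, rate in [("SOLiD, 0.5%", 0.005), ("Solexa, 0.1%", 0.001)]:
    counts = expected_reads(100, 0.5, rate, 25)
    print(name, {k: round(v, 1) for k, v in counts.items()})
```

At 0.5% this gives roughly 44 clean non-SNP reads, 44 clean SNP reads and 12 reads with an error. At 0.1% only 2 to 3 reads out of 100 carry an error.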
Now ... for the rare variant that occurs in 2% of the population.
The SOLiD has
84 reads with no mismatches
12 reads with non-adjacent mismatch(es)
2 reads with adjacent mismatches and *maybe* 1 adjacent-mismatch error read
Those two adjacent mismatches are the real SNP. The errors are simply tossed.
The Solexa has
96 reads with no mismatches
4 (maybe 5) reads with mismatches.
2 of the mismatch reads are the real SNP while 2 or 3 are errors.
In neither case does the platform pick up the real SNP unambiguously -- it is hard to do when sequencers generate errors -- but the SOLiD (and color space) does work, in theory, better with the rare variants. It works even better if we assume that the SOLiD's sequencer error rate is the same as the Solexa's.
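The rare-variant comparison can be put in numbers with a short Python sketch. The assumptions are mine and deliberately crude: errors are independent, a base-space read can be mistaken for a SNP read if it has exactly one error, and a color-space read can be mistaken for one only if it has exactly two errors at adjacent positions. The rates are the made-up ones from above, and 25 positions per read is assumed for both platforms.

```python
def exactly_one_error(e, n):
    """P(exactly one miscalled position among n), independent errors at rate e."""
    return n * e * (1 - e) ** (n - 1)


def exactly_two_adjacent_errors(e, n):
    """P(exactly two miscalled positions among n, and they are neighbours)."""
    return (n - 1) * e ** 2 * (1 - e) ** (n - 2)


def rare_variant_summary(n_reads, snp_fraction, e, n_positions, color_space):
    """Expected (true SNP reads, error reads that look like a SNP read)."""
    mimic = exactly_two_adjacent_errors if color_space else exactly_one_error
    true_reads = n_reads * snp_fraction * (1 - e) ** n_positions
    false_reads = n_reads * (1 - snp_fraction) * mimic(e, n_positions)
    return true_reads, false_reads


scenarios = [
    ("SOLiD, 0.5% error", 0.005, True),
    ("SOLiD, 0.1% error", 0.001, True),
    ("Solexa, 0.1% error", 0.001, False),
]
for name, rate, color_space in scenarios:
    true_reads, false_reads = rare_variant_summary(100, 0.02, rate, 25, color_space)
    print(f"{name}: ~{true_reads:.2f} true SNP reads, ~{false_reads:.3f} look-alike error reads")
```

Under these assumptions the base-space case has about as many look-alike error reads as true SNP reads, while the color-space case has far fewer than one per 100 reads, and fewer again at the lower error rate.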
Next up: color space and indels. Once my head stops hurting.
-
Originally posted by westerman: "So approximately 3900 Mbases (78M times 50 bases)."