SEQanswers

Old 04-09-2009, 03:15 AM   #1
joa_ds
Member
 
Location: belgium

Join Date: Dec 2008
Posts: 52
454 homopolymer error rate

Hi 454 analysers,

We are doing resequencing experiments here and are developing our own mapping and SNP discovery pipeline, based on BLAT and a database.

We are finally able to detect SNVs, insertions, deletions and indels. We have already validated the pipeline with random errors and known errors. But now the final part appears to be trickier. The million-dollar question: is a given variation heterozygous or homozygous?

For example: at 400x coverage you see a variation frequency of 25% or 80%. What do you do with that? In very long homopolymer stretches, for example 10 Cs, chances are high that you get a spurious 20% of reads with 1 or 2 extra Cs.

We are building a mathematical model to determine the cutoff frequencies at a given coverage. The higher the coverage, the narrower the band within which a heterozygous variation frequency can fall, but how narrow?
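A minimal sketch of the kind of model described above. It assumes (this is an assumption, not something stated in the post) that the number of variant reads at a truly heterozygous site is binomial with an expected fraction of 0.5, and it ignores sequencing error and allele bias. The names `het_band` and `alpha` are made up for illustration.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (requires 0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def het_band(coverage, p=0.5, alpha=0.01):
    """Return (low, high) variant-read fractions for a true heterozygote.

    At most alpha / 2 of the probability mass is trimmed from each tail,
    so the band holds at least 1 - alpha of the expected outcomes.
    """
    pmf = [binom_pmf(k, coverage, p) for k in range(coverage + 1)]

    tail, low = 0.0, 0
    for k in range(coverage + 1):          # walk up from 0 variant reads
        if tail + pmf[k] > alpha / 2:
            low = k
            break
        tail += pmf[k]

    tail, high = 0.0, coverage
    for k in range(coverage, -1, -1):      # walk down from all variant reads
        if tail + pmf[k] > alpha / 2:
            high = k
            break
        tail += pmf[k]

    return low / coverage, high / coverage


if __name__ == "__main__":
    for cov in (40, 100, 400):
        low, high = het_band(cov)
        print(f"{cov}x: {low:.3f} to {high:.3f}")
```

Under these assumptions the band width shrinks roughly with the square root of the coverage. A real model would also have to account for the homopolymer error discussed below.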

I know the average error rate of 454 is around 1/1000, but that is not what I need, because low-frequency single-nucleotide variations get filtered out at a very early stage (I filter out everything below 20%). What remains is either true variation or very frequent errors (possibly homopolymer errors?).

OK, what I need is some kind of homopolymer error rate. I suppose it is linked to the length of the homopolymer: the longer it is, the more likely it is that errors will occur. Is there a known function that gives the error rate as a function of homopolymer length? I can calculate it myself, but I have a gut feeling it might not be that easy.
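One way to calculate this yourself, as a sketch: for every reference homopolymer covered by a read, record the run length in the reference and the run length the read reports, then take the fraction of miscalled runs per reference length. How those pairs are extracted from the alignments is left out here. The function names and the toy numbers are invented for illustration and are not measured 454 error rates.

```python
from collections import defaultdict
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Yield (start, base, length) for each run of identical bases in seq."""
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            yield pos, base, length
        pos += length


def error_rate_by_length(observations):
    """Map reference run length -> fraction of reads with a wrong run length.

    observations: iterable of (ref_len, read_len) pairs, one per read per
    reference homopolymer it covers.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ref_len, read_len in observations:
        totals[ref_len] += 1
        if read_len != ref_len:
            errors[ref_len] += 1
    return {length: errors[length] / totals[length] for length in sorted(totals)}


if __name__ == "__main__":
    print(list(homopolymer_runs("ACCCGTTTTTA")))
    # Toy observations (invented): 3-mers mostly right, 10-mers often off by one.
    toy = ([(3, 3)] * 98 + [(3, 4)] * 2
           + [(10, 10)] * 80 + [(10, 11)] * 15 + [(10, 9)] * 5)
    print(error_rate_by_length(toy))
```

Splitting the error count into over-calls and under-calls, or by base, would be a natural next step if the data are deep enough.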

Is anyone aware of a good article that describes the error rates for different types of errors? I have the article "Pyrobayes: an improved base caller for SNP discovery in pyrosequences", but that only describes general error rates for substitutions, deletions and insertions, not separate rates for homopolymer and non-homopolymer positions.

In the near future we will start with bisulfite-treated amplicon sequencing. With effectively only 3 nucleotides, the homopolymer error rate will be even higher, and I would like to investigate and model some of these frequencies up front so that I can determine a useful coverage.
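To get a feel for this up front, one could convert a reference amplicon in silico and compare homopolymer lengths before and after. The sketch below assumes the simplest case: every C is unmethylated and fully converted, so it reads as T on the forward strand. The amplicon string is invented.

```python
from itertools import groupby


def bisulfite_convert(seq):
    """Forward-strand conversion assuming no C is methylated (every C reads as T)."""
    return seq.upper().replace("C", "T")


def run_lengths(seq):
    """Lengths of all runs of identical bases, in order."""
    return [len(list(group)) for _, group in groupby(seq)]


def longest_run(seq):
    """Length of the longest homopolymer in seq (0 for an empty sequence)."""
    return max(run_lengths(seq), default=0)


if __name__ == "__main__":
    amplicon = "ATTCTCCTGACTTCA"   # invented example sequence
    converted = bisulfite_convert(amplicon)
    print(amplicon, longest_run(amplicon))
    print(converted, longest_run(converted))
```

In this toy case the longest run grows from 2 to 7 bases after conversion, which is the effect to plan coverage around.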
Old 04-09-2009, 05:50 AM   #2
Chema
Junior Member
 
Location: Poznan, Poland

Join Date: Jul 2008
Posts: 6

Perhaps this paper can help you:
http://genomebiology.com/content/pdf...7-8-7-r143.pdf
Old 04-09-2009, 07:53 AM   #3
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

That is very interesting; do keep us updated on what you observe.
I have been looking at BLAT for 454 data as well, especially because gsMapper does not have a parameter for adjusting gap penalties.
Old 04-09-2009, 08:07 AM   #4
joa_ds
Member
 
Location: belgium

Join Date: Dec 2008
Posts: 52

Hi, that paper is interesting indeed. I have it here on my desk, but it is quite outdated. I guess basecalling has improved since then, and it is not quite what I am looking for.

Given that an error occurs, they describe the chance of it being a homopolymer error or not.

I am thinking about it the other way around. When I observe a variation, is it in a homopolymer? If so, what is the chance of a random error in a homopolymer of that kind? I would then use that to decide between "sequencing error" and "true variation". I am trying my own approach to estimating the error rates, but any input is useful.
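As a sketch of that reverse approach: look up how long the reference homopolymer around the variation is, take an error rate for that length (the 0.20 and 0.05 below are placeholders, not measured values), and ask how likely it is to see at least the observed number of variant reads from error alone. This assumes reads err independently and with the same probability, which is a simplification. The function names are invented.

```python
from math import exp, lgamma, log


def homopolymer_length_at(ref, pos):
    """Length of the run of identical bases in ref that contains index pos."""
    base = ref[pos]
    start = pos
    while start > 0 and ref[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(ref) and ref[end + 1] == base:
        end += 1
    return end - start + 1


def prob_at_least(k, n, error_rate):
    """P(X >= k) for X ~ Binomial(n, error_rate), with 0 < error_rate < 1."""
    total = 0.0
    for i in range(k, n + 1):
        log_coeff = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        total += exp(log_coeff + i * log(error_rate)
                     + (n - i) * log(1 - error_rate))
    return min(total, 1.0)


if __name__ == "__main__":
    ref = "AGCCCCCCCCCCTG"            # invented reference with a run of 10 Cs
    print(homopolymer_length_at(ref, 5))
    # 100 variant reads out of 400 (25%), under two placeholder error rates
    print(prob_at_least(100, 400, 0.20))
    print(prob_at_least(100, 400, 0.05))
```

A small tail probability means the observed frequency is hard to explain by homopolymer noise alone under the assumed rate; a large one means the call should be treated with suspicion.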

I'll keep you updated...
Old 06-01-2009, 06:03 AM   #5
yannickwurm
Junior Member
 
Location: Queen Mary University London, UK

Join Date: Jan 2009
Posts: 5

Hi y'all

Do you have any info on how the 454 basecalling software has improved?
I.e., if I have data that's a year old, should I get out the raw image files and rerun the basecalling using the latest software?

Thanks & regards,

yannick
Old 07-24-2009, 08:11 AM   #6
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Does anyone have experience with the last two posts?
joa_ds, did you make any observations that can be shared here?
Old 07-24-2009, 10:12 AM   #7
hlu
Member
 
Location: Branford, Connecticut

Join Date: Jan 2009
Posts: 32

Quote:
Originally Posted by bioinfosm View Post
Does anyone have experience with the last two posts?
joa_ds, did you make any observations that can be shared here?

The paper is quite outdated. For one thing, it is about the GS20, which is not compatible with the current FLX and Titanium platforms.

Titanium and FLX have a different error profile from the GS20, and a much lower error rate.

My understanding is that the Titanium and FLX basecalling software is not compatible with GS20 raw images.
Old 12-17-2009, 06:55 AM   #8
avilella
Member
 
Location: uk

Join Date: Mar 2009
Posts: 34

Has anyone got any info for the latest batch of 454 runs (~260bp)?

Quote:
Originally Posted by hlu View Post
The paper is quite outdated. For one thing, it is about the GS20, which is not compatible with the current FLX and Titanium platforms.

Titanium and FLX have a different error profile from the GS20, and a much lower error rate.

My understanding is that the Titanium and FLX basecalling software is not compatible with GS20 raw images.
Old 12-18-2009, 10:09 AM   #9
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,195

Quote:
Originally Posted by avilella View Post
Has anyone got any info for the latest batch of 454 runs (~260bp)?
I don't, but thought I should mention that Titanium chemistry reads have modal lengths in the 400-500 base range.

--
Phillip
Old 12-18-2009, 10:11 AM   #10
avilella
Member
 
Location: uk

Join Date: Mar 2009
Posts: 34

I haven't seen any RNA-seq reads of 400-500 bp in the NCBI SRA, but I have seen ones that are 260 bp.
Old 12-18-2009, 10:33 AM   #11
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,195

Quote:
Originally Posted by avilella View Post
I haven't seen any RNA-seq reads of 400-500 bp in the NCBI SRA, but I have seen ones that are 260 bp.
That is interesting. I wonder why. Does the SRA have a maximum read length it allows? Maybe you have to dump Titanium reads into dbEST or dbGSS?

By the way, I can assure you that Titanium read lengths really do tend to have a peak in the 400-500 base range--if all goes well.

--
Phillip