Syndicated from PubMed RSS Feeds
Probabilistic base calling of Solexa sequencing data.
BMC Bioinformatics. 2008 Oct 13;9(1):431
Authors: Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F
ABSTRACT:
BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology.
RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.
CONCLUSIONS: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package (Rolexa) is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
PMID: 18851737 [PubMed - as supplied by publisher]
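
The two ideas in the RESULTS paragraph (coding ambiguous bases with IUPAC symbols, and trimming uncertain read ends using an information-based score) can be illustrated with a minimal Python sketch. This is not the Rolexa implementation and does not reproduce the paper's model-based clustering: it assumes per-base probabilities for A, C, G and T are already available, and the function names, the 0.9 probability threshold and the 0.5-bit entropy cutoff are all hypothetical choices made for illustration.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(probs, threshold=0.9):
    """Return the IUPAC symbol for the smallest set of most probable
    bases whose combined probability reaches `threshold`.

    `probs` maps each of "A", "C", "G", "T" to a probability.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total >= threshold:
            break
    return IUPAC[frozenset(chosen)]


def entropy_bits(probs):
    """Shannon entropy (in bits) of one position's base distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def best_prefix_length(read_probs, max_mean_entropy=0.5):
    """Length of the longest prefix whose mean per-base entropy stays at
    or below `max_mean_entropy` (a hypothetical cutoff).

    Low entropy means an informative, confidently called position, so
    this drops the uncertain bases that accumulate towards the read end.
    """
    best, running = 0, 0.0
    for i, probs in enumerate(read_probs, start=1):
        running += entropy_bits(probs)
        if running / i <= max_mean_entropy:
            best = i
    return best


def call_read(read_probs, threshold=0.9, max_mean_entropy=0.5):
    """Trim a read to its best prefix, then call each kept position."""
    keep = best_prefix_length(read_probs, max_mean_entropy)
    return "".join(call_base(p, threshold) for p in read_probs[:keep])
```

With these assumed cutoffs, a position where A and G share most of the probability mass is reported as the two-base code R instead of being forced to a single base, and a read whose last cycles are close to uniform is shortened rather than discarded.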