09-23-2010, 10:49 AM   #1
Senior Member
Location: Boston area

Join Date: Nov 2007
Posts: 747
Reference-Free Validation of Short Read Data

Jan Schröder (1,2,*), James Bailey (1,2), Thomas Conway (2), Justin Zobel (1,2)

(1) Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Victoria, Australia; (2) NICTA Victoria Research Laboratory, Parkville, Victoria, Australia

High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked.

We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve use of three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others.

The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.
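To make the three measures named in the abstract concrete, here is a minimal Python sketch of what each could look like on a toy set of reads. This is not the authors' implementation; the function names, the toy reads, and the choice of k = 3 are all assumptions made for illustration.

```python
from collections import Counter


def per_position_base_counts(reads):
    """Count each base call at each read position (base-call analysis)."""
    length = max(len(read) for read in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    return counts


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads (k-mer analysis)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_multiplicity_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that
    many times (distribution of k-mers)."""
    return Counter(counts.values())


# Hypothetical toy reads, chosen only to exercise the functions.
toy_reads = ["ACGTAC", "ACGTTT", "AAAAAA", "ACGAAA"]

position_counts = per_position_base_counts(toy_reads)
three_mers = kmer_counts(toy_reads, 3)
histogram = kmer_multiplicity_histogram(three_mers)

print(position_counts[0])          # base composition at the first position
print(three_mers.most_common(2))   # most frequent 3-mers
print(sorted(histogram.items()))   # (multiplicity, number of distinct 3-mers)
```

On real data, a poly-base k-mer that dominates `most_common`, or a first position skewed towards one base, would be the kind of signal the abstract describes. Judging whether such a skew is significant needs a statistical model, which this sketch does not attempt.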
krobison
09-23-2010, 05:40 PM   #2
Senior Member
Location: Boston

Join Date: Feb 2008
Posts: 693

This is an interesting paper, and it would be more interesting if the authors had stratified the plots by read quality. I guess what is happening is that the base caller makes consistent errors on weak bases. In addition, part of the bias comes from GC bias.
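For what it's worth, stratifying by read quality could be sketched roughly as below. This is only an illustration, not anything from the paper: it assumes FASTQ-style quality strings with a Phred+33 offset, non-empty reads, and an arbitrary bin width of 10, and all names are made up.

```python
from collections import Counter, defaultdict


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def base_counts_by_quality_bin(records, bin_width=10):
    """Per-position base counts, grouped by the read's mean-quality bin.

    `records` is an iterable of (sequence, quality_string) pairs.
    Returns {bin_start: {position: Counter of bases}}.
    """
    bins = defaultdict(lambda: defaultdict(Counter))
    for seq, qual in records:
        bin_start = int(mean_phred(qual) // bin_width) * bin_width
        for pos, base in enumerate(seq):
            bins[bin_start][pos][base] += 1
    return bins


# Hypothetical toy records: 'I' encodes Phred 40 and '#' encodes Phred 2.
toy_records = [
    ("ACGT", "IIII"),
    ("AGGT", "####"),
    ("ACGA", "IIII"),
]

bins = base_counts_by_quality_bin(toy_records)
for bin_start in sorted(bins):
    print(bin_start, {pos: dict(c) for pos, c in bins[bin_start].items()})
```

If the per-position skew shows up mainly in the low-quality bins, that would point at the base caller; if it persists in the high-quality bins, something upstream of base calling is the more likely source.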
lh3
