
  • Software for generating "synthetic" reads

    Hi everyone,

    One day of intensively searching the web for tools that simulate next-generation sequencing reads ended with the following result: there is a lot of software, but as always none of it generates exactly the data I need. What I am looking for is a tool that can
    - produce Illumina (single-end) reads from a reference sequence,
    - let the user adjust the read length and the read coverage,
    - estimate the base quality scores (this is very important), and
    - output the data in an appropriate format, such as FASTQ.

    The first tool I tested was RMAP (http://rulai.cshl.edu/rmap/), which can produce single- and paired-end reads at variable lengths and with quality scores. I tried it, but in the end it turned out that the scores are constant.
    Next I had a look at ART (http://bioinformatics.joyhz.com/ART/). At first glance it looks very promising, offering simulation of single- and paired-end reads at variable lengths along with calculated base scores. Perfect! But again there is a drawback: the reads must not exceed a maximum length of 36 bp (I need 76 bp).
    Afterwards I saw that MAQ (http://maq.sourceforge.net/maq-man.shtml) can simulate Illumina reads and also estimate base qualities, but unfortunately only for paired-end reads.
    The last tool I tried was MetaSim (http://www-ab.informatik.uni-tuebing...ftware/metasim), and guess what: it can neither estimate base scores, nor can the reads exceed a total length of 36 bp.

    So, before I start writing a new tool that fulfils my needs: does anyone out there know a program for simulating single-end Illumina reads at variable lengths?
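
    To make it concrete, here is a minimal sketch of the kind of tool I have in mind (Python, purely hypothetical: fixed read length, uniform start positions, and a constant placeholder quality, which is exactly the part I still need a proper model for):

        import random

        def simulate_reads(reference, read_len=76, coverage=10, qual_char="I"):
            """Toy single-end simulator: uniform start positions, constant quality.
            'I' is Phred 40 in Sanger (offset 33) FASTQ encoding."""
            n_reads = int(len(reference) * coverage / read_len)
            for i in range(n_reads):
                start = random.randint(0, len(reference) - read_len)
                seq = reference[start:start + read_len]
                yield f"@read_{i}\n{seq}\n+\n{qual_char * read_len}\n"

        if __name__ == "__main__":
            # example: write FASTQ for a random 10 kb "reference" to stdout
            ref = "".join(random.choice("ACGT") for _ in range(10000))
            for record in simulate_reads(ref):
                print(record, end="")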

  • #2
    Take a look at the dwgsim utility in Nils Homer's DNA Analysis package at dnaa.sourceforge.net. It does paired-end reads by default but you could just ignore the file of second reads in each pair to obtain single-end reads.

    • #3
      Looks promising, but it seems that dwgsim also does not estimate the base quality scores. They seem to be constant, or is there an option to include estimation of the quality scores that I just overlooked?

      • #4
        I'm not sure exactly what you mean by "estimate base quality scores". You don't need to estimate quality scores for simulated data as you know the true quality scores. That is, you know the true base error rate. As an example, if you simulate reads using the dwgsim option -e 0.001, all bases will have quality values of "$", which translates to a Phred score of 30, which represents a probability of error of 0.001 (as specified in the call to dwgsim).
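
        For the record, the conversion is just the Phred definition Q = -10*log10(p); a quick check in Python:

            import math

            def phred(p_error):
                """Phred quality score from a per-base error probability."""
                return -10 * math.log10(p_error)

            def error_prob(q):
                """Per-base error probability from a Phred quality score."""
                return 10 ** (-q / 10)

            print(phred(0.001))    # 30.0, the -e 0.001 case above
            print(error_prob(30))  # 0.001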

        • #5
          Okay, I think I get it, thanks a lot. The option -e x just assigns a uniform error rate x, and hence a uniform base quality (Phred) score, to each base, whereas the option -a x1-x2 assigns uniformly distributed error rates between x1 and x2 across the bases.
          Cool, now I can start my simulations. Thanks again!

          • #6
            Hi MBender,

            it seems you have already found a tool that suits your needs, but I just wanted to mention that we've written a simulator - originally for BS-Seq data, but it works just as well for normal genomic data - that might be of use to you. It currently has the following features:

            - generate any number of sequences (default: 1000000)
            - generate sequences of any length
            - generate either completely random sequences or use specified genomic sequences
            - adjustable bisulfite conversion rate from 0-100% of all Cs (default: 100)
            - generate directional or non-directional libraries
            - write sequence out in base space or ABI color space format
            - adjustable default Phred quality score (Sanger encoding) (default: 40)
            - sequences can have a constant Phred quality throughout the read (with default quality)
            - introduce a variable number of SNPs into each read. All bases will have a constant quality score throughout the read, which can be set manually (40 by default).
            - sequences can have quality scores following an exponential decay curve. The error rate follows the calculated error model, and the overall error rate can be specified by the user; for example, with a 0.1% error rate the sequences will eventually harbour 0.1% errors (SNP-like substitutions), resembling 'real' data error curves (in contrast to introducing a fixed number of SNPs per sequence).
            - introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
            - introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user specifies the mean insert size of their library, e.g. 150 bp. The simulator then draws fragment sizes from a normal distribution around this mean and introduces a variable number of bp of adapter sequence into a read whenever the fragment size is smaller than the read length (see the sketch after this list). Available for all error models.
            - introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
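
            To give an idea of how the variable adapter contamination works, here is a simplified Python sketch of the idea (the adapter sequence and the standard deviation are only illustrative; the tool itself is written in Perl):

                import random

                ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence only

                def add_adapter(read, mean_insert=150, sd=30):
                    """If the sampled fragment is shorter than the read, the 3' end
                    runs into adapter: keep the insert and pad with adapter bases."""
                    fragment = max(0, int(random.gauss(mean_insert, sd)))
                    if fragment >= len(read):
                        return read
                    adapter_bp = len(read) - fragment
                    padding = (ADAPTER * (adapter_bp // len(ADAPTER) + 1))[:adapter_bp]
                    return read[:fragment] + padding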

            Most of the features have been tested and verified with FastQC and by various other means; let me know if you are still interested.

            • #7
              Hi Felix,

              thanks for your reply. Your simulation software sounds very useful to me, mainly the features for introducing SNPs, quality scores and bisulfite conversion, in descending order of interest.
              First of all it would be interesting to know how you manage the introduction of SNPs. Assuming you are sequencing a population of cells from a certain tissue of a diploid species, you would expect to find heterozygous SNPs at certain positions. You said that your program is able to introduce a variable number of SNPs into each read. Does this mean that you introduce a SNP at some (reference) positions and each read covering such a position carries either variant A or variant B?
              Next you said that sequences can have quality scores following an exponential decay curve. Does this mean that, say, at the 3' end of the sequence you have a high score and the base scores decay towards the 5' end? Or is it more like a per-base error probability that follows an exponential distribution?
              Last but not least, you said that your program is able to simulate bisulfite-converted reads at a certain conversion rate. Does it just randomly convert unmethylated Cs into Ts at a certain rate, or can it introduce these conversions following a predefined distribution across the reference sequence? Even more interesting would be to introduce bisulfite conversions with some preference for one allele (defined by the SNP on this read, if there is any).

              I'm currently setting up a pipeline for investigating allele-specific epigenetic features, and therefore need a simulation program that fulfils these needs. I would be glad if you could provide some more information, and maybe the source of your tool.

              • #8
                We wanted to look at mapping efficiencies depending on the number of SNPs (with high quality). Therefore, as it stands, the simulator will introduce a user-specified number of SNPs per read, but the SNPs themselves are random. With a little bit of extra work I suppose one could also generate sequences with true SNPs, for example by reading in a VCF SNP file and storing the locations and SNP bases. Or, if you are working with a mouse strain and have already built a genome for the other strain harbouring all its SNPs, you could simply read in two genomes to start with and select one of them at random for each extracted sequence; this would certainly be much quicker. If you are not working with a known genome, one could also simulate a certain number of SNP positions. Also, the option to introduce SNPs is currently mutually exclusive with the introduction of errors (so that we can look at them separately), but the error model could easily be applied to sequences carrying SNPs as well if you were interested in such a feature.
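
                In essence, the random-SNP part boils down to something like this (simplified Python sketch of the idea; the actual implementation is in Perl):

                    import random

                    def introduce_snps(read, n_snps=2):
                        """Substitute n_snps random, distinct positions with a different base."""
                        bases = list(read)
                        for pos in random.sample(range(len(bases)), n_snps):
                            bases[pos] = random.choice([b for b in "ACGT" if b != bases[pos]])
                        return "".join(bases)

                    print(introduce_snps("ACGTACGTACGTACGT", n_snps=3))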


                The error probability follows an exponential curve: the values at the 5' end of the sequence are very close to the default Phred quality score (40, but this can be adjusted) and they drop off (more or less drastically) towards the 3' end of the reads. In effect, this means that more errors will be introduced towards the end of all sequences. If you specify an overall per-base error rate of, say, 0.5%, you will hardly ever see errors at the high-quality 5' end, but you might see a few towards the 3' end. I just ran a few simulations and then looked at the error distributions with FastQC to see if they look like real datasets. Empirically, I would say that a typical 40 bp Illumina run these days can be simulated with an error rate of ~0.1%, and a 104 bp run with more like ~1%.
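
                Roughly, the quality model does something like the following (Python sketch of the idea only; the parameters are illustrative, and presumably the curve has to be scaled so that the overall error rate matches the value the user specifies):

                    import math

                    def decay_qualities(read_len=104, q_start=40, decay=0.05):
                        """Per-base error probability grows exponentially towards the 3' end,
                        so the Phred quality drops off from q_start at the 5' end."""
                        p_start = 10 ** (-q_start / 10.0)
                        quals = []
                        for i in range(read_len):
                            p = min(0.75, p_start * math.exp(decay * i))
                            quals.append(int(round(-10 * math.log10(p))))
                        return quals

                    quals = decay_qualities()
                    print(quals[:5], quals[-5:])  # high quality at the 5' end, lower at the 3' end
                    mean_error = sum(10 ** (-q / 10.0) for q in quals) / len(quals)
                    print(mean_error)             # overall per-base error rate implied by the curve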


                The bisulfite conversion currently works either with a uniform conversion rate for all Cs (converting each position according to the specified conversion rate) or independently for CpG and non-CpG context, each of which can be specified as a float, e.g. --CG_conversion 32 --CH_conversion 99.8 (which corresponds to methylation rates of ~68% in CpG context and 0.2% in non-CpG context). This is naturally also applied to alleles carrying SNPs that involve Cs, but as it stands the tool does not apply different conversion rates to different regions of the reference genome, such as low methylation levels for CpGs within CpG islands. This could certainly be accomplished, but I guess it would require quite a bit of extra work.
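
                The core of the context-specific conversion could look roughly like this (Python sketch of the logic only; the actual tool is Perl):

                    import random

                    def bisulfite_convert(seq, cg_rate=0.32, ch_rate=0.998):
                        """Convert Cs to Ts: CpG-context Cs at cg_rate, all other Cs at ch_rate."""
                        out = list(seq)
                        for i, base in enumerate(seq):
                            if base != "C":
                                continue
                            in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
                            if random.random() < (cg_rate if in_cpg else ch_rate):
                                out[i] = "T"
                        return "".join(out)

                    print(bisulfite_convert("ACGTCCGATCGA"))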


                I could possibly implement a solution concerning the SNPs for you; however, I don't know if I could fit an implementation of specific bisulfite-conversion distributions into my current work schedule... The tool is written in Perl, so you could of course take the code and write a solution yourself. If you just send me an email to [email protected] saying which solution you would prefer at the moment, I'd be happy to share the code.

                • #9
                  Note that the simulator in DNAA also includes tools to assess the mapping error rates (dwgsim_eval) and SNP/indel calling sensitivity and specificity (dwgsim_pileup_eval.pl).

                  • #10
                    One strategy for generating error profiles is simply to copy them directly from an existing data set, assuming you can find one with the same read length.
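
                    Something along these lines (Python sketch; "existing_run.fastq" is a placeholder for whatever real data set you borrow the profile from):

                        import random
                        from collections import defaultdict

                        def per_cycle_quals(fastq_path):
                            """Collect the observed quality characters at each cycle of an existing run."""
                            profile = defaultdict(list)
                            with open(fastq_path) as fh:
                                for i, line in enumerate(fh):
                                    if i % 4 == 3:  # the quality line of each 4-line FASTQ record
                                        for cycle, q in enumerate(line.rstrip("\n")):
                                            profile[cycle].append(q)
                            return profile

                        def sample_qual_string(profile, read_len):
                            """Draw one quality character per cycle from the empirical profile."""
                            return "".join(random.choice(profile[cycle]) for cycle in range(read_len))

                        profile = per_cycle_quals("existing_run.fastq")
                        print(sample_qual_string(profile, 76))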

                    Flat or uniformly distributed error probabilities may not be sufficient, depending on the use they're put to. E.g. for SNP detection they may be fine, but for alignment or assembly perhaps not.

                    The reason is that many algorithms work by looking for fixed k-mer seeds. With errors randomly distributed throughout your read it's quite likely to get an error in the middle, but if they're clustered towards the end due to some natural decay process of the technology (e.g. Illumina), then with the same overall error rate you're still more likely to be able to match a seed of the same k-mer length.
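
                    A quick toy simulation illustrates the point (Python; the numbers are purely illustrative):

                        import random

                        def longest_clean_stretch(read_len, error_positions):
                            """Length of the longest run of bases containing no error."""
                            edges = [-1] + sorted(error_positions) + [read_len]
                            return max(b - a - 1 for a, b in zip(edges, edges[1:]))

                        def mean_clean_stretch(read_len=76, n_errors=3, clustered=False, trials=10000):
                            total = 0
                            for _ in range(trials):
                                if clustered:
                                    # errors confined to the last third of the read (decay-like)
                                    positions = random.sample(range(2 * read_len // 3, read_len), n_errors)
                                else:
                                    positions = random.sample(range(read_len), n_errors)
                                total += longest_clean_stretch(read_len, positions)
                            return total / trials

                        print("uniform errors:     ", mean_clean_stretch(clustered=False))
                        print("3'-clustered errors:", mean_clean_stretch(clustered=True))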

                    James

                    PS. That said, my current pet peeve is assemblers that just assume you can use a huge k-mer and claim you need 100-fold depth to compensate. Sure, if you go deep enough you'll sooner or later find reads with 70 bp in a row with no errors, but frankly I'd rather use an algorithm that doesn't require me to toss out most of my data and only work on the top-quality stuff. Sequencing is cheap now, but that doesn't mean we should waste it on sloppy algorithms! :P

                    • #11
                      Another issue is that you don't know how mappers will do on random DNA or very high error data. If a mapper does an exhaustive search, then it might take a long time to figure out a random read does not map. It may even try to pigeonhole the random read onto the reference. Keep that in mind when simulating reads.
