**Lluc** · 06-23-2011, 12:36 AM

wgsim models errors as uniformly distributed along the reads and therefore assigns the same base quality to all bases, which is not realistic. I haven't tried MetaSim, but I think it allows you to use empirical error models. You may want to try it:

http://ab.inf.uni-tuebingen.de/software/metasim/

**fkrueger** · 06-23-2011, 02:05 AM

Hi oiiio,

We have recently written a FASTQ simulator which has the option of generating reads with an error rate following an exponential decay model. So if you simulate an error rate of, say, 1% over the entire read, the first cycles (possibly 50-70) will have hardly any errors, but the quality then drops more sharply towards the last cycles, giving an overall error rate of 1% per base when averaged over the read.
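To make the idea concrete, here is a minimal Python sketch of a per-cycle error profile that rises exponentially along the read and averages out to a chosen overall error rate. It is not taken from the simulator described in this post: the function names, the `growth` parameter and the cap at Q40 are all assumptions chosen for illustration.

```python
import math

def exponential_error_profile(read_length, mean_error_rate, growth=0.1):
    """Per-cycle error probabilities that rise exponentially along the read.

    The curve exp(growth * i) is rescaled so that the probabilities average
    to mean_error_rate. `growth` is a hypothetical shape parameter.
    """
    raw = [math.exp(growth * i) for i in range(read_length)]
    scale = mean_error_rate * read_length / sum(raw)
    # Cap at 0.75, the error probability of a random base call.
    # If the cap is hit, the realised mean falls below the target.
    return [min(r * scale, 0.75) for r in raw]

def phred_char(p_error, offset=33, max_q=40):
    """Convert an error probability to a Sanger (Phred+33) quality character."""
    if p_error <= 0:
        return chr(max_q + offset)
    q = min(max_q, int(round(-10 * math.log10(p_error))))
    return chr(max(q, 0) + offset)

# 100 bp read with a 1% overall error rate
profile = exponential_error_profile(100, 0.01)
quality_string = "".join(phred_char(p) for p in profile)
```

With these illustrative settings the early cycles sit at the Q40 cap and the quality only falls off towards the 3' end, which is the qualitative behaviour described above.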

The simulator was originally written for BS-Seq data but it works just as well for normal genomic data. Currently it only simulates single-end reads and features the following options:

- generate any number of sequences
- generate sequences of any length
- generate either completely random sequences or use genomic sequences (can be specified)
- adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually (default: 100)
- generate directional or non-directional libraries (only relevant for BS-Seq)
- write sequence out in base space or ABI color space format
- adjustable default Phred quality score (Sanger encoding, Phred+33) (default: 40)
- sequences can have a constant Phred quality throughout the read (with default quality)
- introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
- sequences can have quality scores following an exponential decay curve. The overall error rate for sequences of varying length follows the calculated error model, and the overall error rate can be specified by the user. For example, with a 0.1% error rate the reads will eventually harbour 0.1% SNPs, resembling 'real' data error curves (cf. introducing a fixed number of SNPs per sequence).
- introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
- introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user can specify a mean insert size of their library, e.g. 150bp. The simulator then calculates a normal distribution of fragment sizes around this mean, and introduces variable bp of adapter sequence into the reads if the fragment size was smaller than the read length. Available for all error models.
- introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
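As a rough illustration of the variable adapter option in the list above, the following Python sketch draws a fragment size from a normal distribution and overwrites the 3' end of the read with adapter sequence whenever the fragment is shorter than the read. The function name, the `sd` parameter and the poly-A padding used when the adapter runs out are assumptions for this sketch, not details of the simulator itself.

```python
import random

def add_adapter(read, adapter, mean_insert, sd, rng):
    """Replace the 3' end of `read` with adapter if the fragment is short.

    `sd` (standard deviation of fragment sizes) is a hypothetical parameter;
    the post only mentions a user-specified mean insert size.
    """
    read_length = len(read)
    fragment = max(0, int(round(rng.gauss(mean_insert, sd))))
    if fragment >= read_length:
        return read  # the fragment covers the whole read: no adapter
    overhang = read_length - fragment
    # Assumption: pad with A's if the adapter is shorter than the overhang.
    tail = (adapter + "A" * overhang)[:overhang]
    return read[:fragment] + tail

rng = random.Random(42)
read = "ACGT" * 25                      # 100 bp toy read
adapter = "AGATCGGAAGAGC" * 8           # illustrative adapter sequence
contaminated = [add_adapter(read, adapter, 150, 30, rng) for _ in range(1000)]
```

With a mean insert size of 150 bp and 100 bp reads, only the reads from the short tail of the fragment size distribution pick up adapter, so most reads come back unchanged.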

One more word on the error model. As it stands, the error model is applied to all reads uniformly, which is probably not exactly what a real dataset would look like. We have therefore generated a couple of test data sets with various error levels (e.g. 0%, 0.1%, 0.2%, 0.5%, 1%, 2% and 5% errors, and thus miscalled bases) and simply concatenated the files to produce a slightly more realistic dataset.

Most of the features have been tested to be working correctly with FastQC and by various other means. Just let me know if you are interested.

**oiiio** · 06-23-2011, 12:33 PM

Thanks for the replies. Unfortunately, I really need a simulator that can do paired-end data, although yours sounds like a good tool.

I was looking at the MetaSim program, and I do not see an option in the 'new project' parameters that allows for Illumina data. Does anyone know how to enable this option?

Additionally, I found a simulator called simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) that can do Illumina data. This looks to be the program that I need, but I'm not really sure what parameters should be used for 100bp reads and a realistic error model. Any suggestions?

**mdk308** · 07-20-2012, 10:52 AM

Originally posted by fkrueger

introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).

I am interested in this program, but I am wondering: doesn't the above statement conflict with your statement about an error rate that varies by position? Or maybe you mean that all "miscalls" have the same Q score?

**fkrueger** · 07-20-2012, 10:59 AM

Our simulator either uses the error model and introduces errors according to the error probability or, alternatively, introduces a fixed number of errors per read, in which case the quality scores are kept constant. We have used the latter mode to assess the influence of 1, 2, 3 etc. errors on certain mapping conditions, as this is not easy to tell if you use an error model.
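The fixed-error mode can be sketched in a few lines of Python. This is only an illustration of the idea (exactly N mismatches at random positions, constant quality string), with hypothetical function and parameter names rather than the simulator's actual code.

```python
import random

def add_fixed_errors(seq, n_errors, rng, quality=40, offset=33):
    """Introduce exactly n_errors substitutions and return (read, qualities).

    Every base gets the same Phred quality (Sanger encoding, Phred+33),
    regardless of whether it was mutated.
    """
    if n_errors > len(seq):
        raise ValueError("more errors requested than bases in the read")
    bases = list(seq)
    # rng.sample picks distinct positions, so each error hits a different base
    for pos in rng.sample(range(len(bases)), n_errors):
        alternatives = [b for b in "ACGT" if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases), chr(quality + offset) * len(bases)

original = "ACGT" * 25
mutated, qualities = add_fixed_errors(original, 3, random.Random(1))
mismatches = sum(a != b for a, b in zip(original, mutated))
```

Because the number of mismatches per read is known exactly, mapping results can be grouped by error count, which is what makes this mode useful for the kind of assessment described above.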

Feel free to take a look here: http://www.bioinformatics.babraham.a...jects/sherman/.

Looking for the right WGS simulator