SEQanswers

Old 06-22-2011, 09:25 PM   #1
oiiio
Senior Member
 
Location: USA

Join Date: Jan 2011
Posts: 105
Looking for the right WGS simulator

I want to simulate WGS with 100 bp reads and apply an error model that most accurately reflects the reality of sequencing on an Illumina HiSeq. Does anyone have any suggestions about which programs (and parameters) could do this?
Old 06-23-2011, 12:36 AM   #2
Lluc
Member
 
Location: Barcelona

Join Date: Aug 2010
Posts: 12

wgsim models errors as uniformly distributed along the reads, and therefore assigns the same base quality to all bases, which is not realistic. I haven't tried MetaSim, but I think it allows you to use empirical error models. You may want to try it:

http://ab.inf.uni-tuebingen.de/software/metasim/
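
To illustrate why a uniform error rate gives every base the same quality, here is a minimal Python sketch of the standard Phred relationship Q = -10 * log10(p) with Sanger (Phred+33) encoding. The function names are made up for illustration and are not part of wgsim or MetaSim.

```python
import math

def phred_from_error(p: float) -> int:
    """Phred quality Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))

def quality_string(error_probs, offset: int = 33) -> str:
    """Encode per-base error probabilities as a Phred+33 quality string."""
    return "".join(chr(phred_from_error(p) + offset) for p in error_probs)

# A uniform 1% error rate maps every base to Q20, so the quality
# string is one repeated character.
uniform = quality_string([0.01] * 10)
print(uniform)
```

Any simulator that uses a single per-base error probability will produce flat quality strings like this one, which is unlike the 3' quality drop seen in real Illumina reads.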
Old 06-23-2011, 02:05 AM   #3
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624

Hi oiiio,

We have recently written a fastq simulator which has the option of generating reads with an error rate following an exponential decay model. So if you simulate an error rate of, say, 1% over the entire read, the first cycles (possibly 50-70) will have hardly any errors, but the quality will then drop more sharply towards the last cycles, resulting in an overall error rate of 1% per base.
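
A rough Python sketch of what such a model could look like follows. This is an illustration under stated assumptions, not the simulator's actual code: the per-cycle error probability is assumed to grow as exp(steepness * cycle) and is rescaled so that the mean over the read equals the requested overall error rate. The `steepness` parameter, the Q40 cap and all function names are hypothetical.

```python
import math
import random

def exponential_error_profile(read_length, mean_error, steepness=0.1):
    """Per-cycle error probabilities that rise exponentially towards the 3' end.

    The curve is rescaled so its mean equals mean_error. The steepness
    value is an assumed shape parameter, not one taken from the simulator.
    """
    raw = [math.exp(steepness * i) for i in range(read_length)]
    scale = mean_error * read_length / sum(raw)
    return [r * scale for r in raw]

def phred33(p, max_q=40):
    """Phred+33 character for error probability p, capped at max_q."""
    q = min(max_q, round(-10 * math.log10(p)))
    return chr(q + 33)

def simulate_read(seq, profile, rng):
    """Miscall each base with its cycle's error probability."""
    bases = []
    for base, p in zip(seq, profile):
        if rng.random() < p:
            # Replace the base with one of the other three nucleotides.
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    quals = "".join(phred33(p) for p in profile)
    return "".join(bases), quals

profile = exponential_error_profile(100, 0.01)
read, quals = simulate_read("ACGT" * 25, profile, random.Random(0))
print(read)
print(quals)
```

With these assumed settings the first cycles sit at the Q40 cap and the last cycle is close to Q10, while the average error probability across the read stays at 1%.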

The simulator was originally written for BS-Seq data but it works just as well for normal genomic data. Currently it only simulates single-end reads and features the following options:

- generate any number of sequences
- generate sequences of any length
- generate either completely random sequences or use genomic sequences (can be specified)
- adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually (default: 100)
- generate directional or non-directional libraries (only relevant for BS-Seq)
- write sequence out in base space or ABI color space format
- adjustable default Phred quality score (Sanger encoding, Phred+33) (default: 40)
- sequences can have a constant Phred quality throughout the read (with default quality)
- introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
- sequences can have quality scores following an exponential decay curve. The overall error rate for sequences of varying length follows the calculated error model, and the overall error rate can be specified by the user. For example, a 0.1% error rate will eventually harbour 0.1% SNPs resembling 'real' data error curves (cf. introducing a fixed number of SNPs per sequence).
- introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
- introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user can specify a mean insert size of their library, e.g. 150bp. The simulator then calculates a normal distribution of fragment sizes around this mean, and introduces variable bp of adapter sequence into the reads if the fragment size was smaller than the read length. Available for all error models.
- introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
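
The variable adapter option in the list above can be sketched roughly as follows. This is a simplified illustration, not the simulator's implementation: fragment sizes are drawn from a normal distribution around the mean insert size, and any read longer than its fragment runs into adapter sequence. The standard deviation, the "N" padding when the adapter runs out, the example adapter string and the function names are all assumptions.

```python
import random

def sample_fragment_length(mean_size, sd, rng):
    """Draw a fragment length from a normal distribution (sd is an assumed parameter)."""
    return max(0, int(round(rng.gauss(mean_size, sd))))

def add_adapter(read, fragment_length, adapter):
    """Overwrite the 3' end of a read with adapter if the fragment is shorter than the read."""
    read_length = len(read)
    if fragment_length >= read_length:
        return read  # the fragment covers the whole read, no adapter read-through
    padded = read[:fragment_length] + adapter
    # Assumption: pad with N if the adapter is shorter than the remaining cycles.
    return (padded + "N" * read_length)[:read_length]

rng = random.Random(1)
adapter = "AGATCGGAAGAGC"  # example adapter string, used here only for illustration
reads = [add_adapter("ACGT" * 25, sample_fragment_length(150, 30, rng), adapter)
         for _ in range(5)]
for r in reads:
    print(r)
```

Reads from fragments of at least the read length come out unchanged, so with a mean insert size well above the read length only a minority of reads carry adapter.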


One more word on the error model. As it stands, the error model is applied to all reads uniformly, which is probably not exactly what a real dataset would look like. We have therefore generated a couple of different test data sets with various error levels (e.g. 0%, 0.1%, 0.2%, 0.5%, 1%, 2% and 5% errors, and thus miscalled bases) and simply concatenated the files to produce a slightly more realistic dataset.

Most of the features have been tested to be working correctly with FastQC and by various other means. Just let me know if you are interested.
Old 06-23-2011, 12:33 PM   #4
oiiio
Senior Member
 
Location: USA

Join Date: Jan 2011
Posts: 105

Thanks for the replies. Unfortunately, I really need a simulator that can do paired-end data, although yours sounds like a good tool.

I was looking at the MetaSim program, and I do not see an option in the 'new project' parameters that allows for Illumina data. Does anyone know how to enable this option?

Additionally, I found a simulator called simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) that can do Illumina data. This looks to be the program that I need, but I'm not really sure what parameters should be used for 100bp reads and a realistic error model. Any suggestions?
Old 07-20-2012, 10:52 AM   #5
mdk308
Junior Member
 
Location: Baltimore, MD

Join Date: Jul 2012
Posts: 8

Quote:
Originally Posted by fkrueger
introduce a variable number of SNPs into each read. All bp will have a constant quality score throughout the read which can be set manually (and is 40 by default).
I am interested in this program, but I am wondering: doesn't the above statement conflict with your statement about an error rate that varies by position? Or maybe you mean that all "miscalls" have the same Q score?
Old 07-20-2012, 10:59 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624

Our simulator offers two separate modes. It can use the error model and introduce errors according to the error probability at each position. Alternatively, you can introduce a fixed number of errors per read, in which case the quality scores are kept constant. We have used this to assess the influence of 1, 2, 3 etc. errors on certain mapping conditions, as this is not easy to tell if you use an error model.
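
As a rough illustration of the fixed-error mode (a sketch with hypothetical names, not the simulator's own code), the following Python snippet places exactly `n_errors` mismatches at random positions and reports a constant Phred+33 quality for the whole read:

```python
import random

def introduce_fixed_errors(seq, n_errors, rng, quality=40):
    """Return (mutated sequence, constant quality string) with exactly n_errors mismatches."""
    # Sample distinct positions so the read ends up with exactly n_errors mismatches.
    positions = rng.sample(range(len(seq)), n_errors)
    bases = list(seq)
    for pos in positions:
        # Substitute with a different nucleotide so every chosen position is a real mismatch.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases), chr(quality + 33) * len(seq)

original = "ACGT" * 25
mutated, quals = introduce_fixed_errors(original, 3, random.Random(42))
mismatches = sum(a != b for a, b in zip(original, mutated))
print(mutated)
print(quals)
print(mismatches)
```

Because the number of mismatches is known exactly for every read, mapping results can be broken down by error count, which is hard to do when errors are drawn from a probability model.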

Feel free to take a look here: http://www.bioinformatics.babraham.a...jects/sherman/.
All times are GMT -8.

