SEQanswers

Old 07-15-2011, 01:38 AM   #1
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Simulating FastQ libraries for BS-Seq or normal applications using Sherman

We have just made available a FastQ simulation script, termed Sherman, for high-throughput bisulfite (or standard genomic) sequencing datasets. It can generate single-end or paired-end data in both nucleotide-/base-space (such as from the Illumina platform) and color-space (such as from the SOLiD platform).

Sherman was designed to assess the influence of common problems observed in many Next-Gen Sequencing libraries on the primary analysis of BS-Seq data. Thus, it allows the user to introduce various 'contaminants' into the simulated libraries, including basecall errors (following an exponential decay model), SNPs, Illumina adapter fragments and more.

These are the main features:
Generate any number of sequences of any length
Generate either completely random sequences or use genomic sequences (genome can be specified)
Generate single-end or paired-end data with variable fragment sizes
Adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually
Generate directional or non-directional libraries
Generate sequences in base-space or SOLiD color-space format
Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
Sequences can have constant Phred qualities throughout the read or can have quality scores following an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base-space and color-space data)
Introduce a variable number of random SNPs into each read
Introduce a fixed amount of adapter sequence at the 3' end of all sequences
Introduce a variable amount of adapter sequence at various positions at the 3' end of reads
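
To illustrate the adjustable bisulfite conversion rate in the list above, here is a minimal Python sketch of the simplest case, where one rate applies to all cytosines. It is an illustration under stated assumptions, not Sherman's actual code: the function name `bisulfite_convert` is hypothetical, only the original top strand is handled, and each C is assumed to convert independently.

```python
import random

def bisulfite_convert(seq, conversion_rate, rng):
    """Convert each C to T independently with probability conversion_rate / 100.

    seq             : read sequence in upper case (top strand only, an assumption)
    conversion_rate : percentage between 0 and 100
    rng             : a random.Random instance, passed in for reproducibility
    """
    p = conversion_rate / 100.0
    return "".join(
        "T" if base == "C" and rng.random() < p else base
        for base in seq
    )

rng = random.Random(0)
print(bisulfite_convert("ACGTCCGA", 100, rng))  # ATGTTTGA (every C converted)
print(bisulfite_convert("ACGTCCGA", 0, rng))    # ACGTCCGA (no C converted)
```

At intermediate rates the output differs from run to run unless the generator is seeded.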

While adding the paired-end option, we gave Sherman a major overhaul, so it should now run much faster and use less memory. Initially, Sherman was designed to generate the kinds of library contamination we were interested in, but if you have any ideas or suggestions that could be implemented (_easily_), we would love to hear from you.

Sherman can be found at www.bioinformatics.bbsrc.ac.uk/projects/
Old 09-29-2011, 08:30 AM   #2
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
identical qualities?

Hi, this looks to be quite useful.

I call it like this:

Code:
./Sherman -n 100000 -l 50 -cr 0 --colorspace --error_rate 1 --genome_folder ~/data/hg19/ --quality 30
If I do the following, I get only 1 line of output:
Code:
$ awk '(NR %2 == 0)' simulated_QV.qual | uniq
i.e., there is no randomness in the quality values.
Is this as intended?

thanks,
-Brent
Old 09-29-2011, 11:12 AM   #3
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

Hi Brent,

It is true that all reads have the same quality values at each position. The qualities are modeled so that, on average, there is a certain chance (1% in your case) of incorporating a sequencing error, spread over the entire sequence. Randomness comes in at the point where an error is actually introduced: for each bp individually, this is decided randomly against the error probability encoded by the Phred score (= the probability that a basecall is wrong).
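
The idea can be sketched in a few lines of Python. This is an illustration only, not Sherman's code: the function names are hypothetical, it covers base-space reads only, and it assumes that an erroneous basecall is replaced by one of the three other bases chosen uniformly. The Phred relation itself, P(error) = 10^(-Q/10), is standard.

```python
import random

def phred_to_error_prob(q):
    """Probability that a basecall with Phred score q is wrong."""
    return 10 ** (-q / 10)

def add_basecall_errors(seq, quals, rng):
    """Decide for each base individually whether to introduce an error.

    seq   : read sequence over A, C, G, T
    quals : one Phred score per base (identical for every read in this model)
    rng   : a random.Random instance
    """
    out = []
    for base, q in zip(seq, quals):
        if rng.random() < phred_to_error_prob(q):
            # Assumption: substitute with a different base, chosen uniformly.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

# Every read shares the same per-position qualities, yet the errors differ
# from read to read because the random draw is made per base.
quals = [30, 30, 20, 20, 10, 10, 5, 5]
rng = random.Random()
for _ in range(3):
    print(add_basecall_errors("ACGTACGT", quals, rng))
```

So the quality strings in the output are identical across reads, while the positions that actually carry an error are not.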

Hope this isn't too confusing.

Best,
Felix
Old 09-29-2011, 11:19 AM   #4
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72

Got it. Thanks for the explanation.
Old 01-09-2012, 07:49 AM   #5
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

We have just released an updated version of Sherman (v0.1.1) which fixes an issue with the simulation of non-directional paired-end data and improves some other minor aspects.
Old 09-07-2012, 02:22 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

We have updated Sherman (v0.1.2) so that reads which were simulated from an existing genome carry the genomic coordinates in the sequence ID. This makes it easier to determine the accuracy of different aligners.
Old 07-12-2013, 02:11 PM   #7
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

We have released a new version of the bisulfite simulator Sherman (v0.1.4). This update fixes the following flaw:

During context-specific cytosine conversion, Sherman had until now assumed that a C at the last position of a read was in CH context. This caused a weird blip in the M-bias plots (introduced into the Bismark methylation extractor as of v0.8.0) of simulated data at the end of read 1 and at the start of read 2 whenever that C was actually in CpG context. To account for this, Sherman now determines the sequence context of the last position in a read correctly.
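
A minimal Python sketch of the general idea: to classify a C at the last position of a read, look at the genomic base that follows the read rather than stopping at the read boundary. This is not Sherman's code. The function name is hypothetical, only the forward strand is considered, and treating a C at the very end of a chromosome as CH is an assumption made for this example.

```python
def cytosine_contexts(genome, start, length):
    """Return 'CG', 'CH', or None (not a C) for each position of a read.

    genome : chromosome sequence in upper case
    start  : 0-based start of the read on the forward strand
    length : read length
    """
    contexts = []
    for pos in range(start, start + length):
        if genome[pos] != "C":
            contexts.append(None)
            continue
        # Look one base ahead in the genome, even past the end of the read.
        nxt = genome[pos + 1] if pos + 1 < len(genome) else ""
        # Assumption: with no following base (chromosome end), call it CH.
        contexts.append("CG" if nxt == "G" else "CH")
    return contexts

# The read "TTAC" ends in a C that is followed by G in the genome, so the
# last position is in CG context, not CH.
print(cytosine_contexts("TTACGA", 0, 4))  # [None, None, None, 'CG']
```

Looking only within the read would see no base after the final C, which is the situation in which a simulator has to fall back on a default context.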

Sherman is available here: http://www.bioinformatics.babraham.a...jects/sherman/.
Old 07-22-2013, 10:29 PM   #8
kerhard
Member
 
Location: Oakland

Join Date: Feb 2011
Posts: 27

Hello,

I'm using Sherman to generate sets of 32 bp genomic sequences for use as random control "libraries" to some transcriptome libraries our lab has made. I compare the distribution of these random "reads" in different annotated genomic categories (how many fall within genes, transposons, etc.) to that of the transcriptome libraries.

So, a question about the --genome_folder option: How random are the sequences generated when this option is chosen? How are, for example, the different 32-mers chosen from the chromosome coordinates given?

This is the command I use:

./Sherman -l 32 -n 51402229 --genome_folder /genome/ZmB73_Refgen/

Just looking at two simulated files generated by using the identical command, I see they're not the same, but I just wanted to get a sense of how different they are.

Thanks,

Karl
Old 07-23-2013, 12:35 AM   #9
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620

Hi Karl,
The starting position in the genome is determined by first concatenating all chromosomes into one big long sequence, and then generating random numbers using the Perl rand() function. Sherman then determines which chromosome and starting position each number corresponds to, and extracts a 32 bp sequence at this position. So in essence it should be as 'random' as the Perl rand() function is. Hope this helps.
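
The mapping from one random number to a chromosome and position can be sketched in Python as follows. This is an illustration of the approach described above, not Sherman's Perl code: all function names are hypothetical, Python's random module stands in for Perl's rand(), and redrawing positions where the read would run past the end of a chromosome is an assumption.

```python
import bisect
import random

def build_offsets(chrom_lengths):
    """Start offset of each chromosome within the concatenated sequence."""
    names, starts, total = [], [], 0
    for name, length in chrom_lengths.items():
        names.append(name)
        starts.append(total)
        total += length
    return names, starts, total

def offset_to_coordinate(offset, names, starts):
    """Map an offset in the concatenated sequence to (chromosome, position)."""
    idx = bisect.bisect_right(starts, offset) - 1
    return names[idx], offset - starts[idx]

def random_read_position(chrom_lengths, read_length, rng):
    """Draw a uniformly random 0-based start for a read of read_length bp."""
    if not any(n >= read_length for n in chrom_lengths.values()):
        raise ValueError("read_length exceeds every chromosome")
    names, starts, total = build_offsets(chrom_lengths)
    while True:
        chrom, pos = offset_to_coordinate(rng.randrange(total), names, starts)
        # Assumption: redraw if the read would cross a chromosome boundary.
        if pos + read_length <= chrom_lengths[chrom]:
            return chrom, pos

# Toy chromosome lengths, invented for the example.
lengths = {"chr1": 1000, "chr2": 400}
rng = random.Random()
print(random_read_position(lengths, 32, rng))
```

One consequence of drawing over the concatenated sequence is that each chromosome receives reads roughly in proportion to its length.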
Old 07-23-2013, 09:18 AM   #10
kerhard
Member
 
Location: Oakland

Join Date: Feb 2011
Posts: 27

Ah, I was guessing it might be the Perl rand() function generating the coordinates, but wanted to be sure. Thanks very much!