
  • Software for generating "synthetic" reads

    Hi everyone,

    One day of intensively searching the web for tools that simulate next-generation sequencing reads ended with the following result: there is a lot of software, but as always none of it generates exactly the data I need. What I am looking for is a tool that can
    - produce Illumina (single-end) reads from a reference sequence,
    - let the user adjust the read length and the read coverage,
    - estimate the base quality scores (this is very important), and
    - output the data in an appropriate format, such as FASTQ.

    The first tool I tested was RMAP (http://rulai.cshl.edu/rmap/), which can produce single- and paired-end reads at variable lengths and with quality scores. I tried it, but in the end it turned out that the scores are constant.
    Next I had a look at ART (http://bioinformatics.joyhz.com/ART/). At first glance it looks very promising, offering simulation of single- and paired-end reads at variable lengths along with calculated base scores. Perfect! But again there is a drawback: the reads must not exceed a maximum length of 36 bp (I need 76 bp).
    Afterwards I saw that MAQ (http://maq.sourceforge.net/maq-man.shtml) can simulate Illumina reads and also estimate base qualities, but unfortunately only for paired-end reads.
    The last tool I tried was MetaSim (http://www-ab.informatik.uni-tuebing...ftware/metasim), and guess what: it can neither estimate base scores, nor can the reads exceed a total length of 36 bp.

    So, before I start writing a new tool that fulfils my needs: does anyone out there know a program for simulating single-end Illumina reads at variable lengths?
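
    To make it concrete, here is a minimal sketch of the kind of tool I have in mind (Python, purely hypothetical: fixed read length, uniform start positions, and a constant placeholder quality, which is exactly the part I still need a proper model for):

        import random

        def simulate_reads(reference, read_len=76, coverage=10, qual_char="I"):
            """Toy single-end simulator: uniform start positions, constant quality.
            'I' is Phred 40 in Sanger (offset 33) FASTQ encoding."""
            n_reads = int(len(reference) * coverage / read_len)
            for i in range(n_reads):
                start = random.randint(0, len(reference) - read_len)
                seq = reference[start:start + read_len]
                yield f"@read_{i}\n{seq}\n+\n{qual_char * read_len}\n"

        if __name__ == "__main__":
            # example: write FASTQ for a random 10 kb "reference" to stdout
            ref = "".join(random.choice("ACGT") for _ in range(10000))
            for record in simulate_reads(ref):
                print(record, end="")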

  • #2
    Take a look at the dwgsim utility in Nils Homer's DNA Analysis package at dnaa.sourceforge.net. It does paired-end reads by default but you could just ignore the file of second reads in each pair to obtain single-end reads.

    • #3
      Looks promising, but it seems that dwgsim also does not estimate the base quality scores. They seem to be constant, or is there an option to include estimation of the quality scores that I just overlooked?

      • #4
        I'm not sure exactly what you mean by "estimate base quality scores". You don't need to estimate quality scores for simulated data as you know the true quality scores. That is, you know the true base error rate. As an example, if you simulate reads using the dwgsim option -e 0.001, all bases will have quality values of "$", which translates to a Phred score of 30, which represents a probability of error of 0.001 (as specified in the call to dwgsim).
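
        For the record, the conversion is just the Phred definition Q = -10*log10(p); a quick check in Python:

            import math

            def phred(p_error):
                """Phred quality score from a per-base error probability."""
                return -10 * math.log10(p_error)

            def error_prob(q):
                """Per-base error probability from a Phred quality score."""
                return 10 ** (-q / 10)

            print(phred(0.001))    # 30.0, the -e 0.001 case above
            print(error_prob(30))  # 0.001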

        • #5
          Okay, I think I get it, thanks a lot. The option -e x just assigns a uniform error rate x, and hence a uniform base quality (Phred) score, to each base, whereas the option -a x1-x2 assigns uniformly distributed error rates between x1 and x2 across the bases.
          Cool, now I can start my simulations. Thanks again!

          • #6
            Hi MBender,

            it seems you have already found a tool that suits your needs, but I just wanted to mention that we've written a simulator - originally for BS-Seq data, but it works just as well for normal genomic data - that might be of use to you. It currently has the following features:

            - generate any number of sequences (default: 1000000)
            - generate sequences of any length
            - generate either completely random sequences or use specified genomic sequences
            - adjustable bisulfite conversion rate from 0-100% of all Cs (default: 100)
            - generate directional or non-directional libraries
            - write sequence out in base space or ABI color space format
            - adjustable default Phred quality score (Sanger encoding) (default: 40)
            - sequences can have a constant Phred quality throughout the read (with default quality)
            - introduce a variable number of SNPs into each read. All bases will have a constant quality score throughout the read, which can be set manually (40 by default).
            - sequences can have quality scores following an exponential decay curve. The error rate follows the calculated error model, and the overall error rate can be specified by the user; for example, with a 0.1% error rate the sequences will eventually harbour 0.1% errors (SNP-like substitutions), resembling 'real' data error curves (in contrast to introducing a fixed number of SNPs per sequence).
            - introduce a fixed amount of adapter sequence at the 3' end of all sequences. Available for all error models.
            - introduce a variable amount of adapter sequence at various positions at the 3' end of reads. For this the user specifies the mean insert size of their library, e.g. 150 bp. The simulator then draws fragment sizes from a normal distribution around this mean and introduces a variable number of bp of adapter sequence into a read whenever the fragment size is smaller than the read length (see the sketch after this list). Available for all error models.
            - introduce a variable percentage of adapter sequence (full read length) as contamination. Available for all error models.
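
            To give an idea of how the variable adapter contamination works, here is a simplified Python sketch of the idea (the adapter sequence and the standard deviation are only illustrative; the tool itself is written in Perl):

                import random

                ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence only

                def add_adapter(read, mean_insert=150, sd=30):
                    """If the sampled fragment is shorter than the read, the 3' end
                    runs into adapter: keep the insert and pad with adapter bases."""
                    fragment = max(0, int(random.gauss(mean_insert, sd)))
                    if fragment >= len(read):
                        return read
                    adapter_bp = len(read) - fragment
                    padding = (ADAPTER * (adapter_bp // len(ADAPTER) + 1))[:adapter_bp]
                    return read[:fragment] + padding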

            Most of the features have been tested and verified with FastQC and by various other means; let me know if you are still interested.

            • #7
              Hi Felix,

              thanks for your reply. Your simulation software sounds very useful to me, mainly the features for introducing SNPs, quality scores and bisulfite conversion, in descending order of interest.
              First of all it would be interesting to know how you manage the introduction of SNPs. Assuming you are sequencing a population of cells from a certain tissue of a diploid species, you would expect to find heterozygous SNPs at certain positions. You said that your program is able to introduce a variable number of SNPs into each read. Does this mean that you introduce a SNP at some (reference) positions and each read covering such a position carries either variant A or variant B?
              Next you said that sequences can have quality scores following an exponential decay curve. Does this mean that, say, at the 3' end of the sequence you have a high score and the base scores decay towards the 5' end? Or is it more like a per-base error probability that follows an exponential distribution?
              Last but not least, you said that your program is able to simulate bisulfite-converted reads at a certain conversion rate. Does it just randomly convert unmethylated Cs into Ts at a certain rate, or can it introduce these conversions following a predefined distribution across the reference sequence? Even more interesting would be to introduce bisulfite conversions with some preference for one allele (defined by the SNP on this read, if there is any).

              I'm currently setting up a pipeline for investigating allele-specific epigenetic features, and therefore need a simulation program that fulfils these needs. I would be glad if you could provide some more information, and maybe the source of your tool.

              • #8
                We wanted to look at mapping efficiencies depending on the number of SNPs (with high quality). Therefore, as it stands, the simulator will introduce a user-specified number of SNPs per read, but the SNPs themselves are random. With a little bit of extra work I suppose one could also generate sequences with true SNPs, for example by reading in a VCF SNP file and storing the locations and SNP bases. Or, if you are working with a mouse strain and have already built a genome for the other strain harbouring all its SNPs, you could simply read in two genomes to start with and select one of them at random for each extracted sequence; this would certainly be much quicker. If you are not working with a known genome, one could also simulate a certain number of SNP positions. Also, the option to introduce SNPs is currently mutually exclusive with the introduction of errors (so that we can look at them separately), but the error model could easily be applied to sequences carrying SNPs as well if you were interested in such a feature.
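
                In essence, the random-SNP part boils down to something like this (simplified Python sketch of the idea; the actual implementation is in Perl):

                    import random

                    def introduce_snps(read, n_snps=2):
                        """Substitute n_snps random, distinct positions with a different base."""
                        bases = list(read)
                        for pos in random.sample(range(len(bases)), n_snps):
                            bases[pos] = random.choice([b for b in "ACGT" if b != bases[pos]])
                        return "".join(bases)

                    print(introduce_snps("ACGTACGTACGTACGT", n_snps=3))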


                The error probability follows an exponential curve: the values at the 5' end of the sequence are very close to the default Phred quality score (40, but this can be adjusted) and they drop off (more or less drastically) towards the 3' end of the reads. In effect, this means that more errors will be introduced towards the end of all sequences. If you specify an overall per-base error rate of, say, 0.5%, you will hardly ever see errors at the high-quality 5' end, but you might see a few towards the 3' end. I just ran a few simulations and then looked at the error distributions with FastQC to see if they look like real datasets. Empirically, I would say that a typical 40 bp Illumina run these days can be simulated with an error rate of ~0.1%, and a 104 bp run with more like ~1%.
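
                Roughly, the quality model does something like the following (Python sketch of the idea only; the parameters are illustrative, and presumably the curve has to be scaled so that the overall error rate matches the value the user specifies):

                    import math

                    def decay_qualities(read_len=104, q_start=40, decay=0.05):
                        """Per-base error probability grows exponentially towards the 3' end,
                        so the Phred quality drops off from q_start at the 5' end."""
                        p_start = 10 ** (-q_start / 10.0)
                        quals = []
                        for i in range(read_len):
                            p = min(0.75, p_start * math.exp(decay * i))
                            quals.append(int(round(-10 * math.log10(p))))
                        return quals

                    quals = decay_qualities()
                    print(quals[:5], quals[-5:])  # high quality at the 5' end, lower at the 3' end
                    mean_error = sum(10 ** (-q / 10.0) for q in quals) / len(quals)
                    print(mean_error)             # overall per-base error rate implied by the curve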


                The bisulfite conversion currently works either with a uniform conversion rate for all Cs (converting each position according to the specified conversion rate) or independently for CpG and non-CpG context, each of which can be specified as a float, e.g. --CG_conversion 32 --CH_conversion 99.8 (which corresponds to methylation rates of ~68% in CpG context and 0.2% in non-CpG context). This is naturally also applied to alleles carrying SNPs that involve Cs, but as it stands the tool does not apply different conversion rates to different regions of the reference genome, such as low methylation levels for CpGs within CpG islands. This could certainly be accomplished, but I guess it would require quite a bit of extra work.
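
                The core of the context-specific conversion could look roughly like this (Python sketch of the logic only; the actual tool is Perl):

                    import random

                    def bisulfite_convert(seq, cg_rate=0.32, ch_rate=0.998):
                        """Convert Cs to Ts: CpG-context Cs at cg_rate, all other Cs at ch_rate."""
                        out = list(seq)
                        for i, base in enumerate(seq):
                            if base != "C":
                                continue
                            in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
                            if random.random() < (cg_rate if in_cpg else ch_rate):
                                out[i] = "T"
                        return "".join(out)

                    print(bisulfite_convert("ACGTCCGATCGA"))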


                I could possibly implement a solution concerning the SNPs for you; however, I don't know if I could fit an implementation of specific bisulfite-conversion distributions into my current work schedule... The tool is written in Perl, so you could of course take the code and write a solution yourself. If you just send me an email to [email protected] saying which solution you would prefer at the moment, I'd be happy to share the code.

                • #9
                  Note that the simulator in DNAA also includes tools to assess the mapping error rates (dwgsim_eval) and SNP/indel calling sensitivity and specificity (dwgsim_pileup_eval.pl).

                  • #10
                    One strategy for generating error profiles is simply to copy them directly from an existing data set, assuming you can find one with the same read length.
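
                    Something along these lines (Python sketch; "existing_run.fastq" is a placeholder for whatever real data set you borrow the profile from):

                        import random
                        from collections import defaultdict

                        def per_cycle_quals(fastq_path):
                            """Collect the observed quality characters at each cycle of an existing run."""
                            profile = defaultdict(list)
                            with open(fastq_path) as fh:
                                for i, line in enumerate(fh):
                                    if i % 4 == 3:  # the quality line of each 4-line FASTQ record
                                        for cycle, q in enumerate(line.rstrip("\n")):
                                            profile[cycle].append(q)
                            return profile

                        def sample_qual_string(profile, read_len):
                            """Draw one quality character per cycle from the empirical profile."""
                            return "".join(random.choice(profile[cycle]) for cycle in range(read_len))

                        profile = per_cycle_quals("existing_run.fastq")
                        print(sample_qual_string(profile, 76))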

                    Flat or uniformly distributed error probabilities may not be sufficient, depending on the use they're put to. E.g. for SNP detection they may be fine, but for alignment or assembly perhaps not.

                    The reason is that many algorithms work by looking for fixed k-mer seeds. With errors randomly distributed throughout your read it's quite likely to get an error in the middle, but if they're clustered towards the end due to some natural decay process of the technology (e.g. Illumina), then with the same overall error rate you're still more likely to be able to match a seed of the same k-mer length.
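
                    A quick toy simulation illustrates the point (Python; the numbers are purely illustrative):

                        import random

                        def longest_clean_stretch(read_len, error_positions):
                            """Length of the longest run of bases containing no error."""
                            edges = [-1] + sorted(error_positions) + [read_len]
                            return max(b - a - 1 for a, b in zip(edges, edges[1:]))

                        def mean_clean_stretch(read_len=76, n_errors=3, clustered=False, trials=10000):
                            total = 0
                            for _ in range(trials):
                                if clustered:
                                    # errors confined to the last third of the read (decay-like)
                                    positions = random.sample(range(2 * read_len // 3, read_len), n_errors)
                                else:
                                    positions = random.sample(range(read_len), n_errors)
                                total += longest_clean_stretch(read_len, positions)
                            return total / trials

                        print("uniform errors:     ", mean_clean_stretch(clustered=False))
                        print("3'-clustered errors:", mean_clean_stretch(clustered=True))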

                    James

                    PS. That said, my current pet peeve is assemblers that just assume you can use a huge k-mer and claim you need 100-fold depth to compensate. Sure, if you go deep enough you'll sooner or later find reads with 70 bp in a row with no errors, but frankly I'd rather use an algorithm that doesn't require me to toss out most of my data and only work on the top-quality stuff. Sequencing is cheap now, but that doesn't mean we should waste it on sloppy algorithms! :P

                    • #11
                      Another issue is that you don't know how mappers will do on random DNA or very high error data. If a mapper does an exhaustive search, then it might take a long time to figure out a random read does not map. It may even try to pigeonhole the random read onto the reference. Keep that in mind when simulating reads.
