
  • Simulating FastQ libraries for BS-Seq or normal applications using Sherman

    We have just made available a FastQ simulation script, termed Sherman, for high-throughput bisulfite (or standard genomic) sequencing datasets. It can generate single-end or paired-end data in both nucleotide-/base-space (such as from the Illumina platform) and color-space (such as from the SOLiD platform).

    Sherman was designed to assess the influence of common problems observed in many Next-Gen Sequencing libraries on the primary analysis of BS-Seq data. Thus, it allows the user to introduce various 'contaminants' into the simulated libraries, including basecall errors (following an exponential decay model), SNPs, Illumina adapter fragments and more.

    These are the main features:
    • Generate any number of sequences of any length
    • Generate either completely random sequences or use genomic sequences (genome can be specified)
    • Generate single-end or paired-end data with variable fragment sizes
    • Adjustable bisulfite conversion rate from 0-100% for either all cytosines or cytosines in CH and CG context individually
    • Generate directional or non-directional libraries
    • Generate sequences in base-space or SOLiD color-space format
    • Adjustable default Phred quality score (Sanger encoding, Phred+33 format)
    • Sequences can have constant Phred qualities throughout the read or can have quality scores following an exponential decay curve, which will eventually result in basecall errors (note that this is handled slightly differently for base- and color-space data)
    • Introduce a variable number of random SNPs into each read
    • Introduce a fixed amount of adapter sequence at the 3' end of all sequences
    • Introduce a variable amount of adapter sequence at various positions at the 3' end of reads
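
To illustrate the context-specific conversion rate listed above, here is a minimal Python sketch. It is not Sherman's actual code (Sherman is a Perl script); the function name `bisulfite_convert` and its parameters are made up for this example. It converts forward-strand cytosines to thymines with separate percentages for CG and CH context, and, as a simplifying assumption, treats a C at the very end of the sequence as CH because no following base is available.

```python
import random

def bisulfite_convert(seq, cg_rate=100.0, ch_rate=100.0, rng=None):
    """Convert Cs to Ts on the forward strand (hypothetical helper).

    cg_rate and ch_rate are conversion percentages (0-100) applied to
    cytosines in CG and CH context respectively.
    Assumption: a C in the last position has no following base here,
    so it is treated as CH.
    """
    rng = rng or random.Random()
    out = []
    for i, base in enumerate(seq):
        if base == "C":
            in_cg = i + 1 < len(seq) and seq[i + 1] == "G"
            rate = cg_rate if in_cg else ch_rate
            # random() is in [0, 1), so 0 never converts and 100 always does
            if rng.random() * 100.0 < rate:
                base = "T"
        out.append(base)
    return "".join(out)

# Leave CG cytosines untouched, convert every CH cytosine
print(bisulfite_convert("ACGTCCAG", cg_rate=0, ch_rate=100))
```

Passing a seeded `random.Random` as `rng` makes intermediate rates reproducible.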

    While adding the paired-end option, we gave Sherman a major overhaul, so it should now run much faster and use less memory. Initially, Sherman was designed to generate the kinds of library contaminations we were interested in, but if you have any ideas or suggestions which could be implemented (_easily_) we would love to hear from you.

    Sherman can be found at www.bioinformatics.bbsrc.ac.uk/projects/

  • #2
    identical qualities?

    Hi, this looks to be quite useful.

    I call it like this:

    Code:
    ./Sherman -n 100000 -l 50 -cr 0 --colorspace --error_rate 1 --genome_folder ~/data/hg19/ --quality 30
    If I do the following, I get only 1 line of output:
    Code:
    $ awk '(NR %2 == 0)' simulated_QV.qual | uniq
    i.e., there is no randomness in the quality values.
    Is this as intended?

    thanks,
    -Brent



    • #3
      Hi Brent,

      It is true that all reads have the same quality value at each position. The qualities are modeled so that, on average, there is a certain chance (1% in your case) of incorporating a sequencing error, spread over the entire sequence. Randomness comes in at the point where an error is actually introduced: this is decided randomly against the Phred score (which encodes the probability that a basecall is wrong) for each base individually.
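
As a rough illustration, here is a minimal Python sketch of the idea, not Sherman's actual Perl code. The function names are invented for this example. It uses the standard Phred relation p = 10^(-Q/10) and makes an independent random draw per base, so two reads with identical quality strings can still end up with different errors.

```python
import random

def phred_to_error_prob(q):
    """Standard Phred relation: probability that a basecall is wrong."""
    return 10 ** (-q / 10)

def introduce_errors(read, quals, rng=None):
    """Decide per base, against its Phred score, whether to miscall it.

    Hypothetical helper: a miscalled base is replaced by one of the
    three other nucleotides chosen uniformly (an assumption).
    """
    rng = rng or random.Random()
    out = []
    for base, q in zip(read, quals):
        if rng.random() < phred_to_error_prob(q):
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

# Same read, same (constant) qualities, independent random decisions
read = "ACGTACGTACGTACGTACGT"
quals = [10] * len(read)  # Phred 10 = 10% error chance per base
print(introduce_errors(read, quals))
print(introduce_errors(read, quals))
```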

      Hope this isn't too confusing.

      Best,
      Felix



      • #4
        Got it. Thanks for the explanation.



        • #5
          We have just released an updated version of Sherman (v0.1.1) which fixes an issue with the simulation of non-directional paired-end data and improves some other minor aspects.



          • #6
            We have updated Sherman (v0.1.2) so that reads which were simulated from an existing genome carry the genomic coordinates in the sequence ID. This makes it easier to determine the accuracy of different aligners.



            • #7
              We have released a new version of the bisulfite simulator Sherman (v0.1.4). This update fixes the following flaw:

              During context-specific cytosine conversion, Sherman has until now assumed that a C at the last position of a read was in CH context. This caused a weird blip in the M-bias plots (introduced into the Bismark methylation extractor as of v0.8.0) of simulated data at the end of read 1 and at the start of read 2 whenever that position was actually in CpG context. To account for this, Sherman now determines the sequence context of the last position in a read correctly.
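
A minimal Python sketch of the principle, assuming uppercase A/C/G/T genomic sequence. It is not Sherman's Perl implementation, and the function names are invented here. The context of a C is taken from the next base in the genome rather than the next base in the read, so a C in the last read position can still be recognised as CpG.

```python
def cytosine_context(genome, pos):
    """Classify the C at genome[pos] as 'CG' or 'CH' from the next genomic base."""
    if genome[pos] != "C":
        raise ValueError("position is not a cytosine")
    if pos + 1 >= len(genome):
        # Assumption: with no downstream base available, fall back to CH
        return "CH"
    return "CG" if genome[pos + 1] == "G" else "CH"

def read_contexts(genome, start, length):
    """Return (offset_in_read, context) for every C in a forward-strand read."""
    return [
        (i, cytosine_context(genome, start + i))
        for i in range(length)
        if genome[start + i] == "C"
    ]

genome = "AACGTTCGA"
# The read "AAC" ends on a C whose next genomic base is G
print(read_contexts(genome, start=0, length=3))
```

Looking only inside the read "AAC" would give no following base for the final C, which is the situation where a CH default misclassifies a CpG.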

              Sherman is available here: https://www.bioinformatics.babraham....jects/sherman/.



              • #8
                Hello,

                I'm using Sherman to generate sets of 32 bp genomic sequences for use as random control "libraries" for some transcriptome libraries our lab has made. I compare the distribution of these random "reads" in different annotated genomic categories (how many fall within genes, transposons, etc.) to that of the transcriptome libraries.

                So, a question about the --genome_folder option: How random are the sequences generated when this option is chosen? How are, for example, the different 32-mers chosen from the chromosome coordinates given?

                This is the command I use:

                ./Sherman -l 32 -n 51402229 --genome_folder /genome/ZmB73_Refgen/

                Just looking at two simulated files generated by using the identical command, I see they're not the same, but I just wanted to get a sense of how different they are.

                Thanks,

                Karl



                • #9
                  Hi Karl,
                  The starting position in the genome is determined by first concatenating all chromosomes into one long sequence and then generating a random number with the Perl rand() function. Sherman then works out which chromosome and starting position this number corresponds to, and extracts a 32 bp sequence at that position. So in essence it should be as 'random' as the Perl rand() function is. Hope this helps.
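
For anyone who wants to picture this, here is a minimal Python sketch of the approach, not the actual Perl code. All function names are invented for the example. It works on chromosome lengths only and uses cumulative offsets in place of a real concatenated string. How Sherman treats a position too close to a chromosome end is not described in this thread; redrawing, as done here, is an assumption.

```python
import bisect
import random
from itertools import accumulate

def build_offsets(chrom_lengths):
    """Cumulative end coordinates of each chromosome in the virtual concatenation."""
    names = list(chrom_lengths)
    ends = list(accumulate(chrom_lengths[n] for n in names))
    return names, ends

def locate(global_pos, names, ends):
    """Map a 0-based position in the concatenation to (chromosome, 0-based offset)."""
    idx = bisect.bisect_right(ends, global_pos)
    chrom_start = ends[idx - 1] if idx > 0 else 0
    return names[idx], global_pos - chrom_start

def random_read_start(chrom_lengths, read_len, rng=None):
    """Draw a uniformly random start so the read fits on one chromosome."""
    if read_len > max(chrom_lengths.values()):
        raise ValueError("read is longer than every chromosome")
    rng = rng or random.Random()
    names, ends = build_offsets(chrom_lengths)
    while True:
        chrom, pos = locate(rng.randrange(ends[-1]), names, ends)
        # Assumption: redraw if the read would run off the chromosome end
        if pos + read_len <= chrom_lengths[chrom]:
            return chrom, pos

lengths = {"chr1": 1000, "chr2": 400}  # toy lengths, not a real genome
print(random_read_start(lengths, read_len=32))
```

Because positions are drawn over the whole concatenation, longer chromosomes receive proportionally more reads.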



                  • #10
                    Ah, I was guessing it might be the Perl rand() function generating the coordinates, but wanted to be sure. Thanks very much!
