  • The problem of wgsim simulator

    Dear all,

    I used wgsim to simulate Illumina reads and got read1 and read2 FASTQ files.

    However, the quality score of every base is "2":

    222222222222222222222222222222222222222222222222222222222222222222222222222
    @gi|49175990|ref|NC_000913.2|_26305_26624_0:2:0_2:0:0_b/1
    GTTTTTGTGCCGGTGTAGACCGCGCTATCAGCATTGTTGAAAACGCGCTGGCCATTTGCGGCGCACCGATATATG
    +
    222222222222222222222222222222222222222222222222222222222222222222222222222
    @gi|49175990|ref|NC_000913.2|_967853_968177_1:7:3_2:3:0_c/1
    TGACGATTACCGCATAAACCGACTTTAAGCACCCCGCTCGCTAACGCATACGCCCCGCCGGCAACCACCAGCCAT


    Is something wrong with my command?
    ~/Simulation/samtools-wgsim$ wgsim -d 350 -s 30 -N 70000 -1 75 -2 75 /Simulation out.read1.fq out.read2.fq

    Or is this normal?

  • #2
    Hi,

    I ran into a similar problem a few months back. My memory is a bit fuzzy, but if I recall correctly, wgsim only simulates the read sequences; it doesn't do anything to simulate quality scores, so every base gets the same placeholder quality character.
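
    For what it's worth, the constant "2" can be decoded. A minimal Python sketch, assuming the files use Phred+33 encoding, and assuming (this is my reading, not something confirmed in this thread) that the constant reflects wgsim's default base error rate of 0.02 (the -e option). The function names are made up for illustration.

```python
import math

def phred33_to_error(ch: str) -> tuple[int, float]:
    """Return (Phred score, error probability) for one Phred+33 quality character."""
    q = ord(ch) - 33
    return q, 10 ** (-q / 10)

def error_to_phred33(e: float) -> str:
    """Round an error probability to the nearest Phred score and encode it as Phred+33."""
    q = round(-10 * math.log10(e))
    return chr(q + 33)

q, p = phred33_to_error("2")
print(q, round(p, 4))          # prints: 17 0.02
print(error_to_phred33(0.02))  # prints: 2
```

    So a line of all "2" just says "every base has roughly a 2% error probability", which carries no per-base information.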

    To get simulated quality scores as well, I switched to maq simulate. If you give it a sequence file from some real NGS data (e.g. a paired-end run), it will create synthetic quality scores based on the quality scores in that data. I think it uses a Markov process to generate the quality lines.
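
    To illustrate what "a Markov process" could mean here, a rough Python sketch of a first-order chain over quality characters. This is only my guess at the general idea, not MAQ's actual model; train_markov and simulate_quality are hypothetical names, and the input is assumed to be a list of quality strings taken from real reads.

```python
import random
from collections import Counter, defaultdict

def train_markov(quality_lines):
    """Count first characters and char -> next-char transitions in real quality lines."""
    start = Counter()
    trans = defaultdict(Counter)
    for line in quality_lines:
        if not line:
            continue
        start[line[0]] += 1
        for a, b in zip(line, line[1:]):
            trans[a][b] += 1
    return start, trans

def _draw(counter, rng):
    """Draw one key from a Counter with probability proportional to its count."""
    chars = list(counter)
    weights = [counter[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]

def simulate_quality(start, trans, length, rng=None):
    """Generate one synthetic quality line of the requested length."""
    rng = rng or random.Random()
    out = [_draw(start, rng)]
    while len(out) < length:
        nxt = trans.get(out[-1])
        # Fall back to the start distribution if this character never had a successor.
        out.append(_draw(nxt if nxt else start, rng))
    return "".join(out)

start, trans = train_markov(["IIIH", "IIHH"])
print(simulate_quality(start, trans, 10, random.Random(1)))
```

    Each new quality value depends only on the previous one, which is enough to reproduce runs of good or bad bases that a constant quality line cannot show.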

    Hope this helps!


    • #3
      Originally posted by MBekritsky
      Hi,

      I ran into a similar problem a few months back. My memory is a bit fuzzy, but if I recall correctly, wgsim only simulates read data, it doesn't do anything to simulate quality scores.

      In order to get simulated quality scores as well, I switched to maq simulate. If you give it a sequence file from some NGS data (e.g. a run of paired-end sequence), it will create synthetic quality scores based on the quality scores from your NGS data. I think it uses a Markov process to generate the quality lines.

      Hope this helps!
      Thank you very much. This information is very helpful to me.

      I tried using Maq to simulate E. coli K12 Illumina sequencing reads, and I noticed that I need a simupars.dat file to simulate the data. According to the help manual, I can get simupars.dat by running "simutrain" or from the Maq website, but I didn't find any such file on the Maq download page. On the other hand, I don't have real E. coli K12 Illumina data from which to generate simupars.dat.

      Can anyone help me figure this out? Thanks in advance!


      • #4
        Hi again,

        In my experience, the data file doesn't need to be a run from the same species, since all simutrain does (I think) is calculate quality score frequencies by read position. My suggestion is to find some recent sequencing data from the same machine, with the same read length as the samples you'll be submitting, and use that for simutrain. That way, if there are any quirks or quality score phenomena intrinsic to the machine you'll be using, you may be able to catch them at the simulation stage.
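
        To make "quality score frequencies by read position" concrete, here is a small Python sketch. It is not simutrain's code or file format, only an illustration of the idea; the function names are hypothetical, and the reader assumes plain 4-line FASTQ records.

```python
import random
from collections import Counter

def read_fastq_qualities(path):
    """Yield the quality line (4th line) of each 4-line FASTQ record."""
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:
                yield line.rstrip("\n")

def position_frequencies(quality_lines):
    """Build one Counter of quality characters per read position."""
    tables = []
    for line in quality_lines:
        while len(tables) < len(line):
            tables.append(Counter())
        for pos, ch in enumerate(line):
            tables[pos][ch] += 1
    return tables

def sample_quality(tables, rng=None):
    """Draw one synthetic quality line, position by position, from the tables."""
    rng = rng or random.Random()
    return "".join(
        rng.choices(list(t), weights=list(t.values()), k=1)[0] for t in tables
    )

tables = position_frequencies(["IH#", "IH5"])
print(sample_quality(tables, random.Random(0)))
```

        Nothing in these tables depends on the bases themselves, only on position and quality value, which is why training data from another species should work as long as the machine and read length match.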

        In my opinion, you would be better served by using any real data from the machine you'll be using for sequencing than by trying to find E. coli Illumina sequence from a different machine. As a case in point, I've used cancer sequencing data to train MAQ simulate for simulations on "normal" data. As far as I can tell, there was nothing in simutrain that biased my results.


        • #5
          Thanks for your advice. I will try it.
