  • The problem of wgsim simulator

    Dear all

    I used wgsim to simulate Illumina reads and got read1 and read2 FASTQ files.

    But the quality score of every base is "2":

    222222222222222222222222222222222222222222222222222222222222222222222222222
    @gi|49175990|ref|NC_000913.2|_26305_26624_0:2:0_2:0:0_b/1
    GTTTTTGTGCCGGTGTAGACCGCGCTATCAGCATTGTTGAAAACGCGCTGGCCATTTGCGGCGCACCGATATATG
    +
    222222222222222222222222222222222222222222222222222222222222222222222222222
    @gi|49175990|ref|NC_000913.2|_967853_968177_1:7:3_2:3:0_c/1
    TGACGATTACCGCATAAACCGACTTTAAGCACCCCGCTCGCTAACGCATACGCCCCGCCGGCAACCACCAGCCAT


    Is something wrong with my command?
    ~/Simulation/samtools-wgsim$ wgsim -d 350 -s 30 -N 70000 -1 75 -2 75 /Simulation out.read1.fq out.read2.fq

    Or is this output normal?
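
    For reference, a quality character can be decoded to a numeric score. The following is a minimal sketch, assuming the files use the Phred+33 (Sanger) offset; that offset is an assumption, not something confirmed in this thread. The function name phred_counts and the sample record are made up for illustration.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) encoding


def phred_counts(fastq_lines, offset=PHRED_OFFSET):
    """Count numeric quality scores over 4-line FASTQ records."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # the fourth line of each record is the quality string
            counts.update(ord(ch) - offset for ch in line.strip())
    return counts


# Hypothetical record for illustration (not taken from the real output files).
record = [
    "@read1/1",
    "GTTTTTGTGC",
    "+",
    "2222222222",
]
print(phred_counts(record))  # Counter({17: 10})
```

    Under that assumption every "2" decodes to a score of 17, so a file with a single key in this tally has constant, uninformative qualities.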

  • #2
    Hi,

    I ran into a similar problem a few months back. My memory is a bit fuzzy, but if I recall correctly, wgsim only simulates the read sequences; it doesn't do anything to simulate quality scores.

    To get simulated quality scores as well, I switched to maq simulate. If you give it a sequence file from some NGS data (e.g. a run of paired-end reads), it will create synthetic quality scores based on the quality scores from your NGS data. I think it uses a Markov process to generate the quality lines.

    Hope this helps!
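
    To illustrate the general idea of generating quality lines with a Markov process, here is a minimal sketch of a first-order chain trained on real quality strings. It is not MAQ's actual model or code; the function names (train_markov, simulate_quality) and the training strings are hypothetical, and the chain ignores read position for simplicity.

```python
import random
from collections import Counter, defaultdict


def train_markov(quality_strings):
    """Count first characters and character-to-character transitions."""
    start = Counter()
    trans = defaultdict(Counter)
    for q in quality_strings:
        if not q:
            continue
        start[q[0]] += 1
        for prev, nxt in zip(q, q[1:]):
            trans[prev][nxt] += 1
    return start, trans


def _draw(counter, rng):
    """Sample one symbol with probability proportional to its count."""
    symbols = list(counter)
    weights = [counter[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]


def simulate_quality(start, trans, length, rng):
    """Generate one synthetic quality string of the given length."""
    if length <= 0:
        return ""
    q = [_draw(start, rng)]
    while len(q) < length:
        nxt_counts = trans.get(q[-1])
        # Fall back to the start distribution if this state was never
        # followed by anything in the training data.
        q.append(_draw(nxt_counts if nxt_counts else start, rng))
    return "".join(q)


# Hypothetical training data standing in for real quality lines.
training = ["IIIIHHGG", "IIHHHGGF"]
start, trans = train_markov(training)
print(simulate_quality(start, trans, 8, random.Random(0)))
```

    The synthetic string only ever contains characters seen in training, which is why the choice of training run matters.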



    • #3
      Originally posted by MBekritsky
      Hi,

      I ran into a similar problem a few months back. My memory is a bit fuzzy, but if I recall correctly, wgsim only simulates the read sequences; it doesn't do anything to simulate quality scores.

      To get simulated quality scores as well, I switched to maq simulate. If you give it a sequence file from some NGS data (e.g. a run of paired-end reads), it will create synthetic quality scores based on the quality scores from your NGS data. I think it uses a Markov process to generate the quality lines.

      Hope this helps!
      Thank you very much. This information is very helpful for me.

      I tried using Maq to simulate E. coli K12 Illumina sequencing reads, and I noticed that I need a simupars.dat file to simulate the data. According to the manual, I can get simupars.dat by running "simutrain" or from the Maq website, but I didn't find any such file on the Maq download page. On the other hand, I don't have real E. coli K12 Illumina data to generate simupars.dat myself.

      Can anyone help me figure out this issue? Thanks in advance!



      • #4
        Hi again,

        In my experience, the data file doesn't need to be a run from the same species, since all simutrain does (I think) is calculate quality score frequencies by read position. My suggestion is to find some recent sequencing data from the same machine, with the same read length as the samples you'll be submitting, and use that for simutrain. That way, if there are any quirks or quality score phenomena intrinsic to the machine you'll be using, you may be able to catch them at the simulation stage.

        In my opinion, you would be better served by using any real data from the machine you'll be using for sequencing than by trying to find E. coli Illumina sequence from a different machine. As a case in point, I've used cancer sequencing data to train MAQ simulate for simulations on "normal" data. As far as I can tell, there was nothing in simutrain that biased my results.
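
        As a rough picture of what "quality score frequencies by read position" means, here is a minimal sketch. It is only an illustration of that idea, not simutrain's actual algorithm or file format; the function name position_quality_freqs and the input strings are hypothetical.

```python
from collections import Counter


def position_quality_freqs(quality_strings):
    """Return, for each read position, the fraction of reads with each quality character."""
    counts = []
    for q in quality_strings:
        for pos, ch in enumerate(q):
            if pos == len(counts):
                counts.append(Counter())  # first read long enough to reach this position
            counts[pos][ch] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({ch: n / total for ch, n in c.items()})
    return freqs


# Hypothetical quality lines; real input would come from a FASTQ file.
for pos, f in enumerate(position_quality_freqs(["IIH", "IHH", "IH"])):
    print(pos, f)
```

        Nothing in this tally depends on the bases themselves, which fits the point that the training run need not come from the same species.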



        • #5
          Thanks for your advice. I will try it.
