Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • wilflugo
    Junior Member
    • Dec 2014
    • 8

    How to generate random short reads from a reference genome

    Hi,

    I am in the process of generating accuracy benchmarks against different assembly algorithms/tools (including one of my own). Is there an established procedure or common tool to generate these reads from a reference genome?

    I know bedtools is able to generate reads based on random positions in the genome, but from what I read seems they will be exact matches (which are randomly placed) against the genome. http://seqanswers.com/forums/images/icons/icon8.gif

    I am looking for a tool that is able to add 'noise' to these generated reads in the form of insertions, deletions or mutations on single or few nucleotides.

    Such a tool or procedure exist?

    Thanks in advance.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    I wrote a program for that purpose; it's part of BBTools. Basic usage:

    randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

    You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

    For usage information, run the shellscript with no arguments (or open it with a text editor).

    I also have a couple of programs for grading sam files generated using these reads by parsing the read names (samtoroc.sh and gradesam.sh).

    Comment

    • HESmith
      Senior Member
      • Oct 2009
      • 512

      #3
      I want to generate synthetic reads to benchmark some variant-calling pipelines. The strategy is to introduce variants into the reference genome, generate random reads from this reference, then align to the non-variant reference. BBTools randomreads.sh seems perfect for this application, esp. the ability to introduce errors into the reads. Two questions:

      1) I'd like to generate ~40M 50bp SE reads. Is there an upper limit on the number of independent reads that can be produced?

      2) I'd also like to create heterozygous variant reads (by combining the output from a second variant reference). I noticed that randomreads produces the identical output (chromosome/position) by default. Is there a way to change the random generator (e.g., by specifying a different seed) to obtain different output reads?

      Thanks,
      Harold

      Comment

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Harold: Take a look at ART as an additional option. http://www.niehs.nih.gov/research/re...tatistics/art/

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          Originally posted by HESmith View Post
          1) I'd like to generate ~40M 50bp SE reads. Is there an upper limit on the number of independent reads that can be produced?
          There's no limit; just set "reads=40000000 length=50".

          2) I'd also like to create heterozygous variant reads (by combining the output from a second variant reference). I noticed that randomreads produces the identical output (chromosome/position) by default. Is there a way to change the random generator (e.g., by specifying a different seed) to obtain different output reads?
          Yes - "seed=-1" will use a random seed; any other value will use that specific number as the seed.

          Comment

          • HESmith
            Senior Member
            • Oct 2009
            • 512

            #6
            Great! Thanks for the quick reply, Brian.

            Comment

            • wilflugo
              Junior Member
              • Dec 2014
              • 8

              #7
              I was able to generate all the reads I want with more variabled that I could have asked for. Thanks!. This tool is great!.

              However, how I can know the origins of a read?

              For example, now I have:
              >0
              ATACTAAGCGATAGACAAATGACTTTTGCTTCCCTACCATTGA

              How do I know which specific scaffold and specific position on the genome this read "mutation" was based from?

              From the documentation it says "Read names indicate their genomic origin." However on all my runs the read names are just numbers between 0 and reads-1.
              Last edited by wilflugo; 04-10-2015, 08:45 AM.

              Comment

              • HESmith
                Senior Member
                • Oct 2009
                • 512

                #8
                It looks like your output is in FASTA format. To get chromosome/contig and position information in the identifier, output in FASTQ format. If necessary, you can convert to FASTA using the reformat command.

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  #9
                  Hmmm....

                  randomreads.sh reads=4 out=random.fq

                  Gave me:

                  Code:
                  @0_chr1_0_3026641_3026740_3018641_Ecoli
                  GTTCGCCGGGGGCAGCAAACTCAATGCTACACCAACCCGTACCGATAAAAAGATTGCCATTTCCTTACAGGATCTGGAACTGGACTGGGTTGACTGGGAT
                  +
                  ====@==@AAD?>D@@==DACBC?@BB@C==AB==A@D>AD==?CB==@=B?=A>D?=DB=?>>D@EB===??=@C=?C>@>@B>=?C@@>=====?@>=
                  @1_chr1_0_831114_831213_823114_Ecoli
                  TCCGTGGGCGCGGGTAAAAATCTCATCCAGTCCGGCCTGCACTTTTAACGGATGATTAGCTTTTTGCCGCCAGTCATTGAAATCACCCGCCACCAATACC
                  +
                  ????@CCBECFFGGFEDFGEDFEDGEDFEFEFEFCFGFEEDECEFFADEEEEGCFDFEEEEBCCEFFFFDEDDFEECCFEDEGFFF@BFEDADAC????@
                  @2_chr1_1_15074406_15074505_4818091_Soneidensis
                  AGGAAAATGTCGCGTTATCGCCTCTACTTGCCTGTATTATCGAGGAGCAGCAACGGGTACATGGTCATGAGCAAAAAATTCAATTATTTTCTGACGATAC
                  +
                  @@BCA@FGEHFGGFGHHFHGEHGFFFHEGFGHGFGEEFEEFFFGFECDFFDHHEDFFDGFEGEEGDDGDEEEECGEDDECFGCFFEFDGFEFFDBAAB@@
                  @3_chr1_0_4666555_4666654_18580_Evietnamensis
                  ATTTTATAAATATAATTCCATTATTAACTATTTACTGCTTAAAAAATCATTTACAGCCAAAAATAAGTAAATTATTTACATTTTCAGACTGCTTGCAGCT
                  +
                  <<<>>@@BABCCDBDDBDBDDCABBBDB?BCCBBBCBBABC@@CBA@C@C=A=B?BCA<?AA@?B>@B?>CA@@BABBA>@@?BB?=CA<A=A<?><<><
                  The headers are a little annoying, but take this read, for example:

                  @2_chr1_1_15074406_15074505_4818091_Soneidensis

                  That's read 2; it came from the minus strand; and starts at position 4818091 in the sequence named Soneidensis. The stop location of the read is harder to figure out, but is 4818091+(15074505-15074406)=4818190.

                  I just tested it and it looks like with fasta output you don't get the custom names. In that case, generate fastq output, then you can change it to fasta format like this, keeping the names:

                  reformat.sh in=random.fq out=random.fa

                  I will fix that oversight soon.

                  Edit: Ninja'd by HESmith!

                  Comment

                  • HESmith
                    Senior Member
                    • Oct 2009
                    • 512

                    #10
                    Even the blind squirrel finds the occasional nut…

                    Does this mean I've earned my first milli-bushnell?

                    -Harold

                    Comment

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      #11
                      Yes, I award you +0.001

                      Comment

                      • wilflugo
                        Junior Member
                        • Dec 2014
                        • 8

                        #12
                        Thank you guys!. That was the problem. I was generating the output in FASTA.
                        thanks again.

                        Comment

                        • wilflugo
                          Junior Member
                          • Dec 2014
                          • 8

                          #13
                          Sorry to continue bothering you guys with this. However, I have been using randomreads.sh from BBmap a lot recently and want to see if what I need can be accomplished.

                          I am trying to get a based set of reads that are an exact match against the reference genome (no SNPs, no quality issues, just exact chunks of the genome). Since what I am doing is benchmarking different tools accuracy (including one of my own) I want to have a base test case where all tools can match all the reads. Then I would like to start adding SNPs to the reads generation to see how each tool behaves.

                          Seems I have not being able to generate a set of reads which could be exact matches. This is what I am trying to do:

                          $ ./randomreads.sh ref=<my_reference> maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1

                          Seems that around 45% of the reads generated are exact matches (calculated by doing grep file matches of the generated read against the reference) but not all of them are. Are there any other parameters that could be added to for exact reads? I could create a simple program myself but I want to continue using random reads for adding SNPs once my base is obtained.

                          I am assuming that after I can get an exact read then I will just be adding SNPs (using maxsnps parameter). Correct?

                          Comment

                          • HESmith
                            Senior Member
                            • Oct 2009
                            • 512

                            #14
                            'grep' doesn't wrap lines, and most references have a defined number of characters per line. I suspect that your unmatched reads are split across two lines. Check and let us know.

                            Comment

                            • wilflugo
                              Junior Member
                              • Dec 2014
                              • 8

                              #15
                              Thanks, yes you are correct. However, I forgot to mention that for the purpose of this exercise I trimmed all scaffolds in the reference genome to have their sequence on a single line.

                              The reason I did that was to understand why the tools I am benchmarking produced so low accuracy when compare to the start position on the read. When I performed the grep exercise I found the the tools results matched perfectly on the reads that were exact, but bad on the non exact one. Since I want to get the same baseline I want to have a set of perfect matches.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              26 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              43 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              48 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...