SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
how to generate random size matched background regions gene_x Bioinformatics 1 07-25-2015 02:14 AM
How to generate random short sequences pony2001mx Bioinformatics 2 12-16-2013 11:09 PM
Generate short-reads for RNA-Seq. wilson90 Bioinformatics 0 01-19-2013 08:49 PM
Finding genome coverage using random reads smandape1 Bioinformatics 1 12-07-2011 09:32 PM
looking for reference genome based assembler for short-reads zchou Bioinformatics 3 12-16-2009 08:13 PM

Reply
 
Thread Tools
Old 12-15-2014, 03:52 PM   #1
wilflugo
Junior Member
 
Location: Puerto Rico

Join Date: Dec 2014
Posts: 8
Default How to generate random short reads from a reference genome

Hi,

I am in the process of generating accuracy benchmarks against different assembly algorithms/tools (including one of my own). Is there an established procedure or common tool to generate these reads from a reference genome?

I know bedtools is able to generate reads based on random positions in the genome, but from what I read seems they will be exact matches (which are randomly placed) against the genome. http://seqanswers.com/forums/images/icons/icon8.gif

I am looking for a tool that is able to add 'noise' to these generated reads in the form of insertions, deletions or mutations on single or few nucleotides.

Such a tool or procedure exist?

Thanks in advance.
wilflugo is offline   Reply With Quote
Old 12-15-2014, 04:42 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I wrote a program for that purpose; it's part of BBTools. Basic usage:

randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

For usage information, run the shellscript with no arguments (or open it with a text editor).

I also have a couple of programs for grading sam files generated using these reads by parsing the read names (samtoroc.sh and gradesam.sh).
Brian Bushnell is offline   Reply With Quote
Old 01-09-2015, 08:03 AM   #3
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

I want to generate synthetic reads to benchmark some variant-calling pipelines. The strategy is to introduce variants into the reference genome, generate random reads from this reference, then align to the non-variant reference. BBTools randomreads.sh seems perfect for this application, esp. the ability to introduce errors into the reads. Two questions:

1) I'd like to generate ~40M 50bp SE reads. Is there an upper limit on the number of independent reads that can be produced?

2) I'd also like to create heterozygous variant reads (by combining the output from a second variant reference). I noticed that randomreads produces the identical output (chromosome/position) by default. Is there a way to change the random generator (e.g., by specifying a different seed) to obtain different output reads?

Thanks,
Harold
HESmith is offline   Reply With Quote
Old 01-09-2015, 08:16 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Harold: Take a look at ART as an additional option. http://www.niehs.nih.gov/research/re...tatistics/art/
GenoMax is offline   Reply With Quote
Old 01-09-2015, 09:15 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by HESmith View Post
1) I'd like to generate ~40M 50bp SE reads. Is there an upper limit on the number of independent reads that can be produced?
There's no limit; just set "reads=40000000 length=50".

Quote:
2) I'd also like to create heterozygous variant reads (by combining the output from a second variant reference). I noticed that randomreads produces the identical output (chromosome/position) by default. Is there a way to change the random generator (e.g., by specifying a different seed) to obtain different output reads?
Yes - "seed=-1" will use a random seed; any other value will use that specific number as the seed.
Brian Bushnell is offline   Reply With Quote
Old 01-09-2015, 09:49 AM   #6
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Great! Thanks for the quick reply, Brian.
HESmith is offline   Reply With Quote
Old 04-10-2015, 08:21 AM   #7
wilflugo
Junior Member
 
Location: Puerto Rico

Join Date: Dec 2014
Posts: 8
Default

I was able to generate all the reads I want with more variabled that I could have asked for. Thanks!. This tool is great!.

However, how I can know the origins of a read?

For example, now I have:
>0
ATACTAAGCGATAGACAAATGACTTTTGCTTCCCTACCATTGA

How do I know which specific scaffold and specific position on the genome this read "mutation" was based from?

From the documentation it says "Read names indicate their genomic origin." However on all my runs the read names are just numbers between 0 and reads-1.

Last edited by wilflugo; 04-10-2015 at 08:45 AM.
wilflugo is offline   Reply With Quote
Old 04-10-2015, 09:10 AM   #8
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

It looks like your output is in FASTA format. To get chromosome/contig and position information in the identifier, output in FASTQ format. If necessary, you can convert to FASTA using the reformat command.
HESmith is offline   Reply With Quote
Old 04-10-2015, 09:22 AM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hmmm....

randomreads.sh reads=4 out=random.fq

Gave me:

Code:
@0_chr1_0_3026641_3026740_3018641_Ecoli
GTTCGCCGGGGGCAGCAAACTCAATGCTACACCAACCCGTACCGATAAAAAGATTGCCATTTCCTTACAGGATCTGGAACTGGACTGGGTTGACTGGGAT
+
====@==@AAD?>D@@==DACBC?@BB@C==AB==A@D>AD==?CB==@=B?=A>D?=DB=?>>D@EB===??=@C=?C>@>@B>=?C@@>=====?@>=
@1_chr1_0_831114_831213_823114_Ecoli
TCCGTGGGCGCGGGTAAAAATCTCATCCAGTCCGGCCTGCACTTTTAACGGATGATTAGCTTTTTGCCGCCAGTCATTGAAATCACCCGCCACCAATACC
+
????@CCBECFFGGFEDFGEDFEDGEDFEFEFEFCFGFEEDECEFFADEEEEGCFDFEEEEBCCEFFFFDEDDFEECCFEDEGFFF@BFEDADAC????@
@2_chr1_1_15074406_15074505_4818091_Soneidensis
AGGAAAATGTCGCGTTATCGCCTCTACTTGCCTGTATTATCGAGGAGCAGCAACGGGTACATGGTCATGAGCAAAAAATTCAATTATTTTCTGACGATAC
+
@@BCA@FGEHFGGFGHHFHGEHGFFFHEGFGHGFGEEFEEFFFGFECDFFDHHEDFFDGFEGEEGDDGDEEEECGEDDECFGCFFEFDGFEFFDBAAB@@
@3_chr1_0_4666555_4666654_18580_Evietnamensis
ATTTTATAAATATAATTCCATTATTAACTATTTACTGCTTAAAAAATCATTTACAGCCAAAAATAAGTAAATTATTTACATTTTCAGACTGCTTGCAGCT
+
<<<>>@@BABCCDBDDBDBDDCABBBDB?BCCBBBCBBABC@@CBA@C@C=A=B?BCA<?AA@?B>@B?>CA@@BABBA>@@?BB?=CA<A=A<?><<><
The headers are a little annoying, but take this read, for example:

@2_chr1_1_15074406_15074505_4818091_Soneidensis

That's read 2; it came from the minus strand; and starts at position 4818091 in the sequence named Soneidensis. The stop location of the read is harder to figure out, but is 4818091+(15074505-15074406)=4818190.

I just tested it and it looks like with fasta output you don't get the custom names. In that case, generate fastq output, then you can change it to fasta format like this, keeping the names:

reformat.sh in=random.fq out=random.fa

I will fix that oversight soon.

Edit: Ninja'd by HESmith!
Brian Bushnell is offline   Reply With Quote
Old 04-10-2015, 09:54 AM   #10
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

Even the blind squirrel finds the occasional nutů

Does this mean I've earned my first milli-bushnell?

-Harold
HESmith is offline   Reply With Quote
Old 04-10-2015, 09:56 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yes, I award you +0.001
Brian Bushnell is offline   Reply With Quote
Old 04-10-2015, 11:48 AM   #12
wilflugo
Junior Member
 
Location: Puerto Rico

Join Date: Dec 2014
Posts: 8
Default

Thank you guys!. That was the problem. I was generating the output in FASTA.
thanks again.
wilflugo is offline   Reply With Quote
Old 05-18-2015, 05:13 AM   #13
wilflugo
Junior Member
 
Location: Puerto Rico

Join Date: Dec 2014
Posts: 8
Default

Sorry to continue bothering you guys with this. However, I have been using randomreads.sh from BBmap a lot recently and want to see if what I need can be accomplished.

I am trying to get a based set of reads that are an exact match against the reference genome (no SNPs, no quality issues, just exact chunks of the genome). Since what I am doing is benchmarking different tools accuracy (including one of my own) I want to have a base test case where all tools can match all the reads. Then I would like to start adding SNPs to the reads generation to see how each tool behaves.

Seems I have not being able to generate a set of reads which could be exact matches. This is what I am trying to do:

$ ./randomreads.sh ref=<my_reference> maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1

Seems that around 45% of the reads generated are exact matches (calculated by doing grep file matches of the generated read against the reference) but not all of them are. Are there any other parameters that could be added to for exact reads? I could create a simple program myself but I want to continue using random reads for adding SNPs once my base is obtained.

I am assuming that after I can get an exact read then I will just be adding SNPs (using maxsnps parameter). Correct?
wilflugo is offline   Reply With Quote
Old 05-18-2015, 06:18 AM   #14
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 503
Default

'grep' doesn't wrap lines, and most references have a defined number of characters per line. I suspect that your unmatched reads are split across two lines. Check and let us know.
HESmith is offline   Reply With Quote
Old 05-18-2015, 07:38 AM   #15
wilflugo
Junior Member
 
Location: Puerto Rico

Join Date: Dec 2014
Posts: 8
Default

Thanks, yes you are correct. However, I forgot to mention that for the purpose of this exercise I trimmed all scaffolds in the reference genome to have their sequence on a single line.

The reason I did that was to understand why the tools I am benchmarking produced so low accuracy when compare to the start position on the read. When I performed the grep exercise I found the the tools results matched perfectly on the reads that were exact, but bad on the non exact one. Since I want to get the same baseline I want to have a set of perfect matches.
wilflugo is offline   Reply With Quote
Old 05-18-2015, 09:07 AM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by wilflugo View Post

Seems I have not being able to generate a set of reads which could be exact matches. This is what I am trying to do:

$ ./randomreads.sh ref=<my_reference> maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1

Seems that around 45% of the reads generated are exact matches (calculated by doing grep file matches of the generated read against the reference) but not all of them are. Are there any other parameters that could be added to for exact reads? I could create a simple program myself but I want to continue using random reads for adding SNPs once my base is obtained.

I am assuming that after I can get an exact read then I will just be adding SNPs (using maxsnps parameter). Correct?
Using that command I got 100% of reads perfectly matching the reference:

Quote:
C:\temp\ecoli>java -ea -Xmx1g align2.RandomReads3 maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1
snpRate=0.0, max=0, unique=true
insRate=0.0, max=0, len=(0-0)
delRate=0.0, max=0, len=(0-0)
subRate=0.0, max=0, len=(0-0)
nRate =0.0, max=0, len=(0-0)
genome=1
PERFECT_READ_RATIO=0.0
ADD_ERRORS_FROM_QUALITY=false
REPLACE_NOREF=false
paired=false
read length=18-55
Wrote reads.fastq
Time: 0.344 seconds.

C:\temp\ecoli>java -ea -Xmx1g align2.BBMap in=reads.fastq
Executing align2.BBMap [in=reads.fastq]

BBMap version 34.94
Retaining first best site only for ambiguous mappings.
No output file.
Set genome to 1

Loaded Reference: 0.057 seconds.
Loading index for chunk 1-1, build 1
Generated Index: 0.924 seconds.
Analyzed Index: 1.860 seconds.
Cleared Memory: 0.188 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 1000 (36146 bases)

Mapping: 0.275 seconds.
Reads/sec: 3630.35
kBases/sec: 131.22


Read 1 data: pct reads num reads pct bases num bases

mapped: 100.0000% 1000 100.0000% 36146
unambiguous: 96.8000% 968 96.9181% 35032
ambiguous: 3.2000% 32 3.0819% 1114
low-Q discards: 0.0000% 0 0.0000% 0

perfect best site: 100.0000% 1000 100.0000% 36146
semiperfect site: 100.0000% 1000 100.0000% 36146

Match Rate: NA NA 100.0000% 36146
Error Rate: 0.0000% 0 0.0000% 0
Sub Rate: 0.0000% 0 0.0000% 0
Del Rate: 0.0000% 0 0.0000% 0
Ins Rate: 0.0000% 0 0.0000% 0
N Rate: 0.0000% 0 0.0000% 0

Total time: 3.521 seconds.
Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.

You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:

randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5

randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5


The RNG for the SNPs and positions are independent.
Brian Bushnell is offline   Reply With Quote
Old 05-18-2015, 09:18 AM   #17
wilflugo
Junior Member
 
Location: Puerto Rico

Join Date: Dec 2014
Posts: 8
Default

Quote:
Originally Posted by Brian Bushnell View Post
Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.
This is my problem then. This makes perfect sense know I think about it. I was incorrectly assuming all reads generated where from the same strand as in the genome.

Thanks again.
wilflugo is offline   Reply With Quote
Old 11-08-2015, 01:13 PM   #18
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 232
Default

Quote:
Originally Posted by Brian Bushnell View Post
I wrote a program for that purpose; it's part of BBTools. Basic usage:

randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

For usage information, run the shellscript with no arguments (or open it with a text editor).

I also have a couple of programs for grading sam files generated using these reads by parsing the read names (samtoroc.sh and gradesam.sh).
Will this work using a transcriptome in FASTA format as input?
cement_head is offline   Reply With Quote
Old 11-08-2015, 02:48 PM   #19
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yes, it will. Any fasta is acceptable. You can't do anything regarding custom differential expression, though; it tries to generate a flat distribution.

Last edited by Brian Bushnell; 11-08-2015 at 02:51 PM.
Brian Bushnell is offline   Reply With Quote
Old 11-08-2015, 04:59 PM   #20
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 232
Default

Quote:
Originally Posted by Brian Bushnell View Post
Yes, it will. Any fasta is acceptable. You can't do anything regarding custom differential expression, though; it tries to generate a flat distribution.
Great! That's perfect - I need to "shred" a transcriptome so that I can use it to test a tool I'm making for molecular indexing. Thanks
cement_head is offline   Reply With Quote
Reply

Tags
assembly, random reads, reference genome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:35 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO