**Brian Bushnell** · 05-18-2015, 09:07 AM

Originally posted by wilflugo

It seems I have not been able to generate a set of reads that are all exact matches. This is what I am trying to do:

$ ./randomreads.sh ref=<my_reference> maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1

It seems that around 45% of the generated reads are exact matches (calculated by grepping each generated read against the reference), but not all of them are. Are there any other parameters that could be added to get exact reads? I could write a simple program myself, but I want to keep using RandomReads for adding SNPs once I have my baseline.

I am assuming that once I can get exact reads, I will just add SNPs (using the maxsnps parameter). Correct?

Using that command I got 100% of reads perfectly matching the reference:

C:\temp\ecoli>java -ea -Xmx1g align2.RandomReads3 maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1
snpRate=0.0, max=0, unique=true
insRate=0.0, max=0, len=(0-0)
delRate=0.0, max=0, len=(0-0)
subRate=0.0, max=0, len=(0-0)
nRate =0.0, max=0, len=(0-0)
genome=1
PERFECT_READ_RATIO=0.0
ADD_ERRORS_FROM_QUALITY=false
REPLACE_NOREF=false
paired=false
read length=18-55
Wrote reads.fastq
Time: 0.344 seconds.

C:\temp\ecoli>java -ea -Xmx1g align2.BBMap in=reads.fastq
Executing align2.BBMap [in=reads.fastq]

BBMap version 34.94
Retaining first best site only for ambiguous mappings.
No output file.
Set genome to 1

Loaded Reference: 0.057 seconds.
Loading index for chunk 1-1, build 1
Generated Index: 0.924 seconds.
Analyzed Index: 1.860 seconds.
Cleared Memory: 0.188 seconds.
Processing reads in single-ended mode.
Started read stream.
Started 8 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 1000 (36146 bases)

Mapping: 0.275 seconds.
Reads/sec: 3630.35
kBases/sec: 131.22

Read 1 data:          pct reads    num reads    pct bases    num bases

mapped:               100.0000%    1000         100.0000%    36146
unambiguous:          96.8000%     968          96.9181%     35032
ambiguous:            3.2000%      32           3.0819%      1114
low-Q discards:       0.0000%      0            0.0000%      0

perfect best site:    100.0000%    1000         100.0000%    36146
semiperfect site:     100.0000%    1000         100.0000%    36146

Match Rate:           NA           NA           100.0000%    36146
Error Rate:           0.0000%      0            0.0000%      0
Sub Rate:             0.0000%      0            0.0000%      0
Del Rate:             0.0000%      0            0.0000%      0
Ins Rate:             0.0000%      0            0.0000%      0
N Rate:               0.0000%      0            0.0000%      0

Total time: 3.521 seconds.

Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.
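
To make the strand point concrete, here is a minimal Python sketch of why a plain grep against the reference misses minus-strand reads. The helper names and the toy sequences are illustrative assumptions and are not part of BBTools.

```python
# Illustrative helpers; these names are assumptions, not BBTools functions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def matches_either_strand(read: str, reference: str) -> bool:
    """True if the read, or its reverse complement, occurs exactly in the reference."""
    read = read.upper()
    reference = reference.upper()
    return read in reference or reverse_complement(read) in reference


# Toy data: a made-up reference and two error-free reads.
reference = "ACGTTGCAAGGCTTACG"
plus_read = "TTGCAAGG"                        # copied from the plus strand
minus_read = reverse_complement("AGGCTTAC")   # as if drawn from the minus strand

print(plus_read in reference)                        # plain substring search finds it
print(minus_read in reference)                       # plain substring search misses it
print(matches_either_strand(minus_read, reference))  # strand-aware check finds it
```

A grep-style check only tests the first condition, so roughly half of the perfect reads look like mismatches even though their reverse complements match exactly.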

You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:

randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5

randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5

The RNGs for the SNPs and for the read positions are independent, so adding SNPs does not change where the reads are drawn from.
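
One way to sanity-check the two files is to compare them read by read. The sketch below is an illustration only: it assumes both files list the reads in the same order and that SNP-only reads keep their length, and it uses small made-up in-memory records in place of perfect.fastq and 2snps.fastq.

```python
import io
from typing import Iterator, List, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def mismatches_per_read(perfect: TextIO, mutated: TextIO) -> List[int]:
    """Count differing bases for each read pair, assuming equal order and length."""
    counts = []
    for a, b in zip(read_fastq_sequences(perfect), read_fastq_sequences(mutated)):
        if len(a) != len(b):
            raise ValueError("reads differ in length; the files may not line up")
        counts.append(sum(x != y for x, y in zip(a, b)))
    return counts


# Made-up records standing in for the two output files.
perfect = io.StringIO("@r0\nACGTACGT\n+\nIIIIIIII\n@r1\nTTGGCCAA\n+\nIIIIIIII\n")
mutated = io.StringIO("@r0\nACGAACGT\n+\nIIIIIIII\n@r1\nTAGGCCAT\n+\nIIIIIIII\n")
print(mismatches_per_read(perfect, mutated))  # one count per read
```

With real files, the same function can be called on two handles opened with `open("perfect.fastq")` and `open("2snps.fastq")`.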

**wilflugo** · 05-18-2015, 09:18 AM

Originally posted by Brian Bushnell

Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.

This is my problem, then. It makes perfect sense now that I think about it. I was incorrectly assuming all generated reads were from the same strand as the genome.

Thanks again.

**cement_head** · 11-08-2015, 02:13 PM

Originally posted by Brian Bushnell

I wrote a program for that purpose; it's part of BBTools. Basic usage:

randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of SNPs, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events are individually customizable; the quality values can alternatively be set so that errors are generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

For usage information, run the shellscript with no arguments (or open it with a text editor).

I also have a couple of programs for grading sam files generated using these reads by parsing the read names (samtoroc.sh and gradesam.sh).

Will this work using a transcriptome in FASTA format as input?

**Brian Bushnell** · 11-08-2015, 03:48 PM

Yes, it will. Any fasta is acceptable. You can't do anything regarding custom differential expression, though; it tries to generate a flat distribution.

**cement_head** · 11-08-2015, 05:59 PM

Originally posted by Brian Bushnell

Yes, it will. Any fasta is acceptable. You can't do anything regarding custom differential expression, though; it tries to generate a flat distribution.

Great! That's perfect - I need to "shred" a transcriptome so that I can use it to test a tool I'm making for molecular indexing. Thanks

**Brian Bushnell** · 11-08-2015, 06:58 PM

Incidentally, there's another tool that will do that too, Shred:

shred.sh in=ref.fasta out=reads.fastq length=200

The difference is that RandomReads will make reads in a random order from random locations, ensuring flat coverage on average, but it won't ensure 100% coverage unless you generate many fold depth. Shred, on the other hand, gives you exactly 1x depth and exactly 100% coverage (and is not capable of modelling errors). So, the use-cases are different.
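
The coverage difference can be illustrated with a small Python sketch. This is a conceptual model only, not the actual code of either tool; the function names, the toy sequence, and the seed are assumptions made for the example.

```python
import random
from typing import List


def shred(seq: str, length: int) -> List[str]:
    """Cut seq into consecutive non-overlapping pieces, covering every base once."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


def random_starts(seq_len: int, length: int, count: int, seed: int = 5) -> List[int]:
    """Pick uniformly random start positions for `count` fragments."""
    rng = random.Random(seed)
    return [rng.randint(0, seq_len - length) for _ in range(count)]


def covered_fraction(seq_len: int, length: int, starts: List[int]) -> float:
    """Fraction of positions covered by at least one fragment."""
    covered = [False] * seq_len
    for s in starts:
        for i in range(s, min(s + length, seq_len)):
            covered[i] = True
    return sum(covered) / seq_len


seq = "ACGT" * 250                          # toy 1000 bp sequence
pieces = shred(seq, 100)
print(len(pieces), "".join(pieces) == seq)  # shredding reassembles the input exactly

starts = random_starts(len(seq), 100, count=10)  # also 1x depth on average
print(covered_fraction(len(seq), 100, starts))   # usually well below 1.0 at 1x depth
```

Raising `count` pushes the covered fraction toward 1.0, which is the "many fold depth" requirement mentioned above.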

**cement_head** · 11-09-2015, 11:54 AM

(Hopefully) last question(s):

I'd like to take a transcriptome (a single file containing all the FASTA records; actually the ZF transcriptome) and fragment it in silico into 250 bp "inserts". Then I'd like to generate a pair of 100 bp PE reads from each fragment.

In other words, if I have 100,000 fragments of 250 bp, I'd like to end up with 200,000 PE reads, one pair for each of the 100,000 fragments. I know this is artificial, but because we're trying to check code, we'd like to be very defined and controlled in this first test.

Thanks,
Andor

**Brian Bushnell** · 11-09-2015, 12:07 PM

This depends on exactly what you want to do with transcripts shorter than 250bp, but... there are 2 ways to do this:

randomreads.sh ref=transcriptome.fa out=synth.fq.gz reads=100000 len=100 paired interleaved mininsert=250 maxinsert=250

Or, if you shred it some way so you already have 250bp single-ended reads:

bbfakereads.sh in=shreds.fq out=pairs.fq length=100

I wrote bbfakereads.sh specifically for this purpose.

Incidentally, both of these commands will produce interleaved reads; you can convert between interleaved and dual-file paired, or between fasta and fastq, with reformat.sh, if you have things in the wrong format.

The first command will only produce inserts of exactly 250bp, and the second will only produce inserts of exactly the length of the input sequences.
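
For checking downstream code, it can help to see what a read pair from a fixed-length fragment looks like. The sketch below assumes the usual forward/reverse paired-end orientation (read 1 from the start of the fragment, read 2 as the reverse complement of its end); it is an illustration, not the implementation of either command, and the function names are hypothetical.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pair_from_fragment(fragment: str, read_len: int) -> Tuple[str, str]:
    """Make a forward/reverse read pair from the two ends of one fragment."""
    if len(fragment) < read_len:
        raise ValueError("fragment is shorter than the requested read length")
    read1 = fragment[:read_len]                        # forward read from the left end
    read2 = reverse_complement(fragment[-read_len:])   # reverse read from the right end
    return read1, read2


# Toy 250 bp fragment: 100 A, then 50 C, then 100 G.
fragment = "A" * 100 + "C" * 50 + "G" * 100
read1, read2 = pair_from_fragment(fragment, 100)
print(len(fragment), len(read1), len(read2))
print(read1 == "A" * 100, read2 == "C" * 100)
```

With 250 bp fragments and 100 bp reads, the middle 50 bp of each fragment is not sequenced by either read.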

**cement_head** · 11-11-2015, 12:10 PM

Ok, thanks!

**abolia** · 11-11-2015, 12:11 PM

Hi all,
I want to generate reads for one of my rearranged genomes using a list of genomic coordinates in a BED file. The whole idea is to generate random DNA fragments from designated target regions.

I tried using the Wessim simulator, but it throws a bunch of errors, and when I contacted their team, they said they no longer work on that project.

Does anyone have any idea how to do this? Any help would be great.

Thanks,
Ashini.

**GenoMax** · 11-11-2015, 03:41 PM

Originally posted by abolia

Hi all,
I want to generate reads for one of my rearranged genomes using a list of genomic coordinates in a BED file. The whole idea is to generate random DNA fragments from designated target regions.

I tried using the Wessim simulator, but it throws a bunch of errors, and when I contacted their team, they said they no longer work on that project.

Does anyone have any idea how to do this? Any help would be great.

Thanks,
Ashini.

If I understand this right ...

You could use getfasta from BedTools (http://bedtools.readthedocs.org/en/l.../getfasta.html) to extract regions that you are interested in as fasta. Then use @Brian's randomreads.sh program.
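
For reference, the extraction step amounts to slicing each BED interval out of the matching sequence. The Python sketch below is a simplified illustration of that idea, not a replacement for bedtools; it relies on BED coordinates being 0-based with an exclusive end, and the function names and toy genome are assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_bed(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) from BED lines, skipping blanks and header lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def extract_regions(genome: Dict[str, str], bed_lines: Iterable[str]) -> Dict[str, str]:
    """Return {'chrom:start-end': subsequence} for every BED interval."""
    return {
        f"{chrom}:{start}-{end}": genome[chrom][start:end]
        for chrom, start, end in parse_bed(bed_lines)
    }


# Toy genome and BED content, made up for the example.
genome = {"chr1": "ACGTACGTAC"}
bed_lines = ["# targets", "chr1\t2\t6", "chr1\t0\t3"]
for name, seq in extract_regions(genome, bed_lines).items():
    print(f">{name}\n{seq}")
```

The resulting FASTA records can then be used as the reference for read generation, so that reads only come from the target regions.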

**abolia** · 11-12-2015, 09:17 AM

Thanks GenoMax, your answer looks good. But won't this generate a very uniform distribution across the regions specified in the BED file? I want more of a Gaussian-like distribution for my reads. Any thoughts on whether it can do this?

Thanks,
Ashini.

**Brian Bushnell** · 11-12-2015, 10:17 AM

The distribution would be relatively flat. I don't have any tools that can simulate read generation from baited regions, and I don't know of any. It sounds like something you would have to write yourself. But, RandomReads generates reads annotated by their genomic origin, so it's possible to generate a flat distribution with it, then postprocess them and randomly discard reads with a probability based on the distance from the center of the nearest bait to achieve your goal. It would take a bit of work, of course.
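
The discard step could look roughly like the Python sketch below. It is a sketch under stated assumptions: read positions and bait centers are given as plain integers on one sequence, the keep probability is a Gaussian in the distance to the nearest bait center, and `sigma` is a free parameter. Parsing positions out of the read names is deliberately left out, and all names here are hypothetical.

```python
import math
import random
from typing import List, Sequence


def nearest_center_distance(pos: int, centers: Sequence[int]) -> int:
    """Distance from a read position to the closest bait center."""
    return min(abs(pos - c) for c in centers)


def keep_probability(distance: float, sigma: float) -> float:
    """Gaussian-shaped keep probability: 1.0 at the center, falling off with distance."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


def thin_reads(positions: Sequence[int], centers: Sequence[int],
               sigma: float, seed: int = 5) -> List[int]:
    """Randomly discard reads so that the kept ones pile up around bait centers."""
    rng = random.Random(seed)
    kept = []
    for pos in positions:
        d = nearest_center_distance(pos, centers)
        if rng.random() < keep_probability(d, sigma):
            kept.append(pos)
    return kept


# Flat toy input: one read midpoint every 10 bp, with a single bait centered at 500.
positions = list(range(0, 1001, 10))
kept = thin_reads(positions, centers=[500], sigma=100.0)
print(len(positions), len(kept))
```

Because reads are only ever discarded, the flat input has to be generated at a higher depth than the peak depth wanted in the final set.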

**abolia** · 11-12-2015, 11:35 AM

Thanks Brian,
Yeah I don't think there is anything like that either. I will give it a try and try to do it myself.

Thanks for your reply.
Ashini.

**abolia** · 11-12-2015, 03:17 PM

Hi Brian,

One last question: I have an idea and just want to confirm it with someone knowledgeable.

What if I use getfasta to extract the regions given in the BED file from my rearranged fasta file (let's call it fasta file 1), and then use randomreads.sh to generate reads from the resulting chopped fasta file (let's call it fasta file 2, i.e. the output of getfasta)?
In the next step, I would generate more random reads from the original fasta file 1.
Then I would merge the two fastq files from these two steps. So basically we are adding some extra reads so that the distribution becomes more Gaussian-like rather than flat.

Does this make sense? Do you think it might work, or am I missing something here?

Thanks again,
Ashini.
