Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by wilflugo View Post

    Seems I have not being able to generate a set of reads which could be exact matches. This is what I am trying to do:

    $ ./randomreads.sh ref=<my_reference> maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1

    Seems that around 45% of the reads generated are exact matches (calculated by doing grep file matches of the generated read against the reference) but not all of them are. Are there any other parameters that could be added to for exact reads? I could create a simple program myself but I want to continue using random reads for adding SNPs once my base is obtained.

    I am assuming that after I can get an exact read then I will just be adding SNPs (using maxsnps parameter). Correct?
    Using that command I got 100% of reads perfectly matching the reference:

    C:\temp\ecoli>java -ea -Xmx1g align2.RandomReads3 maxsnps=0 maxinss=0 maxdels=0 maxsubs=0 adderrors=false out=reads.fastq reads=1000 minlength=18 maxlength=55 seed=-1
    snpRate=0.0, max=0, unique=true
    insRate=0.0, max=0, len=(0-0)
    delRate=0.0, max=0, len=(0-0)
    subRate=0.0, max=0, len=(0-0)
    nRate =0.0, max=0, len=(0-0)
    genome=1
    PERFECT_READ_RATIO=0.0
    ADD_ERRORS_FROM_QUALITY=false
    REPLACE_NOREF=false
    paired=false
    read length=18-55
    Wrote reads.fastq
    Time: 0.344 seconds.

    C:\temp\ecoli>java -ea -Xmx1g align2.BBMap in=reads.fastq
    Executing align2.BBMap [in=reads.fastq]

    BBMap version 34.94
    Retaining first best site only for ambiguous mappings.
    No output file.
    Set genome to 1

    Loaded Reference: 0.057 seconds.
    Loading index for chunk 1-1, build 1
    Generated Index: 0.924 seconds.
    Analyzed Index: 1.860 seconds.
    Cleared Memory: 0.188 seconds.
    Processing reads in single-ended mode.
    Started read stream.
    Started 8 mapping threads.
    Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7

    ------------------ Results ------------------

    Genome: 1
    Key Length: 13
    Max Indel: 16000
    Minimum Score Ratio: 0.56
    Mapping Mode: normal
    Reads Used: 1000 (36146 bases)

    Mapping: 0.275 seconds.
    Reads/sec: 3630.35
    kBases/sec: 131.22


    Read 1 data: pct reads num reads pct bases num bases

    mapped: 100.0000% 1000 100.0000% 36146
    unambiguous: 96.8000% 968 96.9181% 35032
    ambiguous: 3.2000% 32 3.0819% 1114
    low-Q discards: 0.0000% 0 0.0000% 0

    perfect best site: 100.0000% 1000 100.0000% 36146
    semiperfect site: 100.0000% 1000 100.0000% 36146

    Match Rate: NA NA 100.0000% 36146
    Error Rate: 0.0000% 0 0.0000% 0
    Sub Rate: 0.0000% 0 0.0000% 0
    Del Rate: 0.0000% 0 0.0000% 0
    Ins Rate: 0.0000% 0 0.0000% 0
    N Rate: 0.0000% 0 0.0000% 0

    Total time: 3.521 seconds.
    Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.

    You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:

    randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5

    randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5


    The RNG for the SNPs and positions are independent.

    Comment


    • #17
      Originally posted by Brian Bushnell View Post
      Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.
      This is my problem then. This makes perfect sense know I think about it. I was incorrectly assuming all reads generated where from the same strand as in the genome.

      Thanks again.

      Comment


      • #18
        Originally posted by Brian Bushnell View Post
        I wrote a program for that purpose; it's part of BBTools. Basic usage:

        randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

        You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events is individually customizable, the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

        For usage information, run the shellscript with no arguments (or open it with a text editor).

        I also have a couple of programs for grading sam files generated using these reads by parsing the read names (samtoroc.sh and gradesam.sh).
        Will this work using a transcriptome in FASTA format as input?

        Comment


        • #19
          Yes, it will. Any fasta is acceptable. You can't do anything regarding custom differential expression, though; it tries to generate a flat distribution.
          Last edited by Brian Bushnell; 11-08-2015, 03:51 PM.

          Comment


          • #20
            Originally posted by Brian Bushnell View Post
            Yes, it will. Any fasta is acceptable. You can't do anything regarding custom differential expression, though; it tries to generate a flat distribution.
            Great! That's perfect - I need to "shred" a transcriptome so that I can use it to test a tool I'm making for molecular indexing. Thanks

            Comment


            • #21
              Incidentally, there's another tool that will do that too, Shred:

              shred.sh in=ref.fasta out=reads.fastq length=200


              The difference is that RandomReads will make reads in a random order from random locations, ensuring flat coverage on average, but it won't ensure 100% coverage unless you generate many fold depth. Shred, on the other hand, gives you exactly 1x depth and exactly 100% coverage (and is not capable of modelling errors). So, the use-cases are different.

              Comment


              • #22
                (Hopefully) last question(s):

                I'd like to take a transcriptome (in the form of a file that contains all the FASTAs) - actually the ZF transcriptome and in silico fragment it into 250 bp "insert" sizes. Then I'd like to generate a pair of 100 bp PE reads from each fragment.

                In other words, if I have 100,000 fragments of 250 bp, I'd like to end up with 200,000 PE reads - one set corresponding to each of the 100,000 fragments. I know this is artificial, but because we're trying to check code, we'd like to be very defined and controlled in this first test.

                Thanks,
                Andor

                Comment


                • #23
                  This depends on exactly what you want to do with transcripts shorter than 250bp, but... there are 2 ways to do this:

                  randomreads.sh ref=transcriptome.fa out=synth.fq.gz reads=100000 len=100 paired interleaved mininsert=250 maxinsert=250

                  Or, if you shred it some way so you already have 250bp single-ended reads:

                  bbfakereads.sh in=shreds.fq out=pairs.fq length=100

                  I wrote that specifically for this purpose Incidentally, both of these commands will produce interleaved reads; you can convert between interleaved and dual-file paired, or between fasta and fastq, with reformat.sh, if you have things in the wrong format.

                  The first command will only produce inserts of exactly 250bp, and the second will only produce inserts of exactly the length of the input sequences.

                  Comment


                  • #24
                    Ok, thanks!

                    Comment


                    • #25
                      Hi all,
                      I want to generate reads for one of my rearranged genomes using a list of genomic coordinates in a BED file. The whole idea is to generate random DNA fragments from designated target regions.

                      I tried using Wessim simulator, but it throws me bunch of errors and when I tried contacting their team, they said they no longer work on that project.

                      Does anyone has any idea how to do this? Any help would be really great.

                      Thanks,
                      Ashini.

                      Comment


                      • #26
                        Originally posted by abolia View Post
                        Hi all,
                        I want to generate reads for one of my rearranged genomes using a list of genomic coordinates in a BED file. The whole idea is to generate random DNA fragments from designated target regions.

                        I tried using Wessim simulator, but it throws me bunch of errors and when I tried contacting their team, they said they no longer work on that project.

                        Does anyone has any idea how to do this? Any help would be really great.

                        Thanks,
                        Ashini.
                        If I understand this right ...

                        You could use getfasta from BedTools (http://bedtools.readthedocs.org/en/l.../getfasta.html) to extract regions that you are interested in as fasta. Then use @Brian's randomreads.sh program.

                        Comment


                        • #27
                          Thanks GenoMax, your answers looks good. But won't this generate a very uniform distribution around the regions specified in the BED file. I want to have more of gaussian kinda distribution for my reads. Any thoughts if it can do this?

                          Thanks,
                          Ashini.

                          Comment


                          • #28
                            The distribution would be relatively flat. I don't have any tools that can simulate read generation from baited regions, and I don't know of any. It sounds like something you would have to write yourself. But, RandomReads generates reads annotated by their genomic origin, so it's possible to generate a flat distribution with it, then postprocess them and randomly discard reads with a probability based on the distance from the center of the nearest bait to achieve your goal. It would take a bit of work, of course.

                            Comment


                            • #29
                              Thanks Brian,
                              Yeah I don't think there is anything like that either. I will give it a try and try to do it myself.

                              Thanks for your reply.
                              Ashini.

                              Comment


                              • #30
                                Hi Brian,

                                One more last question, I have this idea and just want to confirm this with someone knowledgeable.

                                How about if I use getfasta to extract regions (given in the BED file) from my rearranged fasta file (lets call fasta file 1) and then use randomreads.sh to generate reads on that chopped fasta file (lets call fasta file 2, i.e. output of getfasta).
                                In next step, simulatenously generate more random reads from the original fasta file 1.
                                Now merge the two fastq files that you get from these 2 steps. So basically we are throwing some additional reads so that distribution can become more gaussian like rather than flat one.

                                Does this make sense? Do you think it might work or am I missing something here.

                                Thanks again,
                                Ashini.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X