
  • #16
    Hi Andris,

    Sorry for not logging in here for a while... we've been overloaded recently. Every customer tells their friends, and so forth... We're moving into larger offices, so I hope we'll be able to hire more people and reduce the load on us (in the short term it's just adding MORE work). Now, to answer your good question:

    Is this still for SOLiD, or Illumina ?
    I'll assume SOLiD for now:
    With 25mers, if you want to detect more than 2 substitutions, you need to go to VA (Valid Adjacent) mode. This will detect up to 4 color changes, and then apply VA rules to allow up to 2 SNPs (4 color code substitutions). This takes almost twice as long as regular mode (2 color code mismatches). For 25mers, it doesn't make sense to do more than 2 mismatches w/o VA because then you artificially cause repeats which are not real repeats; in other words, you lose specificity. It does make sense to use 3 mismatches for longer read lengths.
    For example, "50,3" is shorthand for "readlength=50 MaxAllowedMismatches=3". Here are some run times:

    25,2 28 minutes
    35,3 50 minutes
    35,4 44 minutes
    50,3 112 minutes
    50,4

    All the runs above are on a single old computer: 8 core (dual socket quad core) 2.0GHz Xeon with 24GB 667MHz RAM, but it does have a faster than normal hard disk (300MByte/sec). It is MUCH faster on the new Imagenix Genome Cruncher, which will be in production in about 2 weeks.

    Also, why do I ask if SOLiD or Illumina ? Illumina has much lower substitution rates for 2 reasons:
    1. A legitimate SNP only causes 1 base change (vs. two color code changes)
    2. The raw machine error rate is lower, or maybe they are just clever enough to filter out lower quality calls, which it doesn't look like SOLiD is doing (yet).

    So you run with a lower (MaxAllowedSubstitution) / (ReadLength) ratio on Illumina data. Three mismatches for Illumina is probably good for around 65mers or so. If you're interested, we'll run a test.
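
    The point about one SNP causing two color changes can be illustrated with a minimal sketch. It assumes the standard SOLiD two-base encoding (each color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3). The function names and the adjacency check are illustrative assumptions, not ISAS internals.

```python
# Color-space sketch: each color is the XOR of the 2-bit codes of two
# adjacent bases (A=0, C=1, G=2, T=3). Names here are hypothetical.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as a list of colors (one per adjacent pair)."""
    codes = [BASE_CODE[b] for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def color_mismatches(ref_colors, read_colors):
    """Return the positions at which the two color strings differ."""
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]


def is_valid_adjacent(ref_colors, read_colors, i):
    """True if colors i and i+1 both differ in a way that a single base
    substitution could explain: the XOR of the pair is unchanged.
    This is an assumed form of the check, not ISAS's actual rule set."""
    if i + 1 >= len(ref_colors):
        return False
    both_differ = (ref_colors[i] != read_colors[i]
                   and ref_colors[i + 1] != read_colors[i + 1])
    same_xor = (ref_colors[i] ^ ref_colors[i + 1]) == (read_colors[i] ^ read_colors[i + 1])
    return both_differ and same_xor


reference = "ACGTAC"
snp_read = "ATGTAC"  # one base substitution (C -> T) at position 1

ref_colors = to_colors(reference)
read_colors = to_colors(snp_read)
changed = color_mismatches(ref_colors, read_colors)
print(ref_colors, read_colors, changed)
```

    A single base change touches the two colors on either side of it, which is why two SNPs can cost up to 4 color mismatches, while a lone color mismatch is more likely a measurement error.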



    • #17
      Hi snetmcom,

      Sorry if I appear "pretentious and condescending" to you. I'm just trying to state the facts (and stimulate everyone else to state the facts on their end). I know a lot of people are working hard. That's how mankind progresses (from caves to next gen sequencers - it took a lot of hard working people).
      What organization are you with ? Are you already producing hundreds of millions (or billions) of sequences ? Do you currently need a cluster (and a whole day) to do alignment ? Would it be progress if you could do it on one computer in under one hour ?
      Do you know how much pollution is caused generating electricity for all those clusters that are not needed anymore ? If people just had better software, they wouldn't need "embarrassingly parallel" solutions. It also helps to have 1 good computer instead of 100 weak (performance wise, not electricity wise) ones.

      Anyway, I think the entire community would benefit if you share with us your current situation. Thanks.



      • #18
        Originally posted by BioWizard View Post
        Sorry for not logging in here for a while... we've been overloaded recently.
        Hi,

        It's not a big surprise for such a software. I was interested in SOLID data, and you answered my possible sub-questions too, thanks!



        • #19
          We recently posted some common benchmarks for ISAS with the new Imagenix Genome Cruncher computer, side by side with a Dell server. The Genome Cruncher runs ISAS between 2 and 3 times faster. You can see at:



          We are also organizing a 1 day workshop for ISAS users (or future ISAS users), where we will instruct in installing and running ISAS, for both Illumina and ABI. Participants are encouraged to bring their own data (on DVDs or external USB disks). If you're interested, email

          [email protected]



          • #20
            Originally posted by BioWizard View Post
            We recently posted some common benchmarks for ISAS with the new Imagenix Genome Cruncher computer, side by side with a Dell server. The Genome Cruncher runs ISAS between 2 and 3 times faster. You can see at:



            We are also organizing a 1 day workshop for ISAS users (or future ISAS users), where we will instruct in installing and running ISAS, for both Illumina and ABI. Participants are encouraged to bring their own data (on DVDs or external USB disks). If you're interested, email

            [email protected]
            And here we go again. What is your accuracy (% of reads aligned correctly divided by the % of reads aligned)? What is your sensitivity (expected vs. observed)? How do you

            For color space, how do you do local alignment? Is it gapped? Otherwise, what heuristic do you use to find indels?

            Looks great so far; I am not surprised by the speed but it needs just a little more context before I would switch. If you need help with simulated datasets, let me know!



            • #21
              ISAS Benchmarks Posted

              Yes, looks like here we go again. I really can't afford to spend too much time, but I'll answer this one more time as simply as I can put it:

              If the run mode is L/M, then for every sequence in the input file, ALL sequences in the reference which have a length L and up to M substitutions compared with the input sequences are reported. The only exception, is if the number of repeats exceeds the maximum number of repeats allowed (e.g. 10 or 2 shown in the benchmarks). For example:

              "25/2 max. repeats=10"
              means Length=25 , MaxSubstitutionsAllowed=2, if there are less than 10 repeats, then all of them will be reported.

              1. There is no issue of "sensitivity". In the above example, if 3 hits
              are reported, then it is a mathematical fact that there are no
              more out there with 2 or fewer substitutions. Only if 10 hits
              are reported (equal to the max. repeats specified) do you
              know there might be more hits out there. If you care to know
              the rest, you can run with a higher repeat max.

              2. There is no "cheating". I am getting more tolerant to your skepticism,
              now that I've seen the outputs of several other (free) alignment systems.
              These are the kinds of cheating I've seen: (none of which we do)
              a. When they find one hit (which maybe they think is a "good one"),
              they stop searching, and report a unique hit, even though we
              find multiple hits for this sequence, sometimes even with the same
              number of mismatches. Then, they say they have a high percentage of
              sequences that "aligned".
              b. If they see a "difficult" sequence, they ignore it. Some call it "filtering".
              We know that by ignoring the 0.5% most difficult sequences, we would
              approximately double the speed. Maybe we will add this as an option.
              If known to the user, and actively requested, then it is not cheating.
              c. Mask out the difficult parts (repeat) of the reference. We think
              the user should have the power to decide to ignore high repeats
              by lowering the max. repeats allowed, and not be permanently
              blinded to what the vendor considers "too many repeats" (which is
              all subject to the length and number of mismatches specified anyway).
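
              The L/M reporting rule described above can be sketched as a brute-force scan. This is only an illustration of the stated semantics (report every position within M substitutions, stop at the repeat cap), not how ISAS is implemented. The names `find_hits`, `max_subs` and `max_repeats` are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read, reference, max_subs, max_repeats):
    """Return (hits, may_have_more).

    hits is a list of (position, substitutions) for every window of the
    reference within max_subs substitutions of the read, in reference order.
    The scan stops once max_repeats hits are collected; in that case
    may_have_more is True, because further hits were not searched for.
    """
    hits = []
    length = len(read)
    for pos in range(len(reference) - length + 1):
        subs = hamming(read, reference[pos:pos + length])
        if subs <= max_subs:
            hits.append((pos, subs))
            if len(hits) == max_repeats:
                return hits, True
    return hits, False


reference = "ACGTACGTTTACGA"
read = "ACGT"

# Analogue of "4/1 max. repeats=10": fewer than 10 hits, so the list is complete.
all_hits, capped = find_hits(read, reference, max_subs=1, max_repeats=10)
print(all_hits, capped)

# With max. repeats=2 the cap is reached, so more hits may exist.
some_hits, capped_low = find_hits(read, reference, max_subs=1, max_repeats=2)
print(some_hits, capped_low)
```

              When the number of hits is below the cap, the result is exhaustive by construction; only a result that hits the cap leaves open whether more matches exist.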

              ISAS has a function to generate testing data, by randomly selecting sequences from the reference and adding random "sequencing errors" (or "SNPs" if you're an optimist) up to the maximum specified. For testing, we run a billion sequences after any non-trivial source code change and compilation. Each sequence is marked with its original location. If any sequence were not found, it would have indicated a bug somewhere. I assume everyone does this.
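
              A minimal sketch of this kind of self-test follows, assuming reads are sampled uniformly from the reference and given a random number of substitutions up to the maximum. The function names and the toy reference are hypothetical; the actual ISAS generator is not shown in this thread.

```python
import random


def make_test_reads(reference, length, max_subs, count, seed=0):
    """Sample `count` reads of `length` bases from `reference`, apply between
    0 and `max_subs` random substitutions to each, and tag every read with
    the position it was taken from."""
    rng = random.Random(seed)
    reads = []
    for _ in range(count):
        origin = rng.randrange(len(reference) - length + 1)
        bases = list(reference[origin:origin + length])
        n_subs = rng.randint(0, max_subs)
        # Pick distinct positions so the substitution count never exceeds max_subs.
        for i in rng.sample(range(length), n_subs):
            bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
        reads.append((origin, "".join(bases)))
    return reads


def recoverable_at_origin(reference, origin, read, max_subs):
    """Check that the read is within max_subs substitutions of its origin."""
    window = reference[origin:origin + len(read)]
    return sum(a != b for a, b in zip(window, read)) <= max_subs


# Toy reference; a real test would use the actual genome.
ref_rng = random.Random(42)
toy_reference = "".join(ref_rng.choice("ACGT") for _ in range(500))

test_reads = make_test_reads(toy_reference, length=25, max_subs=2, count=200, seed=1)
failures = [r for r in test_reads
            if not recoverable_at_origin(toy_reference, r[0], r[1], max_subs=2)]
print(len(test_reads), len(failures))
```

              An aligner under test would be run on these reads, and any read whose tagged origin is missing from its reported hits (and whose hit list did not reach the repeat cap) would point to a bug.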

              Anyway, I really understand your skepticism now. I was amazed at the "cheating" that I saw from "famous" shareware. Maybe you had a similar experience, and became so skeptical. We don't do any cheating... our only "crime" is that we cannot give away for free.



              • #22
                Thanks for the reply. A few more things

                1.
                Originally posted by BioWizard View Post
                ...if there are less than 10 repeats, then all of them will be reported.
                What does "repeat" mean in this context? Does this mean if a 25bp read matches > 10 places with the same "best score", it is ignored? Or does this mean that if a 25bp read *could* match >10 places with up to M mismatches, it is ignored?

                2.
                Originally posted by BioWizard View Post
                There is no "cheating". I am getting more tolerant to your skepticism,
                now that I've seen the outputs of several other (free) alignment systems.
                These are the kinds of cheating I've seen: (none of which we do)
                a. When they find one hit (which maybe they think is a "good one"),
                they stop searching, and report a unique hit, even though we
                find multiple hits for this sequence, sometimes even with the same
                number of mismatches. Then, they say they have a high percentage of
                sequences that "aligned"
                I cannot speak of other aligners, but BFAST (my free aligner) doesn't have this property as an aligner, so your generalization isn't correct (didn't your mom tell you not to generalize?). Anyways, I commend you for following the same path, which reports all hits found.

                3.
                Originally posted by BioWizard View Post
                c. Mask out the difficult parts (repeat) of the reference. We think
                the user should have the power to decide to ignore high repeats
                by lowering the max. repeats allowed, and not be permanently
                blinded to what the vendor considers "too many repeats" (which is
                all subject to the length and number of mismatches specified anyway).
                This is a dangerous path, since, for example, structural variation occurs more frequently in repeat regions. Giving the user the option to try to align in more and more repetitive regions is very useful, though I think if I were to review a paper, I would ask the authors to align to the full reference, since removing repetitive regions sacrifices completeness for speed.

                Also, with such a low threshold, depending on the average # hits returned, you might be introducing false-negatives.

                4.
                Originally posted by BioWizard View Post
                ISAS has a function to generate testing data, by randomly selecting sequences from the reference and adding random "sequencing errors" (or "SNPs" if you're an optimist) up to the maximum specified. For testing, we run a billion sequences after any non-trivial source code change and compilation. Each sequence is marked with its original location. If any sequence were not found, it would have indicated a bug somewhere. I assume everyone does this.
                Some sequences should not be found with ISAS since you have a hard limit of 10 for repetitive loci. Is this not true?

                Originally posted by BioWizard View Post
                ...our only "crime" is that we cannot give away for free
                It's not a crime, we all have family.

                5.
                If I could ask you to answer one question, and one question only: do you align with indels (for both Illumina and ABI)? If not, then by your criteria you are "cheating", since the time complexity of ungapped local alignment is linear whereas with gaps it is quadratic. If so, do you do this with color space too?



                • #23
                  In the SAM output that we received from a customer (NIH) that was run using ISAS, we noticed that the names for the paired-end reads, which should be identical according to the SAM definition, contained a "/1" and a "/2" suffix to identify the reads...
                  I would think this breaks the SAM definition, since for paired reads (if you use "=" as the RNEXT string) the paired read's name should be the same as the first read's...

                  Is there a way to turn off this suffix for paired reads?
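
                  Until that is answered, one possible workaround is to post-process the SAM file. The sketch below is an assumption-laden illustration, not an ISAS feature: it relies only on QNAME being the first tab-separated field and on header lines starting with "@", and the function name `strip_pair_suffix` is made up for the example.

```python
import re

# Matches a trailing "/1" or "/2" on a read name.
_PAIR_SUFFIX = re.compile(r"/[12]$")


def strip_pair_suffix(sam_line):
    """Return the SAM line with a trailing /1 or /2 removed from QNAME.

    Header lines (starting with '@') are returned unchanged. Only the first
    tab-separated field is touched; all other fields pass through as-is.
    """
    if sam_line.startswith("@"):
        return sam_line
    qname, sep, rest = sam_line.partition("\t")
    return _PAIR_SUFFIX.sub("", qname) + sep + rest


# Hypothetical records for illustration only.
first = "read7/1\t99\tchr1\t100\t60\t25M\t=\t300\t225\tACGT\tIIII"
second = "read7/2\t147\tchr1\t300\t60\t25M\t=\t100\t-225\tACGT\tIIII"
header = "@SQ\tSN:chr1\tLN:1000"

cleaned = [strip_pair_suffix(line) for line in (header, first, second)]
print(cleaned)
```

                  After this pass both mates share one QNAME, and the first/second distinction is still carried by the FLAG field.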

                  Thanks,

                  Thon de Boer
                  Product manager for Strand, Makers of Avadis NGS
                  Thon
                  __________________________________
                  Thon de Boer, Ph.D.
                  Director of Product Management, Software
                  Strand Life Sciences
                  548 Market Street, Suite 82804
                  San Francisco, CA 94104, USA
                  [email protected]
                  www.strandls.com
                  Pioneers in Discovery Research Informatics
                  _______________________________________

