
  • How to set mismatches?

    For SOLiD color-space mapping, we usually allow 2 mismatches for 25 bp reads and 3 mismatches for 35 bp reads. With SOLiD v3, the read length goes up to 50 bp. What is the maximum number of mismatches for 50 bp? How about 75 bp? And is there a standard for setting the number of mismatches?

    Many thanks,

  • #2
    Originally posted by jlli View Post
    For SOLiD color-space mapping, we usually allow 2 mismatches for 25 bp reads and 3 mismatches for 35 bp reads. With SOLiD v3, the read length goes up to 50 bp. What is the maximum number of mismatches for 50 bp? How about 75 bp? And is there a standard for setting the number of mismatches?

    Many thanks,
    I usually aim for ~10% color error rate. Trying to map with a higher sensitivity (tolerance to errors) gives diminishing returns as well as making me think: "do I really trust reads with >10% error?".



    • #3
      One does have to allow for SNP detection, which in color space requires a two-mismatch setting. Thus 25 bp with 2 mismatches means that you will be finding SNPs only in otherwise perfectly matching reads. Certainly this can be done, but you do risk throwing away data. Since color space is robust to non-adjacent single color-space errors (i.e., such errors can be corrected for), and even adjacent double color-space errors are three-quarters robust, going to slightly larger error rates can be worthwhile.
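
To illustrate why a SNP costs two mismatches in color space, here is a minimal Python sketch. It assumes the usual di-base encoding in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names and the example sequences are made up for illustration and are not part of any SOLiD tool.

```python
# 2-bit base codes; the color between two adjacent bases is their XOR.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as a list of colors (0-3), one per adjacent base pair."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def color_mismatches(seq_a, seq_b):
    """Return the color positions where two equal-length sequences differ."""
    colors_a, colors_b = to_colors(seq_a), to_colors(seq_b)
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


reference = "ACGTACGT"
with_snp = "ACGCACGT"  # hypothetical SNP: T -> C at base index 3

print(to_colors(reference))                   # [1, 3, 1, 3, 1, 3, 1]
print(to_colors(with_snp))                    # [1, 3, 3, 1, 1, 3, 1]
print(color_mismatches(reference, with_snp))  # [2, 3]
```

A single interior base change flips the two colors that flank it, so a 2-mismatch budget is fully consumed by the SNP itself and leaves no room for a measurement error elsewhere in the read.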

      Nils' rule of thumb (~10%) is not bad but fails to come up with a firm number when the read length ends with a '5'; e.g. 25 or 35 or 75.
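
The rounding ambiguity in the ~10% rule can be made concrete with a small sketch. The helper name `mismatch_bounds` and the integer `percent` parameter are assumptions for illustration only; it simply reports the floor and ceiling of 10% of the read length.

```python
def mismatch_bounds(read_len, percent=10):
    """Return (floor, ceil) of read_len * percent / 100 using integer arithmetic."""
    scaled = read_len * percent
    low = scaled // 100
    high = -(-scaled // 100)  # ceiling division
    return low, high


for length in (25, 35, 50, 75):
    print(length, mismatch_bounds(length))
# 25 (2, 3)
# 35 (3, 4)
# 50 (5, 5)
# 75 (7, 8)
```

Only the 50 bp case gives a single answer; for 25, 35 and 75 bp the rule leaves a choice between two values.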

      Personally I used 3 mismatches for 25 bp paired end runs -- slightly higher than 10% but the pairing should take care of any problems caused by a higher number of badly placed reads; i.e., both the F3 and R3 reads have to be badly placed and within the fragment length to make the entire pair bad -- a rare occurrence.

      For 35bp fragment runs I would also use 3 mismatches. This gives me the chance to detect a SNP within a read with a single error. Since I don't have a corresponding pair as a double check I would prefer to be conservative.

      Now that we have SOLiD v.3 chemistry with 50 bp fragment and paired end runs, I am using 5 mismatches for the fragment runs and 6 mismatches for the paired end runs. If I had good references (e.g., a human run vs. the human reference -- sadly most of my non-model-organism references are poor) I could be convinced to go down to a slightly lower mismatch level for safety's sake.

      The above, obviously, is for matching to a reference and not de-novo assembly.

      ABI/Lifetech recommends 2mm for 25mers, 3mm for 35mer and 5mm for 50mer. But they mainly work with well defined model organisms.



      • #4
        Originally posted by jlli View Post
        What is the max mismatches for 50bp. How about 75bp?
        Ah, just re-read your message and saw that you asked the above specific questions. If you use the SOLiD software, the schemas they provide put limits on the mismatches. Some of the ranges of mismatches are:

        35 bp -- 0 to 8 mismatches

        50 bp -- 0 to 7 mismatches; yes, lower than 35 bp limit

        65 bp -- 0 to 7 mismatches; don't know why they do not go higher

        75 bp -- 0 to 9 mismatches

        80 bp -- either 0 or 8 mismatches

        100 bp -- either 0, 10 or 15 mismatches.


        I suspect that some of the schemas are simply there for testing. Certainly the ones above 50 are not currently supported. The idea of doing 8 mismatches on a 35 bp run would be bizarre; not to mention that such a search would take about 60 times longer than a 3 mismatch search.

        I find it interesting that a long bp run with a large number of mismatches does not always take much longer than using a short bp run. Processing a 50bp - 5 mismatch run takes just a little bit longer than a 35 bp - 3 mismatch run.

        Anyway the above is just for the SOLiD software. Other people's software that works in colorspace (e.g., Nils') may have other limitations.



        • #5
          Originally posted by westerman View Post
          Other people's software that works in colorspace (e.g., Nils') may have other limitations.
          BFAST, for example, can be tuned to be sensitive to any combination of errors or variants (indels, SNPs etc.) so by design its only limitation is speed. Other aligners, including corona or BWA, have the same ability.

          For our own internal purposes, for 50 bp data we target >90% power for 6 color errors (12% error rate). The power for 7 color errors is not zero (there is no hard limit); the power decreases gradually as the number of errors grows.

          Have you also considered searching for indels directly?



          • #6
            Thanks for the replies; they really clear up my confusion. In my study, I only want to map the reads to a reference genome so far. I used the same criteria as westerman (2 for 25 bp, 3 for 35 bp, 6 for mate-pair (25 bp for each mate)). The mismatch tolerance includes both the assumed sequencing error rate and the SNP rate. The chemistry error rate for v2 is about 0.075%; how about the sequence error rate?



            • #7
              Originally posted by jlli View Post
              Thanks for the replies; they really clear up my confusion. In my study, I only want to map the reads to a reference genome so far. I used the same criteria as westerman (2 for 25 bp, 3 for 35 bp, 6 for mate-pair (25 bp for each mate)). The mismatch tolerance includes both the assumed sequencing error rate and the SNP rate. The chemistry error rate for v2 is about 0.075%; how about the sequence error rate?
              The color error rate (chemistry error rate) varies from 1% up to 12% (after the 50th measurement). The sequence error rate depends directly on how you align and decode the read since you need to identify variants (SNPs and indels) at the same time as you identify the color errors (chemistry errors). It also depends on the polymorphism rate of what you are sequencing. Subtracting the polymorphism rate, we see <1% sequence error rate.

              Nils



              • #8
                Read the SOLiD SAT file; there is a table for progressive mapping (page 100). It shows the mismatch settings for each read length with reference to an "Effective Length". The lowest "Effective Length" should be greater than 18 to avoid random-match errors.
                You can choose the setting by following the table.



                • #9
                  Originally posted by nilshomer View Post
                  The color error rate (chemistry error rate) varies from 1% up to 12% (after the 50th measurement). The sequence error rate depends directly on how you align and decode the read since you need to identify variants (SNPs and indels) at the same time as you identify the color errors (chemistry errors). It also depends on the polymorphism rate of what you are sequencing. Subtracting the polymorphism rate, we see <1% sequence error rate.

                  Nils
                  You gents are using the q = -10 log(m/n) equation for this, correct?
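
For readers unfamiliar with the equation in the question, here is a minimal sketch of it. It assumes the logarithm is base 10 (as in Phred-style quality scores) and that m is the number of mismatches and n the number of bases or colors compared; neither assumption is confirmed in the thread, and the function name is hypothetical.

```python
import math


def quality_from_counts(mismatches, total):
    """Compute q = -10 * log10(m / n) for m mismatches out of n compared positions."""
    if mismatches <= 0 or total <= 0:
        raise ValueError("mismatches and total must both be positive")
    return -10 * math.log10(mismatches / total)


# Assumed examples: a 1% and a 12% error rate.
print(round(quality_from_counts(1, 100), 2))
print(round(quality_from_counts(12, 100), 2))
```

Under these assumptions a 1% error rate corresponds to q of about 20, and a 12% error rate to q of about 9.2.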



                  • #10
                    Originally posted by JDS View Post
                    Read the SOLiD SAT file; there is a table for progressive mapping (page 100). It shows the mismatch settings for each read length with reference to an "Effective Length". The lowest "Effective Length" should be greater than 18 to avoid random-match errors.
                    You can choose the setting by following the table.
                    Where could I obtain the SOLiD SAT file? I went to ABI's website and searched, to no avail.



                    • #11
                      Nils,

                      I am using BFAST to align Illumina reads that are 100 bp long. What is the exact number of mismatches that BFAST allows? From your post, it seems that the number of mismatches is 12 (12% after 50 bp). Please let me know if I am wrong.



                      • #12
                        There is no minimum or maximum # of mismatches allowed. It is probabilistic in nature.



                        • #13
                          Nils,

                          I mapped 60 million Illumina reads (101 bp) to the mouse genome. I found that the maximum number of mismatches is 21. Therefore, the ratio is 21%.

                          Could you explain a little more about the probabilistic nature?



                          • #14
                            I would recommend reading the paper.

                            It all depends on where the errors occur in the reads and the set of masks you use for your indexes. Given a certain error rate, the masks will be able to find a k-mer that contains no errors with a certain probability over the space of possible error configurations. Also factor in the repetitiveness of that k-mer.
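
The idea of "finding an error-free k-mer with a certain probability" can be sketched with a small Monte Carlo simulation. This is not BFAST's actual implementation: the masks, the uniform placement of errors, and the function names are all assumptions chosen for illustration, and the repetitiveness of the k-mer is ignored.

```python
import random


def mask_hits(mask, read_len, error_positions):
    """True if the mask, at some offset, samples only error-free read positions."""
    sampled = [i for i, c in enumerate(mask) if c == "1"]
    for offset in range(read_len - len(mask) + 1):
        if all((offset + i) not in error_positions for i in sampled):
            return True
    return False


def estimate_power(masks, read_len, n_errors, trials=2000, seed=0):
    """Estimate the fraction of random error configurations caught by any mask."""
    rng = random.Random(seed)
    found = 0
    for _ in range(trials):
        # Place n_errors uniformly at random along the read (an assumption).
        errors = set(rng.sample(range(read_len), n_errors))
        if any(mask_hits(m, read_len, errors) for m in masks):
            found += 1
    return found / trials


# Hypothetical masks: one contiguous 18-mer and one spaced seed ("0" = ignored).
masks = ["1" * 18, "111111000111111000111111"]
for n_errors in (0, 3, 6, 10):
    print(n_errors, estimate_power(masks, read_len=50, n_errors=n_errors))
```

In this toy model the power falls off gradually as errors are added rather than dropping to zero at a fixed count, which is why no hard mismatch limit can be quoted.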
