  • Pindel Algorithm Explanation

    Hey,

    I am looking at the Kai Ye paper on Pindel:



    and am not sure what some of the algorithm is actually doing. Specifically, I mean the numbered steps in section 2.3 for finding large deletions:



    (1) Read in the location and the direction of the mapped read from the mapping result obtained in the preprocessing step;

    (2) Define 3' end of the mapped read as the anchor point;

    (3) Use pattern growth algorithm to search for minimum and maximum unique substrings from the 3′ end of the unmapped read within the range of two times of the insert size from the anchor point;

    (4) Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read within the range of read length + Max_D_Size starting from the already mapped 3′ end of the unmapped read obtained in step 3;

    (5) Check whether a complete unmapped read can be reconstructed combining the unique substrings from 5′ and 3′ ends found in steps 3 and 4. If yes, store it in the database U. Note that exact matches and complete reconstruction of the unmapped read are required so that neither gap nor substitution is allowed.
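
As a rough illustration of step (3), and not Pindel's actual implementation, the sketch below grows the 3' suffix of a read one base at a time and records the shortest and longest suffix that occurs exactly once in a search window. The names `find_all` and `unique_suffix_range` are made up for this example, and `window` stands in for the slice of reference within two times the insert size of the anchor point. The toy sequences are the ones used in the worked example in post #7 of this thread.

```python
def find_all(text, pattern):
    """Return the start of every (possibly overlapping) occurrence of pattern."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def unique_suffix_range(window, read):
    """Grow the 3' suffix of `read` base by base (hypothetical helper).

    Returns (min_len, max_len, pos): the shortest and longest suffix lengths
    that match exactly once in `window`, and the start of the longest match.
    All three are None if no suffix is ever unique.
    """
    min_len = max_len = pos = None
    for length in range(1, len(read) + 1):
        hits = find_all(window, read[-length:])
        if not hits:
            break  # a longer suffix cannot match if this one does not
        if len(hits) == 1:
            if min_len is None:
                min_len = length
            max_len = length
            pos = hits[0]
    return min_len, max_len, pos


window = "GCACATATATATGGAAC"  # toy reference window
read = "GCACATATATGGAAC"      # toy read carrying a 2-base deletion
print(unique_suffix_range(window, read))  # (3, 11, 6)
```

Only exact matches are counted, in line with the note in step (5) that neither gaps nor substitutions are allowed. Step (4) would presumably repeat the same growth from the 5' end, over a different window.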



    First, I am not sure about the geometry of (3): searching for substrings from the 3' end of the read within twice the insert size of the anchor point.

    Specifically:

    How does one search for substrings from the 3' end of a read? Surely this is the end of the sequence.
    It seems as though the insert size is the average size of insertions, but it is not clear that this is what was meant.

    Does anyone have any intuition on this paper / the method used?

    Cheers!

  • #2
    DNA fragments are double-stranded, and each read will map to one of the two strands. DNA is synthesized from 5' to 3'. Please google "DNA strand"; I have put one search result below, although it might not be clear. You could also search YouTube for videos about Illumina's sequencing technology to learn more.

    Paired-end reads in Illumina (Solexa) sequencing:

    5'                                 3'
    --------------->
    ____________________________________ DNA
    ____________________________________
    3'                                 5'
                       <-----------------



    • #3
      I understand what 3' and 5' are, but am just finding the wording quite vague.

      Assuming the unmapped read is the right-hand read, the layout seems to be this:

      _________3' (mapped read)------------------___________3'(unmapped read)


      but to me it is really not clear what domain you search in from the subsequent text. Presumably, you must search backwards towards the mapped read?

      Cheers,



      • #4
        Please check my slides at http://www.ebi.ac.uk/~kye/pindel/pin...009_june28.ppt

        There is one animation in the slides about the mapping procedure. Let me know if you have any questions after going through them.



        • #5
          Thanks very much! I will check this out early next week.



          • #6
            Ok Kai, I have looked through the presentation and understand it better, but I am still not 100% sure about the process.

            Here is how I think the geometry is working, and would really appreciate your input on the correctness of this.

            3) Basically, you run the algorithm to find the substrings on the reference. The part I am not sure about is the region of the reference that you use as the sequence database. I think it runs from the 3' end of the mapped read to 3' + 2 * the average spacing between the paired-end reads (is this what you mean by average insert length?). I am afraid I am not sure why you chose this number. Is it a heuristic?

            From this you can obtain the locations of minimum and maximum substrings on the reference. In the case of deletions (with the break point located within the read), you would not expect the maximum substring to span the length of the read, as the read is missing letters.

            Now that you have marked the maximum unique substring, you can start looking for the other piece of the read.

            4) From this point on the reference, you can run the pattern growth algorithm again to hopefully find the other matching section. I think this is fairly self-explanatory, as the search region is just a user-controlled parameter which you may want to adjust based on the sensitivity calculations you do later on.


            As a final question, what is the relevance of finding the minimum substring? I would have thought that finding two maximum substrings would be sufficient. Maybe this becomes more obvious when you try to implement it, but I feel like I am missing a subtlety here.


            Thank you very much for taking the time to read this, Cheers!



            • #7
              Choosing 2 times the insert size is a heuristic. Together, the maximum and minimum substrings define the range over which the read can be split while still mapping back to the reference genome correctly. Because of local repeats at and around the breakpoints, there can be more than one way to split the read and align the two fragments to the reference genome.

              Please consider the following case:

              reference seq
              GCACATATATATGGAAC

              read seq
              GCACATATATGGAAC

              the split read solution space
              GCAC__ATATATGGAAC
              GCACA__TATATGGAAC
              GCACAT__ATATGGAAC
              GCACATA__TATGGAAC
              GCACATAT__ATGGAAC
              GCACATATA__TGGAAC
              GCACATATAT__GGAAC

              We often use the first one as the correct solution, so that the variant is left-aligned.
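
The solution space above can be reproduced with a small brute-force sketch. This is only an illustration under simplifying assumptions, not how Pindel does it: the hypothetical helper `deletion_splits` assumes the read covers the whole toy reference apart from a single deletion, and it only accepts exact matches.

```python
def deletion_splits(reference, read):
    """Return every left-fragment length k for which deleting the bases
    right after reference[:k] reproduces the read exactly."""
    gap = len(reference) - len(read)  # number of deleted bases
    if gap <= 0:
        return []
    return [k for k in range(len(read) + 1)
            if reference[:k] + reference[k + gap:] == read]


reference = "GCACATATATATGGAAC"
read = "GCACATATATGGAAC"
splits = deletion_splits(reference, read)
gap = len(reference) - len(read)

# Print each split with the deleted bases shown as underscores.
for k in splits:
    print(read[:k] + "_" * gap + read[k:])

# The smallest k is the left-aligned placement of the deletion.
print("left-aligned split after base", splits[0])
```

For these sequences the valid left-fragment lengths are 4 through 10, which gives the seven rows listed above, and taking the smallest one corresponds to left-aligning the variant.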



              • #8
                I see, thank you very much. So by always taking the minimum for the 5' end and the maximum for the 3' end, you can avoid this problem. And just to check: does the average insert length mean the average distance between paired reads?

                Cheers!
