I am pretty new to NGS and PacBio in particular, so I may be missing something fundamental. My question is about the LSC PacBio correction algorithm. Figure 1 of the paper below briefly explains how the data are transformed before mapping and correction.
Basically, runs of identical adjacent bases are condensed down to a single base. One of the examples given in the paper is:
CCTAGTTACCGAT --> CTAGTACGAT
Then the condensed versions of the short reads are mapped to the condensed versions of the PacBio reads, after which the PacBio reads are corrected. Only after errors are detected are the reads decompressed. My question is: how can many of the possible errors be detected in this condensed form?
Consider these modifications to the sample sequence:
CCTAGTACCGAT   --> CTAGTACGAT (single deletion)
CCTAGTTTACCGAT --> CTAGTACGAT (single insertion)
CCTAGTAACCGAT  --> CTAGTACGAT (single substitution)
All of these erroneous sequences result in exactly the same compressed form, and the number of sequences that reduce to a given condensed form is astronomical. How are all those possible error combinations detected?
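To make the point concrete, here is a minimal sketch of the homopolymer compression step (this is my own illustration, not LSC's actual code), showing that all three modified sequences collapse to the same string as the original:

```python
def compress(seq):
    # Collapse each run of identical adjacent bases to a single base
    out = [seq[0]]
    for base in seq[1:]:
        if base != out[-1]:
            out.append(base)
    return "".join(out)

original = "CCTAGTTACCGAT"
variants = [
    "CCTAGTACCGAT",   # single deletion
    "CCTAGTTTACCGAT", # single insertion
    "CCTAGTAACCGAT",  # single substitution
]

# All four sequences compress to the same string
print(compress(original))
print({compress(s) for s in variants})
```

Running this confirms that the deletion, insertion, and substitution variants are all indistinguishable from the original after compression.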