Please post the typical data-quality issues you know of with PacBio sequencing data (compared with other sequencing technologies), specifically from the RS using C2 chemistry.
For example, I know that 454 is pretty horrendous when it comes to short mononucleotide repeats (4-10 bases) due to in-well/on-bead PCR and a single combined signal, whereas Sanger, using ddNTP terminators, typically gives well-separated peaks (at least for 400 or so bases). When I look at my PacBio data, I often see a disparity in repeat length between CCS reads and the "filtered" (and "error corrected") reads: the CCS reads frequently have a repeat that is 1 base longer than in MOST of the filtered and EC'd reads. Often these appear to be in runs of "T".
Error correction is being performed on the filtered data using PacBioToCA with the 454 data as the reference dataset. At the locations where I have an issue with repeat length, most (but not all) of the 454 reads have the longer length, whereas the filtered (but not CCS) reads are shorter by one base. I have not ruled out that PacBioToCA is introducing a bias toward shorter repeats.
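To make the repeat-length comparison concrete, here is a minimal Python sketch of how the run lengths could be tallied per read set (CCS, filtered, EC'd, 454). The function names and the flanking sequences are hypothetical, and it assumes the bases on either side of the repeat are read without error, which will often not hold for raw PacBio reads; reads whose flanks do not match exactly are simply skipped.

```python
import re
from collections import Counter


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every mononucleotide run >= min_len."""
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in re.finditer(r"([ACGT])\1*", seq.upper())
        if len(m.group(0)) >= min_len
    ]


def run_length_between(read, left_flank, right_flank, base):
    """Length of the run of `base` between two exact flanks, or None if absent."""
    pattern = re.escape(left_flank) + "(" + base + "*)" + re.escape(right_flank)
    m = re.search(pattern, read.upper())
    return len(m.group(1)) if m else None


def tally_run_lengths(reads, left_flank, right_flank, base):
    """Count how many reads support each run length at one locus."""
    lengths = (run_length_between(r, left_flank, right_flank, base) for r in reads)
    return Counter(n for n in lengths if n is not None)


# Toy reads (made up for illustration, not real data).
ccs_reads = ["ACGTTTTTGCA", "ACGTTTTTGCA"]
filtered_reads = ["ACGTTTTGCA", "ACGTTTTGCA", "ACGTTTTTGCA"]
print(tally_run_lengths(ccs_reads, "ACG", "GCA", "T"))
print(tally_run_lengths(filtered_reads, "ACG", "GCA", "T"))
```

Running the same tally on the reads before and after PacBioToCA would be one way to check whether the correction step itself shifts the distribution toward shorter runs.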
(edit: http://seqanswers.com/forums/showthread.php?t=30820 says "3" for my following question, which is now in italics)
How many times does a circular piece of DNA have to be sequenced for the PacBio algorithms to declare it as a CCS read? Is just partial overlap needed, or does the whole circle need to be sequenced many times? In other words, do I trust each CCS read about 3-5x more than an individual other filtered read? Or is the quality non-linear with respect to coverage?
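As a rough way to think about the linear vs non-linear question, here is a toy Python model, not PacBio's actual CCS algorithm: it assumes each pass miscalls a given base independently with the same probability and that the consensus is a simple majority vote. The 15% per-pass error rate is an illustrative assumption, and systematic errors (such as a consistent homopolymer undercall) would violate the independence assumption.

```python
from math import comb


def majority_vote_accuracy(passes, per_pass_error):
    """Probability that a strict majority of independent passes is correct.

    Ties (possible when `passes` is even) are counted as failures.
    """
    p = 1.0 - per_pass_error
    return sum(
        comb(passes, k) * p**k * per_pass_error ** (passes - k)
        for k in range(passes // 2 + 1, passes + 1)
    )


# Assumed per-pass error rate, for illustration only.
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_vote_accuracy(n, 0.15), 5))
```

Under these assumptions the per-base accuracy rises quickly and non-linearly with the number of passes, so a 3-pass consensus would not simply be "3x as trustworthy" as a single read.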
Is there a sweet-spot for accuracy when it comes to total length in PacBio? Is the middle 50% more accurately read than the 25% on each "end"? Is my observation that SN-indels occur mostly with a "T" accurate? Can I use that information to bias my base call from the reverse direction (which should have a run of "A"s)?
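Both questions above can be checked empirically. The Python sketch below is only an illustration with hypothetical function names: it reverse-complements a read so a run of "T" can be compared against the run of "A" seen from the other strand, and it bins disagreements along an already-aligned read/reference pair (equal-length strings with "-" for gaps, as produced by whatever aligner you use) to see whether the ends are worse than the middle.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def error_rate_by_bin(aligned_read, aligned_ref, bins=4):
    """Fraction of disagreeing alignment columns in each of `bins` equal slices.

    Both inputs must be gapped alignment strings of the same length.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    n = len(aligned_read)
    errors = [0] * bins
    totals = [0] * bins
    for i, (a, b) in enumerate(zip(aligned_read, aligned_ref)):
        k = min(i * bins // n, bins - 1)
        totals[k] += 1
        if a != b:
            errors[k] += 1
    return [e / t if t else 0.0 for e, t in zip(errors, totals)]


# A run of T on one strand reads as a run of A on the other.
print(reverse_complement("GATTTTTC"))
# Toy alignment with a single mismatch in the second quarter.
print(error_rate_by_bin("ACGTACGT", "ACGAACGT"))
```

If the "T" undercall really is base-specific, reads sequenced from the opposite strand should show the correct length as a run of "A" at the same locus; if they are also short, the problem is more likely context-dependent than base-dependent.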
Please let me know of any other systematic biases that you know of when it comes to PacBio data.
PacBio Advantages vs 454/2nd gen
Longer reads
No pre-sequencing amplification step
Less subject to large indels
PB Disadvantages
Lower individual read accuracy - more read-to-read variability
More algorithmically intense to assemble