Hi all,
I think the thread title says it all, but here are a few more details about my problem:
I have used PacBio sequence capture on plant samples and, when looking at my BAM files (PacBio reads mapped onto my reference genome), I can see quite a lot of duplicate reads (starting and ending at the same positions).
I know how to remove duplicate reads when they are Illumina reads (using samtools rmdup) but not when they are PacBio reads. As I understand it, this is because samtools rmdup considers reads as duplicates when they have the same start and end positions and when their sequences are identical.
However, because sequencing errors are more frequent in PacBio reads than in Illumina reads, the second criterion (identical sequences) usually fails even for real duplicates.
What I would like is a program or script that looks for reads with the same mapping position (same start and end) and fewer than X% mismatches between them, and then keeps only one of them (maybe the one that is most similar to the reference?). However, my search has not turned up anything so far.
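To make the idea concrete, here is a rough Python sketch of the logic I have in mind. It is not an existing tool: the `Read` record, the `divergence` and `dedup` functions, and the 2% default threshold are all made up for illustration. It assumes the alignments have already been parsed out of the BAM file into plain records, with the edit distance to the reference (for example the NM tag) available for each read.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Read:
    """Hypothetical minimal alignment record (not a pysam/samtools type)."""
    name: str
    chrom: str
    start: int   # leftmost reference position of the alignment
    end: int     # rightmost reference position of the alignment
    strand: str  # "+" or "-"
    seq: str     # read sequence, in the same orientation for all reads
    nm: int      # edit distance to the reference (e.g. the NM tag)


def divergence(a: str, b: str) -> float:
    """Approximate fraction of differing bases between two read sequences."""
    # autojunk=False matters for DNA: with only four letters, the default
    # heuristic would treat every base as "junk" in sequences of 200+ bases.
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def dedup(reads, max_divergence=0.02):
    """Collapse reads that share a mapping position and are nearly identical."""
    # Step 1: group reads by mapping position and strand.
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.start, r.end, r.strand)].append(r)

    kept = []
    for group in groups.values():
        representatives = []
        # Step 2: visit reads closest to the reference first, so the read
        # kept for each cluster is the one most similar to the reference.
        for r in sorted(group, key=lambda read: read.nm):
            # Step 3: keep a read only if it differs by more than the
            # threshold from every read already kept at this position.
            if all(divergence(r.seq, rep.seq) > max_divergence
                   for rep in representatives):
                representatives.append(r)
        kept.extend(representatives)
    return kept
```

Calling `dedup(reads, max_divergence=0.02)` would return the reads to keep. The pairwise comparison is quadratic in read length and in the number of reads per position, so this is only meant to show the logic, not to be fast enough for real data.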
Do you know of anything like this?
Many thanks!
Agathe
P.S. The PacBio reads are CCS reads (each one is the consensus of at least 3 subreads).