Hi all,
I have a question about de-duplicating bisulfite sequencing data to remove PCR artifacts.
There are two main methods for doing this.
1. The Lister method. This method considers mapped reads with an identical 5' end (start site) to be "clonal" reads; it keeps only the one with the highest sequencing quality score and excludes the others.
For example:
ATATATCGTAGTGGACCGTAACTGACGTTTTCAGC
-------------------
------------------
-----------------
----------------
2. The method used in deduplicate_bismark_alignment_output.pl of the Bismark package.
On line 163 of this Perl file: my $composite = join (":",$strand,$chr,$start,$end);
If I understand correctly, this means that reads with identical start and end sites are considered duplicates.
For example,
ATATATCGTAGTGGACCGTAACTGACGTTTTCAGC
----------------
----------------
----------------
----------------
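To make the difference between the two keys concrete, here is a minimal Python sketch of my understanding. It is not the Lister pipeline or the Bismark script itself. The Read record, its field names, the coordinates and the quality values are all made up for illustration. I also assume forward-strand reads only, and that the second method keeps the first read seen for each key.

```python
from collections import namedtuple

# Hypothetical read record; field names and values are illustrative only.
Read = namedtuple("Read", "name strand chrom start end qual")

def dedup_by_start(reads):
    """Method 1 sketch: one read per (strand, chrom, start), keeping the highest quality."""
    best = {}
    for r in reads:
        key = (r.strand, r.chrom, r.start)
        if key not in best or r.qual > best[key].qual:
            best[key] = r
    return list(best.values())

def dedup_by_start_end(reads):
    """Method 2 sketch: one read per strand:chrom:start:end key (assumed: first seen is kept)."""
    seen = {}
    for r in reads:
        key = ":".join(map(str, (r.strand, r.chrom, r.start, r.end)))
        seen.setdefault(key, r)
    return list(seen.values())

# Four reads with the same start but different ends, as in the first diagram.
reads = [
    Read("r1", "+", "chr1", 100, 119, 30),
    Read("r2", "+", "chr1", 100, 118, 35),
    Read("r3", "+", "chr1", 100, 117, 28),
    Read("r4", "+", "chr1", 100, 116, 33),
]
print([r.name for r in dedup_by_start(reads)])      # ['r2']
print([r.name for r in dedup_by_start_end(reads)])  # ['r1', 'r2', 'r3', 'r4']
```

With these made-up reads, the start-only key collapses all four reads into one, while the start-and-end key treats them as four distinct fragments.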
However, neither of these two methods considers the mapping status.
For example, can we consider the four reads below as duplicates?
ATATATCGTAGTGGACCGTAACTGACGTTTTCAGC
-----C----------
-----T----------
-----T----------
-----C----------
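As a rough illustration of what I mean, here is a small Python sketch that groups reads either by position alone or by position plus the base observed at the cytosine. The read tuples, the coordinates and the group_reads helper are hypothetical and exist only for this example. I am not suggesting that either tool works this way.

```python
from collections import defaultdict

# Hypothetical reads: (name, strand, chrom, start, end, base observed at the cytosine)
reads = [
    ("r1", "+", "chr1", 100, 115, "C"),
    ("r2", "+", "chr1", 100, 115, "T"),
    ("r3", "+", "chr1", 100, 115, "T"),
    ("r4", "+", "chr1", 100, 115, "C"),
]

def group_reads(reads, use_call):
    """Group read names by position, optionally adding the observed base to the key."""
    groups = defaultdict(list)
    for name, strand, chrom, start, end, call in reads:
        key = (strand, chrom, start, end) + ((call,) if use_call else ())
        groups[key].append(name)
    return dict(groups)

by_position = group_reads(reads, use_call=False)
by_position_and_call = group_reads(reads, use_call=True)
print(len(by_position))           # 1
print(len(by_position_and_call))  # 2
```

With a position-only key, the four reads fall into a single group and only one would survive. If the observed base is added to the key, they split into a C group (r1, r4) and a T group (r2, r3).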
Thanks.
Jerry