Hi guys,
Our HiSeq data is showing a large number of duplicate sequences. I've come across tools like Picard MarkDuplicates and samtools rmdup that remove duplicates, but they seem to require alignment to a reference genome and use position information to perform the removal.
Is there some way of removing duplicates without aligning to a reference (since we don't have one)? A naive pairwise comparison of all sequences against each other would probably take too much time, and it would not account for localized errors either, correct? Should I use a hash table to store all the sequences and then do a constant-time lookup for each sequence? Or am I missing an easier way of doing this?
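To make the hash table idea concrete, here is a rough sketch of what I have in mind. It assumes single-end reads in plain 4-line FASTQ records, and it only catches exact duplicates, so reads that differ by a single sequencing error would still be kept. The function name `dedup_fastq` and the demo reads are made up for illustration. Storing an MD5 digest instead of the full read is just a guess at keeping memory down.

```python
import hashlib
import io


def dedup_fastq(lines):
    """Yield 4-line FASTQ records, keeping the first record seen for each sequence.

    Assumes well-formed, unwrapped FASTQ (exactly 4 lines per record).
    Only exact sequence matches are treated as duplicates.
    """
    seen = set()   # digests of sequences already emitted
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            seq = record[1]
            # A 16-byte digest is smaller than a ~100 bp read; collisions are
            # theoretically possible but extremely unlikely.
            key = hashlib.md5(seq.encode("ascii")).digest()
            if key not in seen:       # average constant-time set lookup
                seen.add(key)
                yield tuple(record)
            record = []


# Hypothetical demo input: r1 and r2 carry the same sequence.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n"
for rec in dedup_fastq(io.StringIO(demo)):
    print("\n".join(rec))
```

For a real file I would pass an open file handle instead of the `io.StringIO` object. Paired-end data would presumably need the key built from both mates, which this sketch does not do.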
Thanks!