Hey all,
I ask because I'm trying to set up a fully automated pipeline where I anticipate having to merge data files in the future. If we sequence the same library more than once, we want to merge and then remove duplicates. If we sequence the same sample in different libraries, we want to remove duplicates and then merge.
But what if we sequence a sample once with one library, make a new library, remove duplicates from each and merge them, and then sequence that second library again? Since the first two libraries are already mixed in the merged file, we can't just merge the second data set with the third, remove duplicates, and then merge the result with the first, unless all of the individual files are kept (of course, if the original sequence files are still available, everything can be reanalyzed from scratch).
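To make the rule concrete, here is a rough Python sketch of the ordering I have in mind, assuming the per-run files are all still around. The function name `plan_merge`, the file names, and the library IDs are made up for illustration, and it only builds a list of steps rather than calling any real tool.

```python
from collections import defaultdict


def plan_merge(runs):
    """Plan merge/dedup steps for one sample.

    runs: list of (file_name, library_id) tuples, one per sequencing run.
    Returns a list of (action, input_files, output_file) tuples.
    All names here are hypothetical placeholders.
    """
    # Group the per-run files by the library they came from.
    by_library = defaultdict(list)
    for file_name, library_id in runs:
        by_library[library_id].append(file_name)

    steps = []
    deduped = []
    for library_id in sorted(by_library):
        files = sorted(by_library[library_id])
        if len(files) > 1:
            # Same library sequenced more than once: merge first.
            merged = f"{library_id}.merged.bam"
            steps.append(("merge", files, merged))
        else:
            merged = files[0]
        # Remove duplicates once per library, after any within-library merge.
        out = f"{library_id}.dedup.bam"
        steps.append(("dedup", [merged], out))
        deduped.append(out)

    # Different libraries: merge only after each has been deduplicated.
    if len(deduped) > 1:
        steps.append(("merge", deduped, "sample.bam"))
    return steps


# The awkward case: libA sequenced once, libB sequenced twice.
runs = [("run1.bam", "libA"), ("run2.bam", "libB"), ("run3.bam", "libB")]
for step in plan_merge(runs):
    print(step)
```

This only works because the plan starts over from the individual run files each time a new run arrives, which is exactly the "keep everything" requirement I'd like to avoid.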
If Mark Duplicates handled different read groups (or libraries) separately within one BAM file, this wouldn't be an issue, as long as each library was given a different read group (or library tag).
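The behavior I'm hoping for would look something like this toy Python model. It is only a sketch of the idea, not how any actual tool works: the read fields and the function name `mark_duplicates_per_library` are my own assumptions, and real duplicate marking considers more than a single position and strand.

```python
def mark_duplicates_per_library(reads):
    """Flag duplicates separately within each library.

    reads: list of dicts with keys 'name', 'library', 'chrom', 'pos', 'strand'.
    Returns the set of read names flagged as duplicates; the first read
    seen for each (library, chrom, pos, strand) key is kept.
    This is a simplified illustration, not a real tool's algorithm.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        # Including the library in the key means reads from different
        # libraries never count as duplicates of each other.
        key = (read["library"], read["chrom"], read["pos"], read["strand"])
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates


reads = [
    {"name": "r1", "library": "libA", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"name": "r2", "library": "libA", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"name": "r3", "library": "libB", "chrom": "chr1", "pos": 100, "strand": "+"},
]
print(mark_duplicates_per_library(reads))  # {'r2'}
```

Here `r2` is flagged because it matches `r1` within libA, while `r3` is kept even though it sits at the same position, since it comes from libB.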
So, that's where my question is coming from. Hopefully this situation won't arise but it'd be awesome if I could magically anticipate and build in code to handle every possible situation.