(Sorry for the long thread; complicated and interesting experiment that needs some explaining. Thanks for reading!)
A couple of colleagues have recently come to the conclusion that they might have some use for sequencing in their experiments, which got me thinking how that would actually work. The types of experiments they are doing are non-bioinformatic, and I don't really know the details of them (being the only bioinformatician in an otherwise protein technology lab).
As far as I understand it, they are (most often) interested in selecting the best binder for gene X, the binder being an antibody or an alternative scaffold (for which they know the sequence, and which can be produced in E. coli by transforming the cells with a plasmid carrying that sequence). In order to try to get a better binder than they currently have, they randomly and/or deliberately change a number of amino acid positions in (most often) the binding site of the scaffold, which produces the library. The library is produced in E. coli, followed by several rounds of selection, where only the best binders (using various criteria for what "best" means) are kept. In the end, they have a population of cells that, hopefully, produces at least one better binder than they started with. And, hopefully, most of the binders will have converged into one or a few highly similar sequences, indicating that that one is, in fact, the very best they could get. I'm sure I'm getting some of this wrong, but hopefully you get the general gist of it.
They are interested in knowing the amino acid composition at the different positions as the library goes through the selection process, in order to follow which positions/amino acids are important. For example, position X might start out as 100 % Gly (non-mutated), then shift to 75/25 Gly/Asp, 50/50 Gly/Asp, and finally 100 % Asp over the successive selection rounds. What they have always done is simply pick around 100 E. coli colonies and send them off for Sanger sequencing, and hope that what they sampled is more or less representative (if it isn't, it's not the end of the world, since what actually matters is the binder at the end and whether it binds better, as measured by downstream experiments).
They are also interested in the proportion of each sequence in the pool of sequences in each selection round. For example, they have a binder that is 300 amino acids long, and want to know how many different variants of this sequence exist in each selection round. The idea is to follow the best binder as it increases in proportion compared to the lesser binders.
Somebody said, "Why don't we send it off for high-throughput sequencing instead?" They talked to another bioinformatician they are working with, and it seems they're on their way. It got me thinking, though... how would you do this? I have some ideas, but would love to hear what you guys think!
I'm thinking that you probably wouldn't have to do any kind of alignment, and that you'd only need to count raw reads. For the first part (amino acid composition), you'd need to know where the actual sequence starts, but that should be doable by just looking at the adapters and starting with position 1 straight afterwards. Then it becomes a simple counting problem, iterating over every read and adding up the amino acids and/or nucleotides as desired. Problems would arise if the binder sequence is longer than your read length. Maybe create some custom primers that start sequencing at a specific part of the binder sequence, thus covering the whole sequence?
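To make the counting idea concrete, here is a rough sketch of what I have in mind (Python, using Biopython). It assumes the reads have already been adapter-trimmed so that every read starts in frame at nucleotide 1 of the binder ORF; the file name is just a placeholder.

```python
# Rough sketch: per-position amino acid composition from adapter-trimmed reads.
# Assumes every read starts in frame at position 1 of the binder ORF.
# "round1_trimmed.fastq.gz" is a placeholder file name.
import gzip
from collections import Counter

from Bio import SeqIO
from Bio.Seq import Seq


def position_composition(fastq_path):
    """Return a list of Counters, one per amino acid position."""
    per_position = []
    with gzip.open(fastq_path, "rt") as handle:
        for record in SeqIO.parse(handle, "fastq"):
            nt = str(record.seq)
            nt = nt[: len(nt) - len(nt) % 3]  # trim to whole codons
            aa = str(Seq(nt).translate())
            for i, residue in enumerate(aa):
                if i == len(per_position):
                    per_position.append(Counter())
                per_position[i][residue] += 1
    return per_position


if __name__ == "__main__":
    for i, counter in enumerate(position_composition("round1_trimmed.fastq.gz"), start=1):
        total = sum(counter.values())
        top = ", ".join(f"{aa} {n / total:.1%}" for aa, n in counter.most_common(3))
        print(f"position {i}: {top}")
```

Running something like this on each selection round and comparing the per-position tables would give exactly the kind of Gly/Asp trajectories described above.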
The second part (proportion of unique sequences) I'm not so sure about. You'd need to do some kind of alignment, but disallowing any mismatches. Again, sequences longer than the reads would mess it up, I'm guessing... You'd need to build full sequences from shorter reads, but it's not like the reads come from different genes; a lot of reads are going to be really, really similar, possibly differing in only 1-3 nucleotides, depending on the design of the binder library. This, I feel, is a more difficult problem, but maybe there's a simple solution I'm just not seeing?
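For this second part, my first instinct would be to skip alignment entirely and just collapse identical reads (dereplication), assuming the variable region fits within a single read or a merged read pair (read-merging tools like FLASH or PEAR could handle the pairing). A sketch of that idea is below, again with a placeholder file name; note that sequencing errors will inflate the number of apparent variants, so some error filtering or clustering would probably be needed on top of this.

```python
# Rough sketch: proportion of each unique (exact) sequence in a selection round.
# Assumes the variable region is fully covered by one read (or a merged read pair)
# and has been trimmed to the same boundaries in every read.
# "round1_merged.fastq.gz" is a placeholder file name.
import gzip
from collections import Counter

from Bio import SeqIO


def variant_proportions(fastq_path):
    """Collapse identical reads and return (sequence, count, fraction) tuples."""
    counts = Counter()
    with gzip.open(fastq_path, "rt") as handle:
        for record in SeqIO.parse(handle, "fastq"):
            counts[str(record.seq)] += 1
    total = sum(counts.values())
    return [(seq, n, n / total) for seq, n in counts.most_common()]


if __name__ == "__main__":
    # Print the ten most abundant variants and their share of the pool.
    for seq, n, frac in variant_proportions("round1_merged.fastq.gz")[:10]:
        print(f"{n}\t{frac:.2%}\t{seq}")
```

Tracking the fraction of the top variant across rounds would then show the best binder taking over the pool, which is what they want for the second question. But I may well be missing a subtlety, hence the post.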