  • Questions about sequencing a selection library

    (Sorry for the long thread; complicated and interesting experiment that needs some explaining. Thanks for reading!)

    A couple of colleagues have recently come to the conclusion that they might have some use for sequencing in their experiments, which got me thinking how that would actually work. The types of experiments they are doing are non-bioinformatic, and I don't really know the details of them (being the only bioinformatician in an otherwise protein technology lab).

    As far as I understand it, they are (most often) interested in selecting the best binder for gene X, the binder being an antibody or an alternative scaffold (for which they know the sequence, and which can be produced in E. coli through transformation with a plasmid carrying the sequence). In order to try to get a better binder than they currently have, they randomly and/or deliberately change a number of amino acid positions in (most often) the binding site of the scaffold, which produces the library. The library is produced in E. coli, followed by several rounds of selection, where only the best binders (using various criteria for what "best" is) are kept. In the end, they have a population of cells that produce, hopefully, at least one better binder than they started with. And, hopefully, most of the binders will have converged into one or several highly similar sequences, indicating that that one is, in fact, the very best they could get. I'm sure I'm getting some of this wrong, but hopefully you get the general gist of it.

    They are interested in knowing the amino acid composition of the different positions as the library goes through the selection process, in order to be able to follow which positions/amino acids are important. For example, position X might start out as 100 % Gly (non-mutated), then change to 75/25 Gly/Asp, 50/50 Gly/Asp and finally 100 % Asp over the various selection rounds. What they have always done is to simply take around 100 E. coli colonies and send them off to Sanger sequencing, and hope that what they sampled is more or less representative (if it's not, it's not the end of the world, seeing as what actually matters is the binder at the end and whether it has better binding, as measured by downstream experiments).

    They are also interested in the proportion of each sequence in the pool of sequences in each selection round. For example, they have a binder that is 300 amino acids long, and want to know how many different variants of this sequence exist in each selection round. The idea is to follow the best binder as it increases in proportion compared to the lesser binders.

    Somebody said, "Why don't we send it off for high-throughput sequencing instead?" They talked to some other bioinformatician they are working with, and it seems they're on their way. It got me thinking, though... how would you do this? I have some ideas, but would love to hear what you guys think!

    I'm thinking that you probably wouldn't have to do any kind of alignment, and that you'd only need to count raw reads. For the first part (amino acid composition), you'd need to know where the actual sequence starts, but that should be doable by just looking at the adapters and starting with position 1 straight afterwards. Then it becomes a simple counting problem, iterating over every read and adding up the amino acids and/or nucleotides as desired. Problems would arise if the sequence is longer than you can sequence. Maybe create some custom primers that can start sequencing at a specific part of the binder sequence, thus covering the whole sequence?
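The counting idea above could be sketched roughly as follows. This is only a minimal illustration, assuming single-end reads held in memory as plain strings and an exact adapter match; the adapter sequence, the toy reads, and the function names are made up for the example.

```python
from collections import Counter

def trim_after_adapter(read, adapter):
    """Return the part of the read that follows the adapter, or None if absent."""
    idx = read.find(adapter)
    if idx == -1:
        return None
    return read[idx + len(adapter):]

def position_counts(reads, adapter, length):
    """Count nucleotides at each of the first `length` positions after the adapter."""
    counts = [Counter() for _ in range(length)]
    for read in reads:
        insert = trim_after_adapter(read, adapter)
        if insert is None or len(insert) < length:
            continue  # skip reads lacking the adapter or too short to cover the region
        for i in range(length):
            counts[i][insert[i]] += 1
    return counts

# Hypothetical toy data: "ACGT" stands in for the real adapter sequence.
reads = ["ACGTGGTGAT", "ACGTGGTGAC", "ACGTGATGAT", "TTTTGGTGAT"]
counts = position_counts(reads, adapter="ACGT", length=6)
for pos, counter in enumerate(counts, start=1):
    print(pos, dict(counter))
```

Real reads would need quality filtering and some tolerance for adapter mismatches, which this sketch leaves out.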

    The second part (proportion of unique sequences) I'm not so sure about. You'd need to do some kind of alignment, but disallowing any kind of mismatches. Again, lengths longer than the reads would mess it up, I'm guessing... You'd need to create full sequences from shorter reads, but it's not like the reads come from different genes; a lot of reads are going to be really, really similar, possibly only differing in 1-3 nucleotides, depending on the design of the binder library. This, I feel, is a more difficult problem, but maybe there's a simple solution I'm just not seeing?
    Last edited by ErikFas; 11-17-2016, 12:27 AM.

  • #2
    The experiment is feasible but probably not ideal due to technical limitations. The 300 amino acid gene is 900 bp long, which exceeds the read length of the most common platforms, while the longer-read platforms have high error rates that make them unsuitable for variant analysis. So the best option would be to sequence the gene as three 300 bp amplicons, using paired-end 300 bp sequencing for error correction (to detect low-frequency variants). But you would lose connectivity information between the amplicons (which may be important if distal variants are co-dependent) and, given the nearly identical sequences, there's no easy way to resolve that problem. So, all of the analyses would be at the amplicon (not full gene) level, although you'll be able to make inferences based on relative frequencies (which you may decide to validate by limited Sanger sequencing).

    For the proportions of unique sequences, a simple string frequency counter would suffice. For amino acid analysis, you'd need to translate the sequences because of degeneracy in the genetic code. Then, it would be trivial to count the frequency of each amino acid at each position. But some changes are likely to be interdependent (even within an amplicon), so it would probably be more useful to discriminate haplotypes (perhaps for only the most abundant subset of variants).
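As a rough sketch of the two counts described above, assuming the amplicon sequences are already trimmed, in frame, and of equal length (the toy sequences and function names are hypothetical):

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate an in-frame DNA string; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def fractions(items):
    """Map each distinct item to its share of the total."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}

# Toy selection round: GGT and GGC both encode Gly, GAT encodes Asp.
inserts = ["GGTGAT", "GGCGAT", "GATGAT", "GGTGAT"]

dna_fracs = fractions(inserts)                 # string frequency counter
proteins = [translate(s) for s in inserts]
protein_fracs = fractions(proteins)            # synonymous variants collapse
per_position = [fractions(col) for col in zip(*proteins)]

print(dna_fracs)
print(protein_fracs)
print(per_position)
```

Counting whole protein strings (`protein_fracs`) keeps the haplotype information within an amplicon, whereas the per-position table discards it.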


    • #3
      Thank you for the response! What would be the longest gene in base pairs you feel could be sequenced, then? The platform that is being discussed gives 350 bp reads, if I heard them correctly.


      • #4
        Current sequencer specs can be found here. But you'll need overlapping paired-end data for error correction, which means 300bp max on the MiSeq. Longer amplicons are possible with partial read overlap, at the cost of increased errors in the non-overlapping ends.
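To illustrate what error correction from overlapping pairs means, here is a simplified sketch for the fully overlapping case (amplicon length equal to read length). It assumes both reads cover exactly the same bases and that qualities are given as lists of integer Phred scores; real read mergers also have to locate the overlap and handle indels, which this does not attempt. All names and data are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def consensus_full_overlap(fwd, fwd_qual, rev, rev_qual):
    """Merge a fully overlapping read pair, keeping the higher-quality base
    wherever the two reads disagree ('N' on a quality tie)."""
    rev_rc = reverse_complement(rev)
    rev_q = rev_qual[::-1]  # qualities must be reversed along with the read
    if len(fwd) != len(rev_rc):
        raise ValueError("this sketch only handles reads of equal length")
    merged = []
    for b1, q1, b2, q2 in zip(fwd, fwd_qual, rev_rc, rev_q):
        if b1 == b2:
            merged.append(b1)
        elif q1 > q2:
            merged.append(b1)
        elif q2 > q1:
            merged.append(b2)
        else:
            merged.append("N")
    return "".join(merged)

# Toy pair: the forward read has a low-quality error at its fourth base.
fwd, fwd_qual = "ACGATA", [30, 30, 30, 5, 30, 30]
rev, rev_qual = "TAACGT", [30, 30, 30, 30, 30, 30]
merged = consensus_full_overlap(fwd, fwd_qual, rev, rev_qual)
print(merged)
```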

        Since the instrument will produce MUCH more data than you'll need, you may be able to recover some haplotype information from overlapping amplicons (e.g., 1-300bp, 150-450, 300-600, 450-750, and 600-900). The only added expense is library construction, which is minimal (primers for PCR). But my guess is that their utility will be limited, given the sequence similarity.
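The tiling in that example can be written down as a small helper. This is just a sketch using 0-based, half-open coordinates, so (0, 300) corresponds to bases 1-300 in the post; the function name is made up, and primer placement constraints are ignored.

```python
def tile_amplicons(gene_length, amplicon_length, step):
    """Return (start, end) windows of fixed length that tile the gene with overlap."""
    if amplicon_length > gene_length:
        raise ValueError("amplicon is longer than the gene")
    starts = list(range(0, gene_length - amplicon_length + 1, step))
    # Add a final window flush with the 3' end if the step leaves a gap.
    if starts[-1] + amplicon_length < gene_length:
        starts.append(gene_length - amplicon_length)
    return [(s, s + amplicon_length) for s in starts]

tiles = tile_amplicons(gene_length=900, amplicon_length=300, step=150)
print(tiles)
```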
        Last edited by HESmith; 11-17-2016, 06:11 AM.


        • #5
          A guy in my lab space (Jim Stapleton, he is an independent researcher) has a long pseudo-molecule approach that might be what you want:
          Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise.


          Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing

          He is using it for exactly what you describe, to get full haplotypes of variants too long for existing read lengths with high accuracy. I don't know if he wants his current e-mail posted on a web site, so message me if you want to follow up.
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com


          • #6
            The approach recommended by @SNPsaurus is conceptually similar to a low-throughput Moleculo-type library, and is definitely applicable for the in silico assembly of longer (~10^4 bp) fragments. However, it's unclear how useful it would be for the OP's application. The method requires unique 5' and 3' barcodes for each clone to be sequenced, which puts a practical limit on the number of clones to screen. The scale of that approach is not significantly greater than the existing method of ~100 Sanger-sequenced clones, and the latter is undoubtedly cheaper and easier to analyze computationally.


            • #7
              The difference between a low-throughput Moleculo library and the method I linked to is that each long DNA molecule is tagged by a randomer which is then copied onto the short derivative fragments needed for sequencing on Illumina. Jim sequences libraries of >100,000 long DNA molecules and gets the full haplotype of each, so it seems more suitable for assessing the presence of different variants in a complex library when those variants are separated by moderately long distances.
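The general idea of using a randomer to tie short reads back to their parent molecule can be sketched like this. It is a deliberately simplified illustration, not the published protocol: it assumes each short read starts with a fixed-length tag copied from its parent molecule, and it ignores tag sequencing errors and the later assembly step. The tag length, reads, and function name are hypothetical.

```python
from collections import defaultdict

def group_by_barcode(tagged_reads, barcode_length):
    """Group short reads by the tag found at the start of each read."""
    groups = defaultdict(list)
    for read in tagged_reads:
        if len(read) <= barcode_length:
            continue  # nothing left after the tag
        groups[read[:barcode_length]].append(read[barcode_length:])
    return dict(groups)

# Toy reads: a 4-nt tag followed by a fragment of the parent molecule.
tagged_reads = ["AAGTACGTAC", "AAGTGTACGG", "CCTAACGTAC"]
groups = group_by_barcode(tagged_reads, barcode_length=4)
for tag, fragments in groups.items():
    print(tag, fragments)
```

Each group would then be assembled separately, so that nearly identical variants from different molecules are never mixed together.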


              • #8
                By conceptually similar to Moleculo, I meant that the short reads derived from a single long fragment are identified by the presence of a unique barcode/index. But I can see how this method scales much better than Moleculo, in that the 5' and 3' barcodes are randomly ligated and the matching pairs determined by sequencing. I also like the mate-pair-style fragmentation and circularization to randomize the flanking sequences - clever. Thanks for the reference and clarification.
