  • Sequencing low diversity samples on the MiSeq

    For TL;DR, just skip to the pictures at the bottom of the post.

    Not sure if everyone even knows what "low diversity" means in this context. Let me give you a worst case scenario: we use the MiSeq to sequence PCR product derived from 16S V3 loop primers. What this implies is that if we take no other action, and just cluster and run these amplicons, over the first 20 bases of sequence every single cluster will read exactly the same base -- those bases from the V3 loop primer itself. That is low sample diversity -- zero sample diversity in this extreme case.
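To make "zero sample diversity" concrete, here is a minimal Python sketch that measures base diversity per cycle as Shannon entropy. The function name and the toy primer are assumptions for illustration only; the primer is a placeholder, not a real V3 loop primer.

```python
import math
from collections import Counter


def per_cycle_entropy(reads, cycles):
    """Shannon entropy (in bits) of the base composition at each cycle.

    0 bits means every read shows the same base at that cycle;
    2 bits means A, C, G and T are equally represented.
    """
    entropies = []
    for i in range(cycles):
        counts = Counter(r[i] for r in reads if len(r) > i)
        total = sum(counts.values())
        h = sum((n / total) * math.log2(total / n) for n in counts.values())
        entropies.append(h)
    return entropies


# Toy amplicon reads: identical placeholder primer, then divergent sequence.
primer = "ACGTACGTAC"  # hypothetical, for illustration only
reads = [primer + tail for tail in ("AAAA", "CCCC", "GGGG", "TTTT")]
entropy = per_cycle_entropy(reads, len(primer) + 4)

for cycle, h in enumerate(entropy, start=1):
    print(f"cycle {cycle:2d}: {h:.2f} bits")
```

Every cycle inside the primer comes out at 0 bits, which is the worst case described above.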

    No need to suggest work-arounds to me, I think I am familiar with them all. Here I just want to give you a "case study" and a little background on what I would call the current state-of-the-art.

    Please note that this topic has been addressed in other threads. Nothing here is particularly new or shocking, but I think an additional data point will be helpful.

    If one wanted to choose the perennial Illumina issue, it would be the problems one encounters when sequencing low diversity libraries.

    While Illumina generally tackles major issues head on and eventually solves them, the low diversity sequencing issue for some reason seems to be the one they just can't find the fortitude to directly address.

    To tell you the truth, on the HiSeq it is less of an issue because only a tiny percentage of our libraries are low diversity by necessity for this instrument.

    However one of the stated goals of the MiSeq is to entirely obsolete the 454. Obviously to reach that goal you have to be able to do what they call "amplicon" work. And this can include sequencing amplicons derived from a single PCR primer pair.

    This is not possible on the MiSeq without using some of the workarounds. (Note I am talking v2 2x250 base MiSeq reads here.) But I wanted none of them to involve telling an investigator they had to change the way they were constructing the libraries to increase diversity. So here are the ones that remain:

    (1) Spike in a percentage of some genomic DNA library (or several of them). For a zero diversity library I would pick 50%, but it is said one can drop to lower amounts using the "hard coding" work around I will mention below.

    (2) Lower cluster density. I chose 8 pM. This gets me into the 700-800 K Clusters/mm^2 range. Not sure how important this is.

    (3) Hard code the matrix and phasing/prephasing values. This is the most "hard core" of the hacks. Basically it allows you to use a previous run as a "control lane" for your current run.
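As a rough illustration of what workaround (1) buys you, the sketch below computes what fraction of clusters still shows the amplicon's base at a zero diversity cycle for a given ballast fraction. It assumes the ballast library is perfectly balanced (25% of each base) and that the loaded fraction equals the clustered fraction; both are simplifications, and the function name is made up for this example.

```python
def dominant_base_fraction(ballast_fraction, ballast_base_share=0.25):
    """Fraction of clusters showing the amplicon's base at a zero-diversity cycle.

    Assumes the ballast contributes `ballast_base_share` of its clusters
    to that same base (0.25 for a perfectly balanced genomic library).
    """
    if not 0.0 <= ballast_fraction <= 1.0:
        raise ValueError("ballast_fraction must be between 0 and 1")
    amplicon_fraction = 1.0 - ballast_fraction
    return amplicon_fraction + ballast_fraction * ballast_base_share


for f in (0.0, 0.06, 0.10, 0.50):
    share = dominant_base_fraction(f)
    print(f"{f:4.0%} ballast -> {share:.1%} of clusters show the same base")
```

Under these assumptions, even a 50% spike leaves well over half of the clusters showing one base during the primer cycles.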

    While Illumina will gladly recommend the first 2 options, as well as attempting to browbeat you into different library prep methodologies, the 3rd option is one they seem loath to offer at all. I think this is partially because the "heavy" version of this requires converting data contained in files from a previous run into the appropriate XML format and embedding that in a MiSeq configuration file. Lots of ways this can go wrong and not work at all, I think.

    Anyway, for a good description of the issue and both the "heavy" and the "lite" solutions, there is a canonical site you can peruse.

    To run 500 cycle kits you use a v2 MiSeq. Somewhat disconcertingly, the above-mentioned site seems to make zero mention of v2 MiSeqs. Neither do the documents I was able to obtain from Illumina. The site does mention what I am referring to as the "lite" hard coding method. Instead of actually hacking your MiSeq configuration XML, you just copy and rename a couple of files from your control run into RTA's root directory. Then, ostensibly, RTA will make some sort of assessment of your data early in the run. Should it deem it "low diversity", it will use the data from those files to set the matrix and phasing/pre-phasing values.
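For anyone wondering why the phasing/pre-phasing values matter so much over 250 cycles, here is a toy Python model. It uses the common simplification that a fixed fraction of strands in a cluster falls behind (phasing) or jumps ahead (pre-phasing) at every cycle. This is not how RTA computes anything, and the rates below are made-up illustration values, not numbers from either run.

```python
def in_phase_fraction(cycle, phasing, prephasing):
    """Fraction of strands in a cluster still in step after `cycle` cycles.

    Toy model: each cycle, a fraction `phasing` lags by one base and a
    fraction `prephasing` runs ahead by one base, independently.
    """
    return (1.0 - phasing - prephasing) ** cycle


# Hypothetical per-cycle rates, chosen only to show the trend.
phasing, prephasing = 0.001, 0.002
for cycle in (1, 50, 100, 180, 250):
    frac = in_phase_fraction(cycle, phasing, prephasing)
    print(f"cycle {cycle:3d}: {frac:.1%} of strands in phase")
```

Small per-cycle rates compound over a long read, so a poor estimate of either value from low diversity data has more room to hurt late in the read.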

    Illumina tech support seemed unaware of this capability initially. They suggested I use the "heavy" method to make sure the hard coding actually happened.

    Here are the results from a "worst case low diversity amplicon set"

    without hard coding:


    with hard coding:


    Anyway, a couple of final points. First, the run using only 2 of the 3 workarounds still produced usable data. Also, much of the data assessment is the instrument's own, not really empirically determined. However, the "error rate" is said to be the result of real alignment to the phiX genome. There are some disturbing things going on there in both runs, although the hard coded run looks much better. Finally, this is a single run pair I am comparing. We all understand that makes the information presented anecdotal and that "Your Mileage May Vary".

    --
    Phillip

  • #2
    Phillip,

    Are these Nextera runs? How many samples were multiplexed and what is the average insert size? Are the two reads expected to overlap? Can you post example FastQC quality profile plots for one sample for the two variations of the runs you have posted above?

    We have some really difficult multiplex samples that have major quality value issues (which I suspect are artificial) in spite of using all three workarounds you have listed above. We are continuing to work with Illumina actively.

    BTW: Even for the first run the data are well within the published Illumina spec of >75% data at Q30 for a 2 x 250 bp run (except for read 1).
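For checking a ">75% at Q30" figure outside of SAV, a minimal Python sketch is below. It assumes Phred+33 quality strings as found in current Illumina FASTQ files; the function name and toy quality strings are made up for this example.

```python
def fraction_q30(quality_strings, threshold=30, offset=33):
    """Fraction of base calls with Phred quality >= threshold.

    `quality_strings` are FASTQ quality lines; `offset=33` assumes
    Phred+33 encoding.
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Toy quality lines: 'I' = Q40, '?' = Q30, '#' = Q2.
quals = ["IIII####", "??II"]
print(f"{fraction_q30(quals):.1%} of bases at or above Q30")
```

Running this per read (read 1 versus read 2) rather than over the whole run would show the kind of exception noted above.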
    Last edited by GenoMax; 02-18-2013, 09:10 AM. Reason: Additional question



    • #3
      Originally posted by GenoMax View Post
      Phillip,

      Are these Nextera runs?
      They are V3/V4 loop amplicons indexed using TruSeq Custom Amplicon (TSCA) style (sequence) dual indexes. The customer makes a "fusion primer" combining their locus and the proximal TruSeq adapter up to where the index is. They then reamplify with index-containing primers that overlap that proximal TruSeq adapter sequence, but not the locus-specific primer. Then they purify and pool their samples before passing them to us. We usually do an additional Ampure clean-up to shake loose a little more of the primer dimers.
      Originally posted by GenoMax View Post
      How many samples were multiplexed and what is the average insert size?
      In this case, all 96 TSCA index pairs, plus 3 single index TruSeq samples that carry the genomic libraries ("ballast") to increase sequence diversity. Insert size was about 400-450 bp.
      Originally posted by GenoMax View Post
      Are the two reads expected to overlap?
      Yes. The idea is to merge them so they can be run through a 454 QIIME pipeline by the customer.
      Originally posted by GenoMax View Post
      Can you post example FastQC quality profile plots for one sample for the two variations of the runs you have posted above?
      I can after those get generated for the hard coded run.

      Originally posted by GenoMax View Post
      We have some really difficult multiplex samples that have major quality value issues (which I suspect are artificial) in spite of using all three workarounds you have listed above. We are continuing to work with Illumina actively.

      BTW: Even for the first run the data are well within the published Illumina spec of >75% data at Q30 for a 2 x 250 bp run (except for read 1).
      Yes, I felt the run almost made it without requiring hard coding. Also, the PANDA merging results looked fine. I would think it was just a QV assignment problem, but I would not expect that to affect the error rate as depicted by SAV.

      --
      Phillip



      • #4
        Phillip,
        Could you post a quick diagram of that indexing method please?



        • #5
          Originally posted by genbio64 View Post
          Phillip,
          Could you post a quick diagram of that indexing method please?


          The arrow is a cartoon of the TSCA left adapter. The orange box denotes the 8 base "i5" index. The green box, as labelled, is some locus-specific sequence. The 1st PCR primer fuses the locus-specific sequence with 33 bases of the proximate end of the TSCA adapter. The 2nd PCR primer overlaps the first by 20 bases.

          You would also need the right adapter oligos. Basically the same design but with slightly different lengths.

          For 96 indexes, you would want 8 i5 indexes and 12 i7 indexes. For 384 you would want 16 and 24, respectively.
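The counts are just the product of the two index sets. A small Python sketch, using made-up index labels rather than real index names or sequences:

```python
from itertools import product


def dual_index_pairs(i5_names, i7_names):
    """Every (i5, i7) combination, one per sample."""
    return list(product(i5_names, i7_names))


# Hypothetical labels standing in for the real index oligos.
i5 = [f"i5_{n:02d}" for n in range(1, 9)]    # 8 i5 indexes
i7 = [f"i7_{n:02d}" for n in range(1, 13)]   # 12 i7 indexes

pairs = dual_index_pairs(i5, i7)
print(len(pairs), "index pairs, e.g.", pairs[0], "and", pairs[-1])
```

With 16 i5 and 24 i7 labels the same function returns 384 pairs.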

          I actually screwed up on the right-side oligos and included the reverse complements of the TSCA i7 indexes. But as long as one puts the right sequences in the sample sheet, everything works out okay.
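If you need to flip index sequences before entering them in a sample sheet, a reverse complement helper is only a couple of lines of Python. The index below is a placeholder, not one of the real i7 sequences, and which orientation the sample sheet needs depends on how your oligos were ordered, so check against a known sample.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


index_as_ordered = "AAGGCTAT"  # hypothetical 8-base index
index_flipped = reverse_complement(index_as_ordered)
print(index_as_ordered, "->", index_flipped)
```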

          --
          Phillip



          • #6
            I'll say that we've had good success using 50% genomic DNA from an organism we needed sequenced anyway (or felt like getting data on), with our amplicon library carrying a 12 bp random barcode on the front end. We had at least 92 libraries on our run; it didn't make much sense to have fewer than that for the cost (pre-cluster all the libraries in house, and hand over an "amplicon" tube that the center could prep as usual). Our forward read was great, with issues on the reverse. We're working around that now (double barcoding, or something else we're going to try, and perhaps publish on if it works). Either way, we had enough data from the forward read to move ahead. These are 16S rDNA libraries, by the way, with primers from the ARB group's recent publication on designing better universal primers.



            • #7
              Yes, we usually have the problems with the second read. In fact, this was the first time I had seen a problematic 1st read but a good 4th read.

              --
              Phillip



              • #8
                Phillip: Curious to see if the quality patterns changed at all between the two runs for a specific sample. You were going to post quality plots.

                I think the new version MCS v.2.1.1.13 has done the most so far to improve the qualities (along with the new batch of kits which are performing well) but we are not there yet.



                • #9
                  Originally posted by GenoMax View Post
                  Phillip: Curious to see if the quality patterns changed at all between the two runs for a specific sample. You were going to post quality plots.

                  I think the new version MCS v.2.1.1.13 has done the most so far to improve the qualities (along with the new batch of kits which are performing well) but we are not there yet.
                  Sorry, that is going to take a while longer. Our servers are completely hammered at the moment with a HiSeq run that just came off, and FastQC was hanging, so Rick had to kill off those processes.

                  Do you usually see differences between FastQC's assessment of the quality of a run and SAV's? I posted the SAV quality heat map.

                  --
                  Phillip



                  • #10
                    Originally posted by pmiguel View Post
                    Sorry that is going to take a while longer.
                    No Problem.

                    Originally posted by pmiguel View Post

                    Do you usually see differences between FastQC's assessment of the quality of a run and SAV's? I posted the SAV quality heat map.

                    --
                    Phillip
                    SAV shows an average representation of the values for all samples. I am interested to see if the actual quality values changed from one run to the other for individual sample(s). If you can, pick a sample that had an overall low mean Q-value (based on the demultiplex summary report). OTOH, you may not have any, if all your pooled samples look more or less the same.



                    • #11
                      We've been sequencing recombined human antibody genes, which are pretty low diversity, especially at the start of both paired reads. In our case, it's especially critical that we get good quality for most of the read length. The amplicons are about 400bp in length, and we must be able to merge the forward/reverse reads into a single amplicon -- unmerged reads are essentially useless.

                      We've had the same sort of low-diversity issues that the 16S folks have had, but came up with a different solution. We mostly use off-site sequencing providers, so we wanted our method to be dependent on sample prep as much as possible, to allow us flexibility in selecting providers (some were unwilling to perform the 'hard-core' hack mentioned above). What we did was "offset" the reads by inserting varying numbers of N's between the sequencing primer and the gene-specific amplification primer. It turns out that multiples of 2 N's work best (-NN-, -NNNN-, -NNNNNN-, etc). Not sure why, but my guess is that adjacent clusters that are offset by only a single position can mess with phasing/prephasing calculations. Of course, this method entails making your own fusion primers, but that's something we were willing to do. In combination with other fairly common low-diversity techniques (high PhiX spike-in, lower cluster density), this approach has worked very well.
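To illustrate why the staggering helps, here is a small Python sketch that counts how many different bases the instrument could see at each cycle when the same primer starts at several offsets. The primer is a placeholder, not our real gene-specific primer, and each N position is simply treated as showing all four bases, which assumes the N's are synthesized as an even mix.

```python
def distinct_bases_per_cycle(primer, offsets, cycles):
    """Number of distinct bases visible at each cycle across offset variants.

    `offsets` lists the N-spacer lengths in use (0 means no spacer).
    Spacer positions count as all four bases (assumes evenly mixed N's).
    """
    result = []
    for c in range(cycles):
        seen = set()
        for k in offsets:
            if c < k:
                seen.update("ACGT")      # still inside the N spacer
            elif c - k < len(primer):
                seen.add(primer[c - k])  # inside the shifted primer
        result.append(len(seen))
    return result


primer = "ACGTTGCAAGCTTACGGATC"  # hypothetical 20-base primer
unstaggered = distinct_bases_per_cycle(primer, (0,), len(primer))
staggered = distinct_bases_per_cycle(primer, (0, 2, 4, 6), len(primer))

print("no offsets:     ", unstaggered)
print("0/2/4/6 offsets:", staggered)
```

Without offsets every cycle shows exactly one base. With the staggered set, most cycles show two or more, depending on how repetitive the primer is.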

                      Here's what the Qscores look like without the offset primers:



                      And with the offset primers:



                      • #12
                        Phillip, thanks for your post. Do I understand correctly, that you used 50% phiX? That would confirm our observation that phiX spiking is of limited effect with the v2 kits.

                        When not using hardcoded phasing we see pretty consistently what you are showing: read4 somehow is better than read1. This seems to be connected to the prephasing value. For some unknown reason, prephasing is calculated very high for the forward read and low for the reverse read.
                        2 non hardcoded examples with about 6% phiX spike and amplicons (12 different ones)
                        Attached Files
                        Last edited by Vinz; 02-21-2013, 12:48 AM.



                        • #13
                          When using hardcoded matrix/phasing we get Q30 success rates of above 75%, usually above 80%.
                          In contrast to what Illumina is saying we see no positive effect of:
                          - spiking more than 10% phiX
                          - reducing cluster density (700 to 1000 seems to be fine)
                          Attached Files



                          • #14
                            Originally posted by BBthekid007 View Post
                            We've been sequencing recombined human antibody genes, which are pretty low diversity, especially at the start of both paired reads. In our case, it's especially critical that we get good quality for most of the read length. The amplicons are about 400bp in length, and we must be able to merge the forward/reverse reads into a single amplicon -- unmerged reads are essentially useless.

                            We've had the same sort of low-diversity issues that the 16S folks have had, but came up with a different solution. We mostly use off-site sequencing providers, so we wanted our method to be dependent on sample prep as much as possible, to allow us flexibility in selecting providers (some were unwilling to perform the 'hard-core' hack mentioned above). What we did was "offset" the reads by inserting varying numbers of N's between the sequencing primer and the gene-specific amplification primer. It turns out that multiples of 2 N's work best (-NN-, -NNNN-, -NNNNNN-, etc). Not sure why, but my guess is that adjacent clusters that are offset by only a single position can mess with phasing/prephasing calculations. Of course, this method entails making your own fusion primers, but that's something we were willing to do. In combination with other fairly common low-diversity techniques (high PhiX spike-in, lower cluster density), this approach has worked very well.

                            Here's what the Qscores look like without the offset primers:



                            And with the offset primers:

                            Yes, your libraries then become effectively diverse by your systematically offsetting them. That is one of the methods Illumina wants you to use.

                            If I were making the libraries myself, I would probably employ a method something like that. But, although it is simple enough to understand if you are intimately familiar with this aspect of Illumina instruments, I just feel like I am making the world a worse place to live in every time I try to explain this stuff to a customer. Things are complex enough without added strange work-arounds to avoid bugs in an instrument system design.

                            The real solution needs to come from Illumina, but they aren't going to bother doing it unless they get enough complaints.

                            --
                            Phillip



                            • #15
                              Originally posted by Vinz View Post
                              Phillip, thanks for your post. Do I understand correctly, that you used 50% phiX? That would confirm our observation that phiX spiking is of limited effect with the v2 kits.

                              When not using hardcoded phasing we see pretty consistently what you are showing: read4 somehow is better than read1. This seems to be connected to the prephasing value. For some unknown reason, prephasing is calculated very high for the forward read and low for the reverse read.
                              2 non hardcoded examples with about 6% phiX spike and amplicons (12 different ones)
                              Sort of. I don't like to waste sequencing capacity on phiX, so I allow the customers to give us some genomic DNA they want sequenced and construct library(ies) from that.

                              We have a lot of "worst case" single amplicon projects, so I think we will continue spiking in 50% ballast libraries to help even those out. Also we will use hard coding.

                              Question: are your amplicons short enough to overlap the reads? For the run we describe above, the amplicons have 450 bp inserts. So for a paired read merge (Rick uses PANDA, but seems like most people use FLASH), one would expect to need high quality sequence over the entire length of both reads to effect a good merge. However, mysteriously, we had very high rates of successful merges even though the quality drops very low past 180 bases for read 1.
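The arithmetic behind that expectation, as a short Python sketch (function names are made up for this example, and adapters and primer trimming are ignored):

```python
def expected_overlap(read_length, insert_size):
    """Bases covered by both reads of a pair, ignoring trimming."""
    if insert_size < read_length:
        return insert_size  # both reads span the whole insert
    return max(0, 2 * read_length - insert_size)


def read1_overlap_start(read_length, insert_size):
    """1-based position in read 1 where read 2 coverage begins."""
    return max(1, insert_size - read_length + 1)


overlap = expected_overlap(250, 450)
start = read1_overlap_start(250, 450)
print(f"2x250 on a 450 bp insert: {overlap} bp overlap, "
      f"starting at read 1 position {start}")
```

So for a 450 bp insert the whole overlap sits in the last 50 bases of each read, well past the point where read 1 quality drops.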

                              This could simply be a case of the instrument mistakenly assigning low quality values while correctly assigning the base calls. However, as you can see from the graphs above, the phiX-calculated error rates become very high at the point where the quality values become low. My understanding is that these were empirically determined error rates. That is, that RTA actually aligns the reads to phiX and calculates the error rate from disagreements in the alignment at a particular base.
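What I mean by an empirically determined error rate is roughly the following, sketched in Python. This is only my reading of the idea, not RTA's actual algorithm: it takes ungapped forward-strand placements, ignores indels, and uses a toy reference rather than the phiX genome.

```python
def per_cycle_error_rate(placements, reference):
    """Mismatch rate at each cycle from ungapped read placements.

    `placements` is a list of (start, read) tuples, where `start` is the
    0-based reference position of the read's first base.
    """
    n_cycles = max(len(read) for _, read in placements)
    mismatches = [0] * n_cycles
    totals = [0] * n_cycles
    for start, read in placements:
        for cycle, base in enumerate(read):
            ref_pos = start + cycle
            if ref_pos >= len(reference):
                break  # read runs off the end of the reference
            totals[cycle] += 1
            if base != reference[ref_pos]:
                mismatches[cycle] += 1
    return [m / t if t else 0.0 for m, t in zip(mismatches, totals)]


reference = "ACGTACGTAC"  # toy reference, not phiX
placements = [(0, "ACGTAC"), (2, "GTACGA"), (4, "ACGTAT")]
rates = per_cycle_error_rate(placements, reference)

for cycle, rate in enumerate(rates, start=1):
    print(f"cycle {cycle}: {rate:.2f}")
```

A rate computed this way never looks at the quality values, which is why a jump in error rate alongside the quality drop is hard to explain as a pure QV assignment problem.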

                              What do you think? Is RTA actually "cheating" and just using quality values to assign the error rate? Something else?

                              Are you able to merge your forward/reverse reads?

                              --
                              Phillip
