SEQanswers

Old 02-18-2013, 06:37 AM   #1
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default Sequencing low diversity samples on the MiSeq

For TL;DR, just skip to the pictures at the bottom of the post.

Not sure if everyone even knows what "low diversity" means in this context. Let me give you a worst case scenario: we use the MiSeq to sequence PCR product derived from 16S V3 loop primers. What this implies is that if we take no other action, and just cluster and run these amplicons, over the first 20 bases of sequence every single cluster will read exactly the same base -- those bases from the V3 loop primer itself. That is low sample diversity -- zero sample diversity in this extreme case.
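As an illustration (my sketch, not from the thread): low diversity shows up directly in the base calls if you measure, at each cycle, the fraction of clusters calling the most common base. A shared primer region pins that fraction at 1.0, which is exactly the zero-diversity situation described above.

```python
from collections import Counter

def per_position_max_base_fraction(reads):
    """For each cycle (position), return the fraction of reads that share
    the most common base there; 1.0 means zero diversity at that cycle."""
    fractions = []
    for column in zip(*reads):  # one column per sequencing cycle
        top_count = Counter(column).most_common(1)[0][1]
        fractions.append(top_count / len(column))
    return fractions

# Toy data: the first four bases mimic a shared primer region.
reads = ["ACGTAC", "ACGTTG", "ACGTCA", "ACGTGT"]
print(per_position_max_base_fraction(reads))
# [1.0, 1.0, 1.0, 1.0, 0.25, 0.25]
```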

No need to suggest work-arounds to me, I think I am familiar with them all. Here I just want to give you a "case study" and a little background on what I would call the current state-of-the-art.

Please note that this topic has been addressed in other threads. Nothing here is particularly new or shocking. But I think an additional data point will be helpful.

If one wanted to choose the perennial Illumina issue, it would be the problems one encounters sequencing low diversity libraries.

While Illumina generally tackles major issues head on and eventually solves them, the low diversity sequencing issue for some reason seems to be the one they just can't find the fortitude to directly address.

To tell you the truth, on the HiSeq it is less of an issue, because only a tiny percentage of the libraries we run on that instrument are, by necessity, low diversity.

However one of the stated goals of the MiSeq is to entirely obsolete the 454. Obviously to reach that goal you have to be able to do what they call "amplicon" work. And this can include sequencing amplicons derived from a single PCR primer pair.

This is not possible on the MiSeq without using some of the workarounds. (Note I am talking about v2 2x250 base MiSeq reads here.) But I wanted none of them to involve telling an investigator they had to change the way they were constructing their libraries to increase diversity. So here are the ones that remain:

(1) Spike in a percentage of some genomic DNA library (or several of them). For a zero diversity library I would pick 50%, but it is said one can drop to lower amounts using the "hard coding" work around I will mention below.

(2) Lower cluster density. I chose 8 pM. This gets me into the 700-800 K Clusters/mm^2 range. Not sure how important this is.

(3) Hard code the matrix and phasing/prephasing values. This is the most "hard core" of the hacks. Basically it allows you to use a previous run as a "control lane" for your current run.

While Illumina will gladly recommend the first 2 options, as well as attempting to browbeat you into different library prep methodologies, the 3rd option is one they seem loath to offer at all. I think this is partially because the "heavy" version of it requires converting some data contained in files from a previous run into the appropriate XML format and embedding that in a MiSeq configuration file. Lots of ways this can go wrong and not work at all, I think.

Anyway, for a good description of the issue and both the "heavy" and the "lite" solutions, there is a canonical site you can peruse.

To run 500 cycle kits you use a v2 MiSeq. Somewhat disconcertingly, the above-mentioned site seems to make zero mention of v2 MiSeqs. Neither do documents I was able to obtain from Illumina. It does mention what I am referring to as the "lite" hard coding method. Instead of actually hacking your MiSeq configuration XML, you just copy and rename a couple of files from your control run into RTA's root directory. Then, ostensibly, RTA will make some sort of assessment of your data early in the run. Should it deem it "low diversity", it will use the data from those files to set the matrix and phasing/pre-phasing values.

Illumina tech support seemed unaware of this capability initially. They suggested I use the "heavy" method to make sure the hard coding actually happened.

Here are the results from a "worst case low diversity amplicon set".

without hard coding:


with hard coding:


Anyway, a couple of final points. First, the run using only 2 of the 3 workarounds still produced usable data. Also, much of the data assessment is the instrument's own, not really empirically determined. However, the "error rate" is said to be the result of real alignment to the phiX genome. There are some disturbing things going on there in both runs, although the hard coded run looks much better. Finally, this is a single run pair I am comparing. We all understand that makes the information presented anecdotal and that "Your Mileage May Vary".

--
Phillip
Old 02-18-2013, 08:07 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Phillip,

Are these Nextera runs? How many samples were multiplexed, and what is the average insert size? Are the two reads expected to overlap? Can you post example FastQC quality profile plots for one sample for the two variations of the run you have posted above?

We have some really difficult multiplex samples that have major quality value issues (which I suspect are artificial) in spite of using all three workarounds you have listed above. We are continuing to work with Illumina actively.

BTW: Even for the first run the data are well within the published Illumina spec of >75% of data at Q30 for a 2 x 250 bp run (except for read 1).

Last edited by GenoMax; 02-18-2013 at 08:10 AM. Reason: Additional question
Old 02-18-2013, 08:51 AM   #3
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by GenoMax View Post
Phillip,

Are these nextera runs?
They are V3/V4 loop amplicons indexed with TruSeq Custom Amplicon (TSCA) style dual indexes (same sequences). The customer makes a "fusion primer" combining their locus and the proximal TruSeq adapter up to where the index is. They then reamplify with index-containing primers that overlap that proximal TruSeq adapter sequence, but not the locus-specific primer. Then they purify and pool their samples before passing them to us. We usually do an additional Ampure clean-up to shake loose a little more of the primer dimers.
Quote:
Originally Posted by GenoMax View Post
How many samples were multiplexed and what is the average insert size?
In this case, all 96 TSCA index pairs, plus 3 single index TruSeq samples that carry the genomic libraries ("ballast") to increase sequence diversity. Insert size was about 400-450 bp.
Quote:
Originally Posted by GenoMax View Post
Are the two reads expected to overlap?
Yes. The idea is to merge them so they can be run through a 454 QIIME pipeline by the customer.
Quote:
Originally Posted by GenoMax View Post
Can you post example FastQC quality profile plots for one sample for the two variation of the runs you have posted above?
I can after those get generated for the hard coded run.

Quote:
Originally Posted by GenoMax View Post
We have some really difficult multiplex samples that have major quality value issues (which I suspect are artificial) in spite of using all three workarounds you have listed above. We are continuing to work with Illumina actively.

BTW: Even for the first run the data are well within the published illumina spec of >75% data at Q30 for a 2 x 250 bp run (except for read 1).
Yes, I felt the run almost made it without requiring hard coding. Also the PANDA merging results looked fine. I would think it was just a QV assignment problem, but I would not expect that to affect the error rate as depicted by SAV.

--
Phillip
Old 02-18-2013, 09:09 AM   #4
genbio64
Member
 
Location: New York

Join Date: Dec 2009
Posts: 42
Default

Phillip,
Could you post a quick diagram of that indexing method please?
Old 02-18-2013, 10:59 AM   #5
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by genbio64 View Post
Phillip,
Could you post a quick diagram of that indexing method please?


The arrow is a cartoon of the TSCA left adapter. The orange box denotes the 8 base "i5" index. The green box, as labelled, is some locus-specific sequence. The 1st PCR primer fuses the locus-specific sequence with 33 bases of the proximate end of the TSCA adapter. The 2nd PCR primer overlaps the first by 20 bases.

You would also need the right adapter oligos. Basically the same design but with slightly different lengths.

For 96 indexes, you would want 8 i5 indexes and 12 i7 indexes. For 384, you would want 16 and 24, respectively.
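The dual-index combinatorics work out as follows (a minimal sketch; the index names below are placeholders, not real TSCA index sequences):

```python
from itertools import product

# Placeholder index names, just to show the combinatorics.
i5_indexes = [f"i5_{n:02d}" for n in range(1, 9)]    # 8 i5 indexes
i7_indexes = [f"i7_{n:02d}" for n in range(1, 13)]   # 12 i7 indexes

# Every i5/i7 combination yields one dual-index pair per sample.
pairs = list(product(i5_indexes, i7_indexes))
print(len(pairs))  # 96
```

With 16 and 24 indexes, the same product gives 384 pairs.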

I actually screwed up on the right-side oligos and included the reverse complements of the TSCA i7 indexes. But as long as one puts the right sequences in the sample sheet, everything works out okay.

--
Phillip
Old 02-18-2013, 01:37 PM   #6
bstamps
Member
 
Location: University of Oklahoma

Join Date: Oct 2012
Posts: 40
Default

I'll say that we've had good success spiking in 50% genomic DNA of an organism we needed sequenced anyway (or felt like getting data on), and giving our amplicon library a 12 bp random barcode on the front end. We had at least 92 libraries on our run -- it didn't make much sense to have fewer than that for the cost (we pre-cluster all the libraries in house, and hand over an "amplicon" tube that the center can prep as usual). Our forward read was great, with issues on the reverse. We're working around that now (double barcoding, or something else we're going to try, and perhaps publish on if it works) -- either way, we had enough data from the forward read to move ahead. These are 16S rDNA libraries, by the way -- primers from the ARB group's recent publication on designing better universal primers.
Old 02-19-2013, 03:14 AM   #7
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Yes, we usually have the problems with the second read. In fact, this was the first time I had seen a problematic 1st read but a good 4th read.

--
Phillip
Old 02-19-2013, 03:25 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Phillip: Curious to see if the quality patterns changed at all between the two runs for a specific sample. You were going to post quality plots.

I think the new version MCS v.2.1.1.13 has done the most so far to improve the qualities (along with the new batch of kits which are performing well) but we are not there yet.
Old 02-19-2013, 06:47 AM   #9
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by GenoMax View Post
Phillip: Curious to see if the quality patterns changed at all between the two runs for a specific sample. You were going to post quality plots.

I think the new version MCS v.2.1.1.13 has done the most so far to improve the qualities (along with the new batch of kits which are performing well) but we are not there yet.
Sorry, that is going to take a while longer. Our servers are completely hammered at the moment with a HiSeq run that just came off, and FastQC was hanging, so Rick had to kill off those processes.

Do you usually see differences between FastQC's assessment of the quality of a run and SAV's? I posted the SAV quality heat map.

--
Phillip
Old 02-19-2013, 07:34 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Quote:
Originally Posted by pmiguel View Post
Sorry that is going to take a while longer.
No Problem.

Quote:
Originally Posted by pmiguel View Post

Do you usually see differences between fastqc's assessment of the quality of a run and SAV's? I posted the SAVs quality heat map.

--
Phillip
SAV shows an average representation of the values for all samples. I am interested to see if the actual quality values changed from one run to the other for individual sample(s). Perhaps pick a sample that had an overall low mean Q-value (based on the demultiplex summary report). OTOH, you may not have any, if all your pooled samples look more or less the same.
Old 02-20-2013, 08:34 AM   #11
BBthekid007
Junior Member
 
Location: Nashville

Join Date: Feb 2010
Posts: 7
Default

We've been sequencing recombined human antibody genes, which are pretty low diversity, especially at the start of both paired reads. In our case, it's especially critical that we get good quality for most of the read length. The amplicons are about 400bp in length, and we must be able to merge the forward/reverse reads into a single amplicon -- unmerged reads are essentially useless.

We've had the same sort of low-diversity issues that the 16S folks have had, but came up with a different solution. We mostly use off-site sequencing providers, so we wanted our method to depend on sample prep as much as possible, to allow us flexibility in selecting providers (some were unwilling to perform the 'hard-core' hack mentioned above). What we did was "offset" the reads by inserting varying numbers of N's between the sequencing primer and the gene-specific amplification primer. It turns out that multiples of 2 N's work best (-NN-, -NNNN-, -NNNNNN-, etc.). Not sure why, but my guess is that adjacent clusters that are offset by only a single position can mess with phasing/prephasing calculations. Of course, this method entails making your own fusion primers, but that's something we were willing to do. In combination with other fairly common low-diversity techniques (high PhiX spike-in, lower cluster density), this approach has worked very well.
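A minimal sketch of the frame-shift idea described above (both sequences below are placeholders I made up, not real Illumina adapter or 16S primer sequences):

```python
# Placeholder sequences -- not real adapter or locus-specific primers.
SEQ_PRIMER_TAIL = "TCTACACGACGCTC"
LOCUS_PRIMER = "CCTACGGGAGGCAGCAG"

def offset_fusion_primers(max_ns=6, step=2):
    """Fusion primers with 0, 2, 4, ... N spacer bases, so that pooled
    amplicons from different primers read out-of-register with each other."""
    return [SEQ_PRIMER_TAIL + "N" * n + LOCUS_PRIMER
            for n in range(0, max_ns + 1, step)]

for primer in offset_fusion_primers():
    print(primer)
```

Splitting a sample pool evenly across these four primers means that at any given cycle, at most a quarter of the clusters are reading the same primer base.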

Here's what the Qscores look like without the offset primers:



And with the offset primers:

Old 02-20-2013, 11:40 PM   #12
Vinz
Member
 
Location: Germany

Join Date: Dec 2010
Posts: 80
Default

Phillip, thanks for your post. Do I understand correctly that you used 50% phiX? That would confirm our observation that phiX spiking is of limited effect with the v2 kits.

When not using hardcoded phasing we see pretty consistently what you are showing: read 4 somehow is better than read 1. This seems to be connected to the prephasing value. For some unknown reason, prephasing is calculated very high for the forward read and low for the reverse read.
Two non-hardcoded examples with about 6% phiX spike and amplicons (12 different ones):
Attached Images
File Type: jpg noHardcoded1.jpg (36.0 KB, 34 views)
File Type: jpg nohardcoded_summary1.jpg (60.8 KB, 31 views)
File Type: jpg noHardcoded2.jpg (36.9 KB, 18 views)
File Type: jpg nohardcoded_summary2.jpg (61.0 KB, 14 views)

Last edited by Vinz; 02-20-2013 at 11:48 PM.
Old 02-20-2013, 11:46 PM   #13
Vinz
Member
 
Location: Germany

Join Date: Dec 2010
Posts: 80
Default

When using a hardcoded matrix/phasing we get Q30 success rates above 75%, usually above 80%.
In contrast to what Illumina is saying, we see no positive effect of:
- spiking more than 10% phiX
- reducing cluster density (700 to 1000 K clusters/mm^2 seems to be fine)
Attached Images
File Type: jpg hardcoded1.jpg (35.2 KB, 28 views)
File Type: jpg hardcoded_summary2.jpg (60.8 KB, 22 views)
Old 02-21-2013, 04:10 AM   #14
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by BBthekid007 View Post
We've been sequencing recombined human antibody genes, which are pretty low diversity, especially at the start of both paired reads. In our case, it's especially critical that we get good quality for most of the read length. The amplicons are about 400bp in length, and we must be able to merge the forward/reverse reads into a single amplicon -- unmerged reads are essentially useless.

We've had the same sort of low-diversity issues that the 16S folks have had, but came up with a different solution. We mostly use off-site sequencing providers, so we wanted our method to be dependent on sample prep as much as possible, to allow us flexibility in selecting providers (some were unwilling to perform the 'hard-core' hack mentioned above). What we did was "offset" the reads by inserting varying numbers of N's between the sequencing primer and the gene-specific amplification primer. It turns out that multiples of 2 N's works best (-NN-, -NNNN-, -NNNNNN-, etc). Not sure why, but my guess is that adjacent clusters that are offset by only a single position can mess with phasing/prephasing calculations. Of course, this method entails making your own fusion primers, but that's something we were willing to do. In combination with other fairly common low-diversity techniques (high PhiX spike-in, lower cluster density), this approach has worked very well.

Here's what the Qscores look like without the offset primers:



And with the offset primers:

Yes, your libraries then become effectively diverse by your systematically offsetting them. That is one of the methods Illumina wants you to use.

If I were making the libraries myself, I would probably employ a method something like that. But, although it is simple enough to understand if you are intimately familiar with this aspect of Illumina instruments, I just feel like I am making the world a worse place to live in every time I try to explain this stuff to a customer. Things are complex enough without adding strange work-arounds to avoid bugs in an instrument system's design.

The real solution needs to come from Illumina, but they aren't going to bother doing it unless they get enough complaints.

--
Phillip
Old 02-21-2013, 04:23 AM   #15
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by Vinz View Post
Phillip, thanks for your post. Do I understand correctly, that you used 50% phiX? That would confirm our observation that phiX spiking is of limited effect with the v2 kits.

When not using hardcoded phasing we see pretty consistently what you are showing: read4 somehow is better than read1. This seems to be connected to the prephasing value. For some unknown reason, prephasing is calculated very high for the forward read and low for the reverse read.
2 non hardcoded examples with about 6% phiX spike and amplicons (12 different ones)
Sort of. I don't like to waste sequencing capacity on phiX, so I allow the customers to give us some genomic DNA they want sequenced and construct library(ies) from that.

We have a lot of "worst case" single amplicon projects, so I think we will continue spiking in 50% ballast libraries to help even those out. Also we will use hard coding.

Question: are your amplicons short enough to overlap the reads? For the run we describe above, the amplicons have 450 bp inserts. So for a paired read merge (Rick uses PANDA, but it seems like most people use FLASH), one would expect to need high quality sequence over the entire length of both reads to effect a good merge. However, mysteriously, we had very high rates of successful merges even though the quality drops very low past 180 bases for read 1.
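For intuition, the core of what the merge tools do can be sketched as a lowest-mismatch overlap search. This toy version is my own simplification, not PANDA or FLASH:

```python
def best_overlap(read1, read2_rc, min_olap=10):
    """Slide the reverse-complemented read 2 over the 3' end of read 1 and
    return (overlap_length, mismatches) for the lowest-mismatch overlap.
    A crude stand-in for what paired-read merge tools do internally."""
    best = None
    for olap in range(min_olap, min(len(read1), len(read2_rc)) + 1):
        mism = sum(a != b for a, b in zip(read1[-olap:], read2_rc[:olap]))
        if best is None or mism < best[1]:
            best = (olap, mism)
    return best

# Toy reads sharing a perfect 10 base overlap:
r1 = "ACGTACGTACGTAAACCC"
r2_rc = "ACGTAAACCCGGGTTT"
olap, mism = best_overlap(r1, r2_rc)
print(olap, mism)         # 10 0
print(r1 + r2_rc[olap:])  # the merged amplicon
```

Real tools additionally weigh quality values when the overlapping bases disagree, which is why underestimated QVs would not necessarily break the merge itself.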

This could be simply a case of the instrument mistakenly assigning low quality values while correctly assigning the base calls. However, as you can see from the graphs above, the phiX-calculated error rates become very high at the point where the quality values become low. My understanding is that these were empirically determined error rates. That is, that RTA actually aligns the reads to phiX and calculates the error rate from disagreements with the alignment at each base.

What do you think? Is RTA actually "cheating" and just using quality values to assign the error rate? Something else?

Are you able to merge your forwards/reverse reads?

--
Phillip
Old 02-21-2013, 04:48 AM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by pmiguel View Post


The arrow is a cartoon of the TCSA left adapter. The orange box denotes the 8 base "i5" index. The green box, as labelled, is some locus specific sequence. The 1st PCR primer fuses the locus-specific sequence with 33 bases of the proximate end of the TCSA adapter. The 2nd PCR primer overlaps the first by 20 bases.

You would also need the right adapter oligos. Basically the same design but with slightly different lengths.

For 96 indexes, you would want 8 i5 indexs and 12 i7 indexes. For 384 you would want 16 and 24, respectively.

I actually screwed up on the right-side oligos and included the reverse complements of the TCSA i7 indexes. But as long as one puts the right sequences in the sample sheet, everything works out okay.

--
Phillip
By the way, I have seen a couple of crazy results when the number of PCR cycles chosen is, well, crazy high. If you had time to optimize, I would suggest trying to use the minimum number of cycles for each PCR that gets you into the 2-10 nM range after clean-up and pooling.

Think about what you are doing! Remember each PCR cycle potentially doubles the amount of the initial template. So, say you are amplifying a 1 kb segment of a 1 billion bp genome. 1 ug of DNA from this organism will contain 1 pg of the segment of interest. How many PCR cycles, theoretically, do you need to obtain enough product?

10 cycles? 1 thousand-fold amplification (pg becomes ng).
20 cycles? 1 million-fold amplification (pg becomes ug).
30 cycles? 1 billion-fold amplification (pg becomes mg -- impossible, because your PCR will run out of primers, nucleotides, etc.)

Also, the purpose of the (2nd) step-out PCR is to just add the rest of the adapters -- 3 or 4 cycles should be plenty!

Of course PCR can't achieve a doubling of template concentration in each cycle -- but this gives you an idea of how heavy a hammer you are smashing your project with!
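The idealized arithmetic above, assuming a perfect doubling every cycle, can be checked directly:

```python
def ideal_fold_amplification(cycles):
    """Idealized PCR: each cycle doubles the template, so fold = 2^cycles."""
    return 2 ** cycles

# Starting from ~1 pg of target (a 1 kb locus within 1 ug of a 1 Gb genome):
start_pg = 1.0
for cycles in (10, 20, 30):
    fold = ideal_fold_amplification(cycles)
    print(f"{cycles} cycles: {fold}x -> {start_pg * fold / 1e6:.3g} ug")
```

2^10 = 1024, so "thousand-fold per 10 cycles" is the round-number version of the exact figure.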
--
Phillip
Old 02-21-2013, 05:02 AM   #17
Vinz
Member
 
Location: Germany

Join Date: Dec 2010
Posts: 80
Default

We actually have not tried to merge the bad runs (without hardcoding). But it might be the case that the Q values are underestimated. The hardcoded runs work well for merging. We have overlaps between 30 and 100 bp.
I am pretty sure that the phiX-calculated error rate is a true error rate. I have seen weird spikes in the error rate that did not match the Q-values. Therefore it cannot just be a cheated Q-value display.
Old 02-21-2013, 05:05 AM   #18
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Quote:
Originally Posted by pmiguel View Post

Question: are your amplicons short enough to overlap the reads? For the run we describe above, the amplicons have 450 bp inserts. So for a paired read merge (Rick uses PANDA, but seems like most people use FLASH), one would expect to need high quality sequence over the entire length of both reads to effect a good merge. However, mysteriously, we had very high rates of successful merges even though the quality drops very low past 180 bases for read 1.
This has been our observation as well. Read overlaps work adequately.

Quote:
Originally Posted by pmiguel View Post
This could be simple a case of the instrument mistakenly assigning low quality values while correctly assigning the base calls.
That has been my hunch right from the beginning. It seems as if the software just gives up on assigning quality values after it sees consecutive cycles of low-complexity nucleotide data.

The strategy described by "BBthekid007" works well. We have successfully used it.
Old 02-21-2013, 05:12 AM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049
Default

Quote:
Originally Posted by pmiguel View Post

The real solution needs to come from Illumina, but they aren't going to bother doing it unless they get enough complaints.

--
Phillip
I can assure you that this is not true. Unfortunately that is all I can say for the moment.
Old 02-21-2013, 03:42 PM   #20
LVAndrews
Member
 
Location: Flagstaff, AZ

Join Date: Sep 2012
Posts: 55
Default hardcoding question

Thanks for this excellent thread that seems to distill a lot of questions (and answers) to one place. My only question is, do you bother changing out your config.xml file for a genome run? Seems like this wouldn't be necessary at all.

AK
Tags
hard coding, low diversity library, miseq
