SEQanswers





Old 02-16-2010, 01:19 AM   #1
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181
Agilent SureSelect - coverage of high GC regions

We have successfully run a targeted enrichment with SureSelect and achieved results similar to those of Tewhey et al. (2009, Genome Biol) for our own targeted subset. As shown in their paper, we also noticed that regions of high GC content were difficult to capture - we see lower read coverage in these areas. Does anyone have experience trying to increase the coverage of these more difficult regions? Say, for example, by increasing the number of baits overlapping a high-GC region?

We are wondering whether this is a worthwhile approach and whether by chance anyone has already tried it with useful results. We have some spare design space on our SureSelect and are considering "piling on" baits in these regions for a few important genes.
Old 02-16-2010, 04:50 AM   #2
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

I would be very interested in how you fare with this.

One possible explanation for dropout of extremes of %GC is not so much the SureSelect hybridization but the various PCR steps. Do you think you could significantly shave the total number of PCR cycles the library is exposed to?
Old 02-16-2010, 06:31 AM   #3
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317

Quote:
Originally Posted by NGSfan View Post
We have successfully run a targeted enrichment with SureSelect and were able to achieve similar results to the Tewhey et al (2009 Genome Biol) for our own targeted subset. As shown in their paper, we also noticed that regions of high GC content were difficult to capture - we see lower read coverage in these areas. Does anyone have any experience trying to increase the coverage of these more difficult regions? Say, for example, by increasing the number of baits overlapping a high GC region?

We are wondering if this is a worthwhile approach and if by chance anyone has tried it already with useful results. We have some extra design space on our SureSelect and are considering "piling on" the baits in these regions for a few important genes.
How do you define the sequence coverage? If you take the log of the coverage, what do the results look like?
Old 02-16-2010, 07:02 AM   #4
upenn_ngs
Member
 
Location: philadelphia

Join Date: Sep 2009
Posts: 70

The dropout at high GC content is largely due to secondary-structure formation. Adding formamide and increasing the hybridization temperature might tilt the balance toward hybridization with the RNA oligos.
Old 02-17-2010, 06:05 AM   #5
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by krobison View Post
I would be very interested in how you fare with this.

One possible explanation for dropout of extremes of %GC is not so much the SureSelect hybridization but the various PCR steps. Do you think you could significantly shave the total number of PCR cycles the library is exposed to?

For the PCR steps, I mostly worry about "PCR duplicates". And if I remember correctly, wouldn't PCR bias the coverage in favor of high GC?

http://www.ncbi.nlm.nih.gov/pubmed/18660515
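The duplicate worry above is usually addressed by flagging reads that share the same 5' alignment position. A minimal sketch of that idea (a toy model: real tools such as Picard MarkDuplicates also consider mate positions and base qualities):

```python
def mark_duplicates(reads):
    """Flag reads sharing a 5' alignment position as PCR duplicates.

    `reads` is a list of (chrom, pos, strand) tuples; the first read
    seen at each position is kept, later ones are flagged True.
    """
    seen = set()
    flags = []
    for read in reads:
        flags.append(read in seen)
        seen.add(read)
    return flags

reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 150, "-")]
print(mark_duplicates(reads))  # → [False, True, False]
```

A high duplicate rate after this kind of collapsing would point at over-amplification in the library PCR.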
Old 02-17-2010, 06:08 AM   #6
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by Xi Wang View Post
How do you define the sequence coverage? If you take log values of the coverage, what is the results?
Good question - I am just doing this "by eye", so to speak. For example, the average per-bp coverage of a target region is 20X and then drops to 2X or 0 in high-GC regions.
Old 02-17-2010, 06:13 AM   #7
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by upenn_ngs View Post
The drop in high-GC content is largely from secondary structure formation. Adding formamide and increasing the temperature might tilt the table toward hybridization with RNA oligo.
The secondary-structure issue would be my guess as well. My only concern with adding formamide and/or raising the temperature is the effect on lower-GC targets. I like the idea, but shifting the binding energies might cause as many problems as it solves.

The idea of adding more baits was to help increase coverage of a subset of targets without affecting the enrichment of other targets.
Old 02-17-2010, 07:03 AM   #8
upenn_ngs
Member
 
Location: philadelphia

Join Date: Sep 2009
Posts: 70

Another factor: many GC-rich regions drop out of whole-genome sequencing as well as exome capture. This image is from the Broad.

http://www.postimage.org/image.php?v=aV6cBnA

Last edited by upenn_ngs; 02-17-2010 at 07:05 AM.
Old 02-17-2010, 07:24 AM   #9
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317

Quote:
Originally Posted by NGSfan View Post
Good question - I am just doing this "by eye" so to speak. So for example, the average bp coverage of a target region is 20X and then drops to 2 or 0 in high GC regions.
I suspect that if you take the log of the per-bp coverage for each region and then average, the picture will look different. I am wondering whether this is because PCR amplification increases the number of DNA fragments exponentially.
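For what it's worth, the difference between averaging raw coverage and averaging log coverage can be sketched with toy numbers (illustrative values only, not from the data discussed in this thread):

```python
import math

# Toy per-region mean coverages: most targets near 20x, two
# GC-rich regions dropping to ~2x (illustrative numbers only).
coverage = [20, 22, 18, 21, 2, 2]

arith_mean = sum(coverage) / len(coverage)
# The mean of the logs is the log of the geometric mean; it
# down-weights the multiplicative spread that exponential PCR
# amplification can introduce.
log_mean = sum(math.log2(c) for c in coverage) / len(coverage)
geo_mean = 2 ** log_mean

print(f"arithmetic mean: {arith_mean:.1f}x")
print(f"geometric mean:  {geo_mean:.1f}x")
```

The two summaries diverge noticeably once a few regions drop out, which is one way to make the GC dropout visible in aggregate statistics rather than "by eye".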
Old 02-19-2010, 06:04 AM   #10
dottomarco
Member
 
Location: Padova ITALY

Join Date: Jul 2009
Posts: 32

Quote:
Originally Posted by NGSfan View Post
We have successfully run a targeted enrichment with SureSelect and were able to achieve similar results to the Tewhey et al (2009 Genome Biol) for our own targeted subset.
@ NGSfan : What sequencer did you use? Illumina GA II?
I am interested in using Agilent's SureSelect for sequence enrichment to get targets to sequence on a 454 FLX. Do you think a long-fragment 454 library could create problems with the SureSelect hybridization? Agilent does not provide an official protocol for 454 libraries, but I assume their long baits could work well with our ~400-500 bp fragments.
Old 02-19-2010, 11:10 AM   #11
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

The main platform-specific customization of SureSelect, as I understand it, is the blocking oligos that prevent daisy-chaining of products -- without them, a correctly hybridized fragment will sometimes hybridize to an off-target fragment via the adapter regions.
Old 03-11-2010, 03:57 AM   #12
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by Xi Wang View Post
I guess if you take log value of the bp coverage for each region, and then take the average, the phenomenon will be different. I am just wondering the amplification is exponentially increased the DNA fragments.
Yes, that is a good point. Amplification is a concern and will certainly bias things. However, I am not seeing PCR duplicates as a big issue in my data set.

I have talked to some big sequencing centers about the GC issue and they have also encountered it; however, their approach is simply to bump up the sequencing - to 70X coverage (we are at 30-40X).

I should have mentioned that we did single-end reads. We should be getting paired-end reads soon, and I hope this might help a little, since we'll be able to sequence a GC-rich region whose fragment was captured at the other end, where the GC content is average. Maybe?

Last edited by NGSfan; 03-11-2010 at 03:59 AM.
Old 03-11-2010, 04:05 AM   #13
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by dottomarco View Post
@ NGSfan : What sequencer did you use? Illumina GA II?
I am interested in using Agilent's SureSelect for sequence enrichment to get the targets to sequence with a 454 FLX. Do you think using a long fragmented 454 library with SureSelect can create any problem with the hybridization? Agilent do not provide any ufficial protocol for 454 libraries, but I assume that their long baits could work well with our ~400-500 bp fragments.
Their long baits (~120bp) are quite reasonable - but I have no idea how hybridization behaves with longer fragments (400-500bp). I suspect you might run into self-hybridization issues more often, but who really knows!

We generally followed the Agilent protocol - fragmenting to ~200bp. Something to note: if you are after exons, you don't want too long a fragment, because you'll be sequencing from the fragment ends and your aligned reads will more often be "off target", so to speak - around the exon rather than on it.
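The fragment-length point above can be illustrated with a toy geometric model (a fragment centered on the exon; the exon length and fragment lengths are illustrative, not from any protocol):

```python
def on_exon_fraction(exon_len, frag_len):
    """Fraction of fragment bases overlapping the exon, for a
    fragment centered on the exon (toy model, no randomness)."""
    return min(exon_len, frag_len) / frag_len

# A ~150bp exon with 200bp vs 500bp fragments: the longer
# fragment spends a much larger share of its bases off-exon.
for frag in (200, 500):
    print(frag, round(on_exon_fraction(150, frag), 2))
```

With 200bp fragments, 75% of the bases sit on the exon; with 500bp fragments only 30% do, so more of the sequencing is spent on flanking intronic sequence.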
Old 03-11-2010, 05:56 AM   #14
Xi Wang
Senior Member
 
Location: MDC, Berlin, Germany

Join Date: Oct 2009
Posts: 317

Quote:
Originally Posted by NGSfan View Post
Yes, that is a good point. Amplification is a concern and will certainly bias things. However, I am not seeing PCR duplicates to be a big issue in my data set.

I have talked to some big sequencing centers about the GC issue and they also have encountered it, however their approach is to simply bump up the sequencing - to 70X coverage (we are at 30-40X).

I should have mentioned we did single end reads. We should be getting paired end reads soon, and I hope this might help a little, since we'll be able to sequence a GC-rich region which was partially bound at the other end with an average GC content. Maybe?
But if you have the data, you could try what I just mentioned.

As for PE reads, I don't think they will improve things much, because it is the DNA fragments that get amplified, so coverage should depend on the GC content of the fragments. On the other hand, read GC content and fragment GC content are highly correlated, so the relationship between read GC content and coverage already largely reflects the reality.
Old 03-11-2010, 08:32 AM   #15
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Another point I did not see raised here: the number of reads you actually have to sequence to get 30X exome coverage with the Agilent capture.

We notice that only 20% of reads map on target! Is that common? (Illumina 75bp PE)
Old 03-11-2010, 08:34 AM   #16
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Quote:
Originally Posted by upenn_ngs View Post
Another factor, many GC rich regions are dropped from both whole genome sequencing as well as the exome capture. This image from the Broad.

http://www.postimage.org/image.php?v=aV6cBnA
Do you know what the WGS track is, and where one can obtain it for IGV?
Old 03-17-2010, 02:16 AM   #17
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by Xi Wang View Post
Oh. But if you have the data, you can try what just I mentioned.

And for PE reads, I don't think it can improve a lot. Because it is the DNA fragments that amplified. So the coverage should have some relationship with the GC-content of the DNA fragments. On the other hand, the read GC-content and the DNA fragment GC-content have a high correlation. As a result, the relationship between the read GC-content and the coverage reflects a lot the reality.
I'm not clear on why converting the read coverage to a log scale would help in understanding the distribution. Visualizing coverage on a log scale just changes the scale you're looking at, no?

Maybe I'm not seeing the advantage. Could you show me an example?
Old 03-17-2010, 02:17 AM   #18
if1
Junior Member
 
Location: Italy

Join Date: Nov 2009
Posts: 2

Quote:
Originally Posted by bioinfosm View Post
...
We notice that only 20% of reads map on-target! Is that a common thing? (Illumina 75bp PE)
Hi,
we have also run a targeted enrichment using SureSelect with Illumina 76bp single-end reads, and we obtained results similar to Tewhey et al. (Genome Biol 2009): 50% of uniquely aligned reads were on target, with capture uniformity similar to what was reported in the paper.
I am wondering if anyone has results for Illumina 76bp paired-end, as the Agilent website suggests the % on target should increase from 50% to 70% with the PE protocol.

Thanks
Old 03-17-2010, 02:33 AM   #19
NGSfan
Senior Member
 
Location: Austria

Join Date: Apr 2009
Posts: 181

Quote:
Originally Posted by bioinfosm View Post
Another point which I did not notice here is, # of reads actually sequenced to get 30x exome coverage for the agilent capture stuff.

We notice that only 20% of reads map on-target! Is that a common thing? (Illumina 75bp PE)
Be careful with Tewhey's enrichment calculation: they report 48% "on or near target", meaning within +/- 150bp of the target! If you look in the text, they actually got only 37% "on target", so while your 20% is low, it's not terrible in comparison. How much of the genome are you targeting?

In our case, with 76-bp single-end reads, we targeted 0.09% of the genome and enriched to 35% of sequences being "on target", which is a ~390-fold enrichment. If you convert Tewhey's numbers to strictly "on target" (from 0.12% to 37%), their claim of "about 400-fold enrichment" is actually ~290-fold. Just a small criticism.
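The fold-enrichment arithmetic can be sketched as follows (using the 0.09% target and 35% on-target figures from this post; the function name is just for illustration):

```python
def fold_enrichment(on_target_frac, target_genome_frac):
    """Fold enrichment: fraction of reads on target divided by
    the fraction of the genome that was targeted."""
    return on_target_frac / target_genome_frac

# 0.09% of the genome targeted, 35% of sequence on target.
print(round(fold_enrichment(0.35, 0.0009)))  # → 389
```

This is why the definition of "on target" matters so much: widening the target by a flank both raises the on-target fraction and raises the targeted genome fraction, and the two effects partly cancel in the ratio.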

We have just completed a 76-bp paired end run with 4 samples multiplexed - I will let you know what we get with our alignment results

Last edited by NGSfan; 03-17-2010 at 02:45 AM.
Old 03-18-2010, 12:03 PM   #20
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Thanks, NGSfan.

The "on or near target" number, taking +/- 200bp, goes to 15% - still pretty low, I would guess.

It would be interesting to check where the remaining 85% of reads went!