SEQanswers

Old 04-21-2014, 11:07 AM   #1
maxbangs
Junior Member
 
Location: Fayetteville, AR

Join Date: Apr 2014
Posts: 9
Default ddRAD RE selection Peterson et al.

My lab is about to embark on ddRAD following the Peterson et al. protocol. We are trying to decide which restriction enzymes (REs) to try. We want to get around 40,000 loci from single-end sequencing at 25x coverage, with 96 individuals per lane of a HiSeq 2000. The genome of the organism we are looking at is ~2.42 Gb with 38% GC content.

The problem I have now is choosing a pair of REs that gives us that number of loci within a reasonable size selection. Does anyone have suggestions on what combination to try?
Old 04-23-2014, 09:28 AM   #2
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 454
Default

I take it you don't have a reference to search and get the actual distribution of sites?

Without that, there are quite a few unknowns that could affect your final result. For example, there are many more SbfI sites in some fish compared to others even though they have similar GC % and genome size. But, you can still try to predict it as best you can!

Clearly you want more than 40,000 loci for your less-frequent cutter and then use the more-frequent cutter to get a portion of those. PstI gives you 300k sites, SbfI gives you 10k sites. I wouldn't want to go with a purely GC 6-cutter (~100k sites) since that probably targets non-random sequences more than a mixed site, and same with a pure AT 8-cutter like PacI. So I think your safest bet is PstI, combined with a balanced (AT/GC) 4-cutter that will typically cut every 300 bp, then get out the Pippin and make your size selection. If you cut a bit away from the peak of 4-cutter size distribution (say 150-200 insert size) then you can pick up the 15% of fragments you want and have a large enough region to try to avoid locus drop-outs from the different size selections.
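As a rough check on site counts like these, here is a minimal sketch that assumes bases are independent and uses the 38% GC content and 2.42 Gb genome size from the first post. The function name is made up for illustration, and real genomes can deviate a lot from this model.

```python
def expected_sites(site: str, genome_bp: float, gc: float) -> float:
    """Expected number of occurrences of a recognition site in a random genome.

    Assumption: independent bases with P(G) = P(C) = gc / 2
    and P(A) = P(T) = (1 - gc) / 2.
    """
    p_gc = gc / 2
    p_at = (1 - gc) / 2
    p_site = 1.0
    for base in site.upper():
        p_site *= p_gc if base in "GC" else p_at
    return genome_bp * p_site


GENOME_BP = 2.42e9  # genome size from the first post
GC = 0.38           # GC content from the first post

for name, site in [("PstI", "CTGCAG"), ("SbfI", "CCTGCAGG"),
                   ("MspI", "CCGG"), ("EcoRI", "GAATTC")]:
    print(f"{name:6s} {site:9s} ~{expected_sites(site, GENOME_BP, GC):,.0f} sites")
```

With these inputs the model gives roughly 300k PstI sites and about 11k SbfI sites, in line with the figures quoted above.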

You may need to devote more sequencing per sample depending on what proportion of your loci you want to be above the 25X read depth. Right now you have 40k x 25 x 96 = 96M reads. But you will probably see a 4-fold range in sequencing depth per sample, and a wide range of depths between different loci. You probably have to toss 25% of the reads as being repetitive as well. So for markers with complete data, the samples with lower reads may only have 5k loci above 25X depth.
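The read budget above can be sketched as follows. The repeat fraction of 25% is the figure mentioned in this post; the function name and the way the loss is applied are assumptions, and the sketch does not model the uneven depth between samples or loci.

```python
def reads_needed(loci: int, depth: int, samples: int,
                 repeat_fraction: float = 0.0) -> float:
    """Total reads required so that the useful reads cover loci x depth x samples.

    repeat_fraction is the share of reads assumed to be discarded as repetitive.
    """
    useful_reads = loci * depth * samples
    return useful_reads / (1 - repeat_fraction)


base = reads_needed(40_000, 25, 96)
padded = reads_needed(40_000, 25, 96, repeat_fraction=0.25)
print(f"no losses:        {base:,.0f} reads")
print(f"25% repeats lost: {padded:,.0f} reads")
```

The first figure reproduces the 96M reads above; allowing for 25% repetitive reads pushes the requirement to 128M before any allowance for uneven depth.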

If you want to start thinking about analysis, here is a good video from Cornell about some of the packages (and a warning about population statistics at the end): http://www.cornell.edu/video/rad-gbs...ing-strategies
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 05-17-2014, 08:44 AM   #3
maxbangs
Junior Member
 
Location: Fayetteville, AR

Join Date: Apr 2014
Posts: 9
Default Adapters

Thank you for your quick response. We are testing some enzymes now (PstI and EcoRI in combination with MspI and HpyCh4V). They seem to be working well and we are going to run them on a TapeStation next week.

I did have another question. Once we figure out the enzyme combination we are going to order the adaptors/primers from the Peterson et al. (2011) protocol. Does anyone have suggestions on where to order the oligos and if there is anything special we need to do with them?
Old 05-21-2014, 12:10 PM   #4
maxbangs
Junior Member
 
Location: Fayetteville, AR

Join Date: Apr 2014
Posts: 9
Default TapeStation results

So, I ran the samples on the TapeStation and am trying to figure out the number of loci per size selection. (I can post my results later once I figure out my size selection).

If I want to figure out the number of fragments for ddRAD, where the organism's genome size is ~2.5 Gb, the concentration of the whole sample is 19.6 ng/ul, and the size selection of 300 +/- 36 bp yields 0.47 ng/ul, does the following math work out?

2,500,000,000bps * .47ng / 19.6ng /300bps / 2 = 99,915 loci

Thus if I do a size selection on the Pippin Prep of 376 +/- 36 bp (adding 76 bp for adapters), I should get ~100,000 loci. Is this right? If so, why is this more than double what I expected based on Peterson's estimates and in silico size selection? This is EcoRI x MspI, by the way; I also have results for other combinations across a few different species.
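For anyone who wants to repeat the arithmetic with their own numbers, here is a minimal sketch of the calculation. The function and parameter names are made up for illustration. Note that it counts every fragment in the window, whichever enzymes cut its ends.

```python
def loci_from_mass_fraction(genome_bp: float, window_conc: float,
                            total_conc: float, mean_fragment_bp: float,
                            ploidy: int = 2) -> float:
    """Estimate loci in a size window from the share of DNA mass in that window.

    window_conc and total_conc must be in the same units (e.g. ng/ul).
    """
    bp_in_window = genome_bp * window_conc / total_conc
    fragments = bp_in_window / mean_fragment_bp
    return fragments / ploidy


estimate = loci_from_mass_fraction(2.5e9, 0.47, 19.6, 300)
print(f"{estimate:,.0f} loci")
```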
Old 05-21-2014, 12:31 PM   #5
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 454
Default

That makes sense (your calculation). There's some minor change in fragment number on the small side of the average compared to the large side, but with a tight selection it won't matter so much. Did you have an actual genome reference for the in silico digest?
Old 05-21-2014, 12:50 PM   #6
maxbangs
Junior Member
 
Location: Fayetteville, AR

Join Date: Apr 2014
Posts: 9
Default

We do not have the genome for the fish I am looking at but it is a close(ish) relative to Danio which has a published genome. The one problem is that the fish I am looking at has had a genome duplication since its split with Danio. So we are just doubling the in silico estimates for Danio and hoping it is at least a rough ball park estimate. Estimates also fit with the estimates from Peterson.

Any idea on why we are so far off? Also we ran the following RE combinations and got the following estimates for number of loci:
EcoRI x PstI = 41,667 loci
EcoRI x MspI = 99,915 loci
PstI x MspI = 118,912 loci
PstI x HpyC4IV = 131,340 loci

Based on this I was thinking about going with EcoRI x PstI so that I do not have too narrow of a size selection, but it seems odd to go with two six base cutters.
Old 05-21-2014, 01:41 PM   #7
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 454
Default

It might just be genome variation. Here are (rough) estimates of PstI numbers in zebrafish:
[genomes]$ cat Daniozv9.fa | tr -d "\n" | grep -io -E "CTGCAG" | wc -l
90066
and stickleback:
[genomes]$ cat Gasterosteus_aculeatus.BROADS1.61.dna_rm.toplevel.fa | tr -d "\n" | grep -io -E "CTGCAG" | wc -l
141198

More sites in stickleback, even though it has a genome size of 450 Mbp and zebrafish is 1.5 Gbp! Stickleback has a slightly higher GC content (42% vs 38%), but that is not enough to explain the difference. So I would trust the empirical data (TapeStation) more than an in silico search of a related genome, as long as you trust your TapeStation to give accurate results.
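If you would rather not rely on the one-liners, the same count can be sketched in Python. This version is only an illustration with a made-up function name. Unlike the `tr`/`grep` pipeline, it treats each FASTA record separately, so header text is never searched and a site cannot be formed by joining two records. PstI's site is palindromic, so searching one strand is enough.

```python
import re


def count_sites(fasta_lines, site: str) -> int:
    """Count non-overlapping, case-insensitive matches of a site per FASTA record."""
    pattern = re.compile(re.escape(site), re.IGNORECASE)
    total = 0
    record = []  # sequence lines of the record currently being read
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: count sites in the one just finished.
            total += len(pattern.findall("".join(record)))
            record = []
        elif line:
            record.append(line)
    # Count the final record.
    total += len(pattern.findall("".join(record)))
    return total


# Tiny in-memory example; the first site spans a line break within chr1.
demo = [">chr1", "AACTG", "CAGTT", ">chr2", "ctgcagCTGCAG"]
print(count_sites(demo, "CTGCAG"))
```

For a real genome you would pass an open file handle instead of the list, for example `count_sites(open("genome.fa"), "CTGCAG")`, where the file name is a placeholder.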
Old 05-21-2014, 02:32 PM   #8
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 346
Default

Quote:
Originally Posted by maxbangs View Post
I did have another question. Once we figure out the enzyme combination we are going to order the adaptors/primers from the Peterson et al. (2011) protocol. Does anyone have suggestions on where to order the oligos and if there is anything special we need to do with them?
It is very convenient if the manufacturer provides the oligos plated and already normalized to a standard concentration. I believe most manufacturers will do that (LifeTechnologies, IDT, Bioneer, ...), and that would save a ton of work.
Old 05-21-2014, 05:37 PM   #9
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,196
Default

Quote:
If I want to figure out the number of fragments for ddRAD, where the organism's genome size is ~2.5 Gb, the concentration of the whole sample is 19.6 ng/ul, and the size selection of 300 +/- 36 bp yields 0.47 ng/ul, does the following math work out?

2,500,000,000bps * .47ng / 19.6ng /300bps / 2 = 99,915 loci
It seems to me that you have calculated the number of fragments with an average size of 300 +/- XX bp resulting from your digestion. The sequence-able portion of that number will be the fragments that are flanked by the restriction sites (overhangs) of both enzymes. Fragments that are flanked at both ends by the site of only one enzyme will not contribute to your library or to the reads you obtain from sequencing.
Peterson et al. have described a protocol for estimating the number of sequence-able fragments in their supplementary material, although they have not used it in their paper or described how it correlates with real data. I have great doubts about the practicality and accuracy of their described method. I will post my reasons in detail later. In the meantime I am very interested to see your TapeStation results for your digests and your rationale for the way you have estimated fragment numbers in your target range.
Old 05-21-2014, 07:28 PM   #10
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 454
Default

Good point! I was thinking that EcoRI would cut every 3kb so the problem would be negligible, but there will be plenty of MspI - MspI fragments in the 300 bp range that will not become ddRAD loci. The PstI - EcoRI double digest should be more accurate, but I guess even there for every Pst-EcoRI fragment you'll have an equal number of PstI-PstI fragments and equal number of EcoRI-EcoRI fragments in the mix. Actually, many more of the EcoRI-EcoRI fragments given the GC content.
Old 05-22-2014, 11:45 AM   #11
maxbangs
Junior Member
 
Location: Fayetteville, AR

Join Date: Apr 2014
Posts: 9
Default TapeStation results

Thank you for your comments nucacidhunter and SNPsaurus.
I guess I should explain my math. I was trying to calculate the number of expected loci for a given size selection (300 +/- 36 bp), so I calculated the proportion of DNA in the size selection (0.47 ng / 19.4 ng) and then multiplied by the size of the genome (2.42 Gb) to get the number of bp within the selection. I then divided by the average size of the fragments (300 bp) to get the number of fragments within the size selection. Finally I divided by 2 to get the number of loci, since the organism is diploid. However, as nucacidhunter pointed out, I forgot to take into account that some fragments are not sequence-able (e.g. EcoRI-EcoRI and MspI-MspI). To account for this I am just going to divide by 2 again (thus assuming ~50% of the fragments are sequence-able).

2,420,000,000bps * 0.47ng / 19.4ng / 300bps / 2 / 2 = 48,857 loci

I know it is a big assumption that 50% of the fragments are sequence-able, but after running some simulations this seems to work as long as you are not selecting any place where the slope of the distribution of fragments is high (this includes the slope for all three fragment types). Thus if I go with a size selection around 300 bp I get a fairly consistent estimate that matches my in silico estimates. However, if I do the same for a size selection around 224 +/- 36 bp I get a much more erratic result.
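For readers who want to try this kind of check themselves, here is a minimal simulation sketch. It is not the simulation I ran: it simply scatters the sites of two enzymes at random (exponential gaps) along a made-up genome and counts fragment types in a size window. The spacings, genome length, and function name are all assumptions for illustration.

```python
import random


def simulate_mixed_fraction(genome_bp, spacing_a, spacing_b, window, seed=1):
    """Simulate a double digest with randomly placed sites.

    spacing_a / spacing_b: mean distance (bp) between sites of enzyme A / B.
    window: (min_bp, max_bp) fragment sizes kept by size selection.
    Returns (counts by end type, fraction of kept fragments with one A and one B end).
    """
    rng = random.Random(seed)
    sites = []
    for label, spacing in (("A", spacing_a), ("B", spacing_b)):
        pos = rng.expovariate(1 / spacing)
        while pos < genome_bp:
            sites.append((pos, label))
            pos += rng.expovariate(1 / spacing)
    sites.sort()

    lo, hi = window
    counts = {"AA": 0, "AB": 0, "BB": 0}
    # Each pair of neighbouring cut sites defines one fragment.
    for (p1, l1), (p2, l2) in zip(sites, sites[1:]):
        if lo <= p2 - p1 <= hi:
            counts["".join(sorted(l1 + l2))] += 1
    total = sum(counts.values())
    return counts, (counts["AB"] / total if total else 0.0)


# Hypothetical EcoRI-like (every ~3 kb) and MspI-like (every ~700 bp) cutters.
counts, mixed = simulate_mixed_fraction(20_000_000, 3000, 700, (264, 336))
print(counts, f"mixed fraction = {mixed:.2f}")
```

In this idealized model the end types do not depend on fragment length, so the mixed fraction is driven mainly by the ratio of site densities. Two equally frequent cutters give about 50%, and unequal cutters give less.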

So you may be wondering why I am doing this and not just going with the in silico estimates. Most of the organisms that I work with do not have a reference genome for any closely related species. Thus, I want to develop a system to estimate the number of fragments completely de novo.

As requested, I tried to attach the TapeStation results. However, the .doc file and the raw result file are each too large. If you want the results to look at, just send me an email at mbangs@uark.edu.

If you do want the results, there are two files: 1) the .doc file of results given to me by the facility and 2) the raw data from the TapeStation. You may notice that there is a warning for some samples that the concentration is too low. I thought the concentrations needed to be between 1 ng/ul and 50 ng/ul (as for normal D1000 DNA tapes), but since we were using the genomic DNA tape the concentrations were supposed to be >20 ng/ul. They still ran fine, and the total concentration from the TapeStation matches that of the Qubit. If you want to play with the raw data you can download the program for free. If anyone wants to know the total concentrations from the Qubit, the genome sizes of the fish (four species) we used, or the different RE combinations we used (four combinations), just let me know. This message is already way too long.

Hope this is helpful, and thank you for the fast responses.
Old 05-23-2014, 04:19 PM   #12
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 454
Default

I would still worry about MspI-MspI fragments. That's 4 nucs that are GC, in a genome that is 38% GC, so you'll get a MspI site every 700 bp or so. If there is a 699/700 chance of not getting that site at some particular nuc, then in a 72 bp region there is a 10% chance of seeing a site (so if you have a MspI site somewhere, then 10% of those will have another MspI site 300 bp away). The same logic applies to MspI sites near EcoRI or PstI sites. But there are 10X more MspI sites to start with than PstI, so you need to divide by more than 2. EcoRI is better (4 fold more MspI sites).
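The back-of-the-envelope numbers above can be sketched like this, assuming independent bases and treating each position as an independent chance of starting a site. The function name is made up for illustration.

```python
def prob_site_in_window(mean_spacing_bp: float, window_bp: int) -> float:
    """Chance of at least one site within window_bp positions.

    Assumption: each position starts a site independently with
    probability 1 / mean_spacing_bp.
    """
    p = 1 / mean_spacing_bp
    return 1 - (1 - p) ** window_bp


# MspI (CCGG) in a 38% GC genome: P(C) = P(G) = 0.19.
mspi_spacing = 1 / (0.19 ** 4)
print(f"expected MspI spacing: ~{mspi_spacing:.0f} bp")

# Chance of a second MspI site landing in a 72 bp window (300 +/- 36 bp away).
print(f"P(site in 72 bp window): {prob_site_in_window(700, 72):.3f}")
```

The spacing comes out a little above 700 bp and the window probability just under 10%, matching the rough figures above.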
Old 05-23-2014, 08:12 PM   #13
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,196
Default

Based on my experience I would advise against using gDNA ScreenTape for sizing your fragments. Sizing is load dependent: serial dilutions of the same sample (within the range specified for the tape) will give different size outcomes. In addition, the approach that you are taking (estimating fragment number based on the digestion result) may hinder successful library prep on some occasions. If the size window that you are selecting comprises fragments from repeat regions and organelles, you may not have many useful SNPs to call.

Quote:
So you may be wondering why I am doing this and not just going with the in silico estimates. Most of the organisms that I work with do not have a reference genome for any closely related species. Thus, I want to develop a system to estimate the number of fragments completely de novo.
The issue with the in silico approach is that during actual size selection with the Pippin, eluted fragments will differ from the set point because one end of each fragment carries a Y-shaped adapter, and that affects the migration speed. Another issue is that Pippin size selection is also load dependent, so one can expect different results depending on the amount of DNA loaded.

Last edited by nucacidhunter; 05-23-2014 at 08:46 PM.
Old 05-23-2014, 10:59 PM   #14
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 454
Default

nucacidhunter, I have often wondered about the repeats in a particular size range issue. When my lab was working on RAD-Seq, one of the reasons we liked having one side of the RAD tag be sheared is that we saw talks from people doing RRL and how they spent so much time checking different size ranges for repeat content... it took longer to decide on a size range than to do the actual experiment. But I don't hear about that from ddRAD or GBS talks. Is it just that sequencing has gotten cheaper and it isn't worth getting fussed over losing 25M reads to repeats?

Good point about the Y-adapters as well. Without a good reference genome it is hard to feel that confident that the number of sites will translate from a related genome anyway, so sometimes a person has to just plunge ahead and try it!
Old 05-24-2014, 04:17 PM   #15
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,196
Default

In GBS methods (single or double digest) one can recognise the presence of amplicons from repeat regions or organelle fragments in the prepped library and decide not to proceed to sequencing. I believe this is one aspect that people look at in the establishment phase for a new species. Users often utilise only 50% of their GBS reads because of low coverage and a low number of common loci among samples, but they are still happy getting around 1K polymorphic loci from their data. Some users repeat their samples in their submission to increase coverage. Obviously, the number of useful fragments depends on the study aim, population type, existence of a reference genome and other factors. ddRAD as it has been described leaves the sampling of repeat regions to chance.

RAD-Seq is the most comprehensive and probably the most consistent version of these methods, but library prep costs can be prohibitive and can exceed sequencing costs.

Last edited by nucacidhunter; 05-24-2014 at 05:29 PM.
Old 06-16-2016, 06:39 AM   #16
Gislaine
Junior Member
 
Location: São Paulo, Brazil

Join Date: May 2016
Posts: 1
Default

Quote:
Originally Posted by maxbangs View Post
Thank you for your comments nucacidhunter and SNPsaurus.
I guess I should explain my math. I was trying to calculate the number of expected loci for a given size selection (300 +/- 36 bp), so I calculated the proportion of DNA in the size selection (0.47 ng / 19.4 ng) and then multiplied by the size of the genome (2.42 Gb) to get the number of bp within the selection.
Hi!

Could you tell me whether these concentration values (ng/ul) are from real data? I am doing ddRAD and I found similar concentrations, but I haven't done qPCR yet to see whether the final concentration is good enough for sequencing.

Cheers