  • ddRAD RE selection Peterson et al.

    My lab is about to embark on ddRAD following the Peterson et al. protocol. We are trying to decide on restriction enzymes (RE) to try. We want to get around 40,000 loci from single end sequencing with 25x coverage with 96 individuals per lane of HiSeq2000. The organism we are looking at is ~2.42Gb with 38% GC content.

    The problem I have now is choosing a set of RE to give us said number of loci in a reasonable size selection. Does anyone have any suggestions on what combination to try?

  • #2
    I take it you don't have a reference to search and get the actual distribution of sites?

    Without that, there are quite a few unknowns that could affect your final result. For example, there are many more SbfI sites in some fish compared to others even though they have similar GC % and genome size. But, you can still try to predict it as best you can!

    Clearly you want more than 40,000 loci for your less-frequent cutter and then use the more-frequent cutter to get a portion of those. PstI gives you 300k sites, SbfI gives you 10k sites. I wouldn't want to go with a purely GC 6-cutter (~100k sites) since that probably targets non-random sequences more than a mixed site, and same with a pure AT 8-cutter like PacI. So I think your safest bet is PstI, combined with a balanced (AT/GC) 4-cutter that will typically cut every 300 bp, then get out the Pippin and make your size selection. If you cut a bit away from the peak of 4-cutter size distribution (say 150-200 insert size) then you can pick up the 15% of fragments you want and have a large enough region to try to avoid locus drop-outs from the different size selections.
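The site counts quoted above can be roughly reproduced from base composition alone. The following is a minimal sketch, assuming independent bases with G = C and A = T frequencies set by the GC content (real genomes deviate from this, as noted above). The function name `expected_sites` is illustrative and not from any package.

```python
def expected_sites(site: str, genome_bp: float, gc: float) -> float:
    """Expected number of occurrences of a recognition site, assuming
    independent bases with frequencies fixed by the GC fraction."""
    p_gc = gc / 2.0          # frequency of G (and of C)
    p_at = (1.0 - gc) / 2.0  # frequency of A (and of T)
    p = 1.0
    for base in site.upper():
        # Only unambiguous A/C/G/T sites are handled in this sketch.
        p *= p_gc if base in "GC" else p_at
    return p * genome_bp


genome_bp = 2.42e9  # genome size from the original question
gc = 0.38           # GC content from the original question

enzymes = [("PstI", "CTGCAG"), ("SbfI", "CCTGCAGG"),
           ("EcoRI", "GAATTC"), ("MspI", "CCGG")]
for name, site in enzymes:
    n = expected_sites(site, genome_bp, gc)
    print(f"{name:6s} {site:9s} ~{n:,.0f} expected sites")
```

Under these assumptions PstI comes out near 300k sites and SbfI near 10k, in line with the figures in the post. The actual counts in a given genome can differ substantially.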

    You may need to devote more sequencing per sample depending on what proportion of your loci you want to be above the 25X read depth. Right now you have 40k x 25 x 96 = 96M reads. But you will probably see a 4-fold range in sequencing depth per sample, and a wide range of depths between different loci. You probably have to toss 25% of the reads as being repetitive as well. So for markers with complete data, the samples with lower reads may only have 5k loci above 25X depth.
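The read budget above can be written out as a short calculation. This is only a sketch: the 25% discard fraction is the rough figure from the post, and `reads_per_lane` is a hypothetical helper name. It gives the mean depth only and does not model the uneven depth between samples and loci.

```python
def reads_per_lane(n_loci, depth, n_samples, discard_fraction=0.0):
    """Raw reads needed so that the reads kept after discarding a given
    fraction still give the target mean depth per locus per sample."""
    usable_reads = n_loci * depth * n_samples
    return usable_reads / (1.0 - discard_fraction)


ideal = reads_per_lane(40_000, 25, 96)
with_repeats = reads_per_lane(40_000, 25, 96, discard_fraction=0.25)

print(f"no losses:          {ideal / 1e6:.0f}M reads")
print(f"25% reads repeated: {with_repeats / 1e6:.0f}M reads")
```

With no losses this gives the 96M reads from the post. Discarding a quarter of the reads raises the requirement to 128M before any allowance for uneven coverage.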

    If you want to start thinking about analysis, here is a good video from Cornell about some of the packages (and a warning about population statistics at the end): http://www.cornell.edu/video/rad-gbs...ing-strategies
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com


    • #3
      Thank you for your response. It was very useful.

      Do you have any suggestions for a balanced (AT/GC) 4-base cutter? I was thinking of testing PstI and EcoRI with MspI, but the MspI site is all GC. Is this a problem?


      • #4
        Adapters

        Thank you for your quick response. We are testing some enzymes now (PstI and EcoRI in combination with MspI and HpyCH4V). They seem to be working well, and we are going to run them on a TapeStation next week.

        I did have another question. Once we figure out the enzyme combination, we are going to order the adaptors/primers from the Peterson et al. (2012) protocol. Does anyone have suggestions on where to order the oligos and whether there is anything special we need to do with them?


        • #5
          TapeStation results

          So, I ran the samples on the TapeStation and am trying to figure out the number of loci per size selection. (I can post my results later once I figure out my size selection).

          If I want to estimate the number of fragments for ddRAD, where the organism's genome size is ~2.5 Gbp, the concentration of the whole sample is 19.6 ng/ul, and a size selection of 300 +/- 36 bp yields 0.47 ng/ul, does the following math work out?

          2,500,000,000 bp * 0.47 ng / 19.6 ng / 300 bp / 2 = 99,915 loci

          Thus, if I do a size selection on the Pippin Prep of 376 +/- 36 bp (adding 76 bp for adapters), then I should get ~100,000 loci. Is this right? If so, why is this more than double what I expected based on Peterson's estimates and the in silico size selection? This is EcoRI x MspI, by the way; I also have results for other combinations across a few different species.
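The arithmetic in this post can be checked with a few lines. This is a sketch of the poster's formula as stated, not a validated method: it treats the mass fraction in the size window as the fraction of the genome in that window, and it counts every fragment regardless of which enzymes cut its ends. `fragments_in_window` is a hypothetical name.

```python
def fragments_in_window(genome_bp, window_ng, total_ng, mean_fragment_bp):
    """Estimate the number of fragments in a size window from the share
    of DNA mass that falls inside the window."""
    mass_fraction = window_ng / total_ng
    bp_in_window = genome_bp * mass_fraction
    return bp_in_window / mean_fragment_bp


fragments = fragments_in_window(
    genome_bp=2.5e9, window_ng=0.47, total_ng=19.6, mean_fragment_bp=300
)
loci = fragments / 2  # the poster's division by 2 for a diploid organism

print(f"fragments in window: {fragments:,.0f}")
print(f"estimated loci:      {loci:,.0f}")
```

Running this reproduces the 99,915 loci figure, so the arithmetic is self-consistent. Whether the model's assumptions hold is a separate question.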


          • #6
            That makes sense (your calculation). There's some minor change in fragment number on the small side of the average compared to the large side, but with a tight selection it won't matter so much. Did you have an actual genome reference for the in silico digest?

            • #7
              We do not have the genome for the fish I am looking at, but it is a close(ish) relative of Danio, which has a published genome. The one problem is that the fish I am looking at has had a genome duplication since its split from Danio. So we are just doubling the in silico estimates for Danio and hoping that gives at least a rough ballpark estimate. These estimates also fit with those from Peterson.

              Any idea on why we are so far off? Also we ran the following RE combinations and got the following estimates for number of loci:
              EcoRI x PstI = 41,667 loci
              EcoRI x MspI = 99,915 loci
              PstI x MspI = 118,912 loci
              PstI x HpyC4IV = 131,340 loci

              Based on this I was thinking about going with EcoRI x PstI so that I do not have too narrow of a size selection, but it seems odd to go with two six base cutters.


              • #8
                It might just be genome variation. Here are (rough) estimates of PstI numbers in zebrafish:
                [genomes]$ cat Daniozv9.fa | tr -d "\n" | grep -io -E "CTGCAG" | wc -l
                90066
                and stickleback:
                [genomes]$ cat Gasterosteus_aculeatus.BROADS1.61.dna_rm.toplevel.fa | tr -d "\n" | grep -io -E "CTGCAG" | wc -l
                141198

                More sites in stickleback, even though it has a genome size of 450 Mbp and zebrafish is 1.5 Gbp! Stickleback has a slightly higher GC content (42% vs 38%), but that is not enough to explain the difference. So I would trust the empirical data (TapeStation) more than an in silico search of a related genome, as long as you trust your TapeStation to give accurate results.
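For anyone who prefers a script to the shell one-liners, here is a minimal Python sketch of the same count. It differs slightly from the commands above: header lines are skipped instead of being joined into the sequence, and sites are counted per record, so a match cannot span two chromosomes. `count_sites` and the toy records are illustrative only.

```python
import re


def count_sites(fasta_lines, site):
    """Count case-insensitive, non-overlapping occurrences of `site`
    across all records of a FASTA file given as an iterable of lines."""
    pattern = re.compile(site, re.IGNORECASE)
    total = 0
    chunks = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: count sites in the record just finished.
            total += len(pattern.findall("".join(chunks)))
            chunks = []
        elif line:
            chunks.append(line)
    # Count sites in the final record.
    total += len(pattern.findall("".join(chunks)))
    return total


# Toy input: one site is split across two sequence lines in chr1.
example = [">chr1", "aaCTGC", "AGttct", "gcagAA", ">chr2", "CTGCAG"]
print(count_sites(example, "CTGCAG"))
```

For a real genome, pass an open file handle instead of the list, for example `with open(path) as fh: count_sites(fh, "CTGCAG")`, where `path` is your own FASTA file. Each record is joined in memory, which is fine for chromosome-sized sequences on a typical workstation.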

                • #9
                  Originally posted by maxbangs
                  I did have another question. Once we figure out the enzyme combination, we are going to order the adaptors/primers from the Peterson et al. (2012) protocol. Does anyone have suggestions on where to order the oligos and whether there is anything special we need to do with them?
                  It is very convenient if the manufacturer provides the oligos plated and already normalized to a standard concentration. I believe most manufacturers will do that (LifeTechnologies, IDT, Bioneer, ...), and it saves a ton of work.


                  • #10
                    If I want to estimate the number of fragments for ddRAD, where the organism's genome size is ~2.5 Gbp, the concentration of the whole sample is 19.6 ng/ul, and a size selection of 300 +/- 36 bp yields 0.47 ng/ul, does the following math work out?

                    2,500,000,000 bp * 0.47 ng / 19.6 ng / 300 bp / 2 = 99,915 loci
                    It seems to me that you have calculated the number of all fragments with an average size of 300 +/- XX bp resulting from your digestion. The sequence-able portion of that number will be the fragments that are flanked by a restriction site (overhang) of each enzyme. Fragments flanked on both ends by sites of the same enzyme will not contribute to your library or to the reads you obtain from sequencing.
                    Peterson et al. describe a protocol for estimating the number of sequence-able fragments in their supplementary material, although they do not use it in the paper or show how it correlates with real data. I have great doubts about the practicality and accuracy of that method; I will post my reasons in detail later. In the meantime, I am very interested to see your TapeStation results for your digests and your rationale for the way you have estimated fragment numbers in your target range.


                    • #11
                      Good point! I was thinking that EcoRI would cut every 3kb so the problem would be negligible, but there will be plenty of MspI - MspI fragments in the 300 bp range that will not become ddRAD loci. The PstI - EcoRI double digest should be more accurate, but I guess even there for every Pst-EcoRI fragment you'll have an equal number of PstI-PstI fragments and equal number of EcoRI-EcoRI fragments in the mix. Actually, many more of the EcoRI-EcoRI fragments given the GC content.

                      • #12
                        TapeStation results

                        Thank you for your comments nucacidhunter and SNPsaurus.
                        I guess I should explain my math. I was trying to calculate the number of expected loci for a given size selection (300 +/- 36 bp), so I calculated the proportion of DNA in the size selection (0.47 ng / 19.4 ng) and then multiplied by the size of the genome (2.42 Gbp) to get the number of bp within the selection. I then divided by the average fragment size (300 bp) to get the number of fragments within the size selection. Finally, I divided by 2 to get the number of loci, since the organism is diploid. However, as nucacidhunter pointed out, I forgot to take into account that some fragments are not sequence-able (e.g. EcoRI-EcoRI and MspI-MspI). To account for this I am just going to divide by 2 again (thus assuming ~50% of the fragments are sequence-able).

                        2,420,000,000 bp * 0.47 ng / 19.4 ng / 300 bp / 2 / 2 = 48,857 loci

                        I know it is a big assumption that 50% of the fragments are sequence-able, but after running some simulations this seems to work as long as you are not selecting anywhere the slope of the fragment distribution is steep (this includes the slope for all three fragment types). Thus, if I go with a size selection around 300 bp I get a fairly consistent estimate that matches my in silico estimates. However, if I do the same for a size selection around 224 +/- 36 bp I get a much more erratic result.
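The 50% assumption can be probed with a toy simulation. This is not the poster's simulation, which is not shown in the thread. It is a minimal sketch that assumes cut sites for two enzymes, A and B, fall uniformly at random along the genome. The spacings of 3000 bp and 770 bp are rough stand-ins for EcoRI and MspI in a 38% GC genome, and the 50 Mbp genome is scaled down to keep the run fast.

```python
import random


def simulate_double_digest(genome_bp, spacing_a, spacing_b, lo, hi, seed=0):
    """Place random cut sites for enzymes A and B, then count fragments
    with length in [lo, hi] by the enzymes that produced their two ends."""
    rng = random.Random(seed)
    n_a = int(genome_bp / spacing_a)
    n_b = int(genome_bp / spacing_b)
    cuts = [(rng.uniform(0, genome_bp), "A") for _ in range(n_a)]
    cuts += [(rng.uniform(0, genome_bp), "B") for _ in range(n_b)]
    cuts.sort()

    counts = {"AA": 0, "AB": 0, "BB": 0}
    # Each pair of adjacent cuts bounds one fragment of a complete digest.
    for (pos1, enz1), (pos2, enz2) in zip(cuts, cuts[1:]):
        if lo <= pos2 - pos1 <= hi:
            counts["".join(sorted(enz1 + enz2))] += 1
    return counts


counts = simulate_double_digest(5e7, spacing_a=3000, spacing_b=770,
                                lo=264, hi=336)
total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind}: {n} fragments ({n / total:.0%} of the size window)")
```

In this simple model the share of A-B fragments is about 2 * pA * pB, where pA and pB are the relative site frequencies of the two enzymes. It therefore reaches 50% only when both enzymes cut equally often, and falls to roughly a third for the spacings used here. The model also makes fragment type independent of fragment length, so it cannot show the slope effects mentioned above.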

                        So you may be wondering why I am doing this and not just going with the in silico estimates. Most of the organisms that I work with do not have a reference genome for any closely related species. Thus, I want to develop a system to estimate the number of fragments completely de novo.

                        As requested, I tried to attach the TapeStation results. However, the .doc file and the raw result file are each too large. If you want to look at the results, just send me an email at [email protected].

                        If you do want the results, there are two files: 1) a .doc file of results given to me by the facility and 2) the raw data from the TapeStation. You may notice that there is a warning for some samples that the concentration is too low. I thought they wanted the concentrations to be between 1 ng/ul and 50 ng/ul (as for normal D1000 DNA tapes), but since we were using the genomic DNA tape the concentrations were supposed to be >20 ng/ul. They still ran fine, and the total concentration from the TapeStation matches that of the Qubit. If you want to play with the raw data, you can download the program for free. If anyone wants to know the total concentrations from the Qubit, the genome sizes of the fish (four species) we used, or the different RE combinations we used (four combinations), just let me know. This message is already way too long.

                        Hope this is helpful, and thank you for the fast responses.


                        • #13
                          I would still worry about MspI-MspI fragments. That's 4 nucs that are GC, in a genome that is 38% GC, so you'll get a MspI site every 700 bp or so. If there is a 699/700 chance of not getting that site at some particular nuc, then in a 72 bp region there is a 10% chance of seeing a site (so if you have a MspI site somewhere, then 10% of those will have another MspI site 300 bp away). The same logic applies to MspI sites near EcoRI or PstI sites. But there are 10X more MspI sites to start with than PstI, so you need to divide by more than 2. EcoRI is better (4 fold more MspI sites).
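The numbers in this post can be checked directly. This is a sketch under the same independent-base assumption (G and C each at half the GC fraction). It prints the result both for the rounded 1-in-700 figure used above and for the spacing implied by 38% GC.

```python
gc = 0.38
p_site = (gc / 2) ** 4  # chance that CCGG starts at a given position
spacing = 1 / p_site    # expected distance between MspI sites

window_bp = 72          # width of the size-selection window (+/- 36 bp)


def p_site_in_window(p, width):
    """Probability of at least one site start within `width` positions."""
    return 1 - (1 - p) ** width


print(f"expected MspI spacing: ~{spacing:.0f} bp")
print(f"P(site in {window_bp} bp), using 1/700: "
      f"{p_site_in_window(1 / 700, window_bp):.1%}")
print(f"P(site in {window_bp} bp), using model: "
      f"{p_site_in_window(p_site, window_bp):.1%}")
```

Both versions land near the 10% quoted in the post, so the estimate is not sensitive to whether the spacing is taken as 700 bp or slightly longer.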

                          • #14
                            Based on my experience I would advise against using gDNA ScreenTape for sizing your fragments. Its sizing is load dependent: serial dilutions of the same sample (within the specified range) will give different size outcomes. In addition, the approach that you are taking (estimating fragment number based on the digestion result) may hinder successful library prep on some occasions. If the size window that you select comprises fragments from repeat regions and organelles, you may not have many useful SNPs to call.

                            So you may be wondering why I am doing this and not just going with the in silico estimates. Most of the organisms that I work with do not have a reference genome for any closely related species. Thus, I want to develop a system to estimate the number of fragments completely de novo.
                            The issue with the in silico approach is that during actual size selection with the Pippin, the eluted fragments will differ from the set point because one end of each fragment has a Y-shaped adapter, which affects migration speed. Another issue is that Pippin size selection is also load dependent, so one can expect different results depending on the amount of DNA loaded.
                            Last edited by nucacidhunter; 05-23-2014, 07:46 PM.


                            • #15
                              nucacidhunter, I have often wondered about the repeats in a particular size range issue. When my lab was working on RAD-Seq, one of the reasons we liked having one side of the RAD tag be sheared is that we saw talks from people doing RRL and how they spent so much time checking different size ranges for repeat content... it took longer to decide on a size range than to do the actual experiment. But I don't hear about that from ddRAD or GBS talks. Is it just that sequencing has gotten cheaper and it isn't worth getting fussed over losing 25M reads to repeats?

                              Good point about the Y-adapters as well. Without a good reference genome it is hard to feel that confident that the number of sites will translate from a related genome anyway, so sometimes a person has to just plunge ahead and try it!
