SEQanswers How to Estimate the Average Coverage per RAD Locus
 User Name Remember Me? Password
 Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

 Similar Threads Thread Thread Starter Forum Replies Last Post adrian Bioinformatics 0 02-16-2012 01:36 PM dacotahm Bioinformatics 1 11-22-2011 04:01 PM lisa1102 Core Facilities 8 10-18-2011 08:40 AM nork SOLiD 0 07-21-2011 11:54 PM tlking Illumina/Solexa 0 11-02-2009 08:05 AM

 01-26-2012, 03:19 AM #1 fcr Member   Location: Seville Join Date: Jan 2012 Posts: 19 How to Estimate the Average Coverage per RAD Locus Hi all, I am trying to estimate the sequencing effort needed for a valid SNP call using RADtags on a non-model organism. To have an idea of the average cover per locus I used the following equation: [No. Illumina Reads x read length] / [No. Individuals x Expected RAD sites across the genome x 2 RADtags per RAD site x average length of each RAD tag] For example, let's say we have a non-model species with a genome size of 350 Mb. We digest the DNA with and an 8-cutter RE such as SbfI. Assuming equal proportion of each base across (A=T=G=C=0.25) the genome we'll expect: (0.25^8) x 350,000,000 = 5,340 RAD sites After the size selection step the average RAD tag size will be 500 bp. If we use one Lane of an Illumina GAIIx (35 million reads of 75bp) to sequence 100 pooled individuals then the coverage per RAD locus would be: (35,000,000 x 75) / (500 x 2 x 5340 x 100) = 4.9 X As for de novo species is recommended at least 60 X (Davey et al. 2011 Nature Genetics) I think it will be necessary to use 8 individuals per lane. Are my calculations right? what do you think about the final decision?
 01-30-2012, 06:12 AM #2 Cofactor Genomics Registered Vendor   Location: St. Louis Join Date: Jan 2010 Posts: 52 Hello fcr, Your calculation of the RAD sites is correct, however instead of using the size of the RAD tags, since you won't be sequencing the tags, only the ends, this is how the math would look on coverage: 5,340 RAD fragments 10,680 RAD Sites (cut ends of fragments) If you wanted 60x coverage of these, you would need a minimum of: 10,680 x 60 = 640,800 sequences needed for each sample (At 2x75 bp reads, this would be 48 Mb of sequence needed per sample) At this level of sequencing you would be able to multiplex approximately: 35,000,000 / 640,800 = 54 samples per lane You can calculate all of the the math in sequence space. You do not need to go into base sequence to make these calculations, which is what you were doing above by including the length of the fragment. The sequence read length does not make a difference on how many RAD ends you will sample (sample = sequence), this is determined by the output of the machine. The read length will define how far you read into the fragments for SNP detection. Also, I would definitely recommend sequencing these as paired-ends. Please let me know if you have any other questions. jon
 01-30-2012, 09:15 AM #3 fcr Member   Location: Seville Join Date: Jan 2012 Posts: 19 Thanks for your answer Jon, I have two more questions ... First-To make sure I got it right. I calculated the coverage after sequencing in the next example using Holenhole's data for Stickleback. In their case 8 individuals were barcoded and sequenced in one lane. Holenlohe et al. provided a supplementary table, where for a single run they provide the following information: Individuals= 16 Total # reads= 8895289 Barcoded reads = 8269024 Aligned reads= 6497736 RAD tags= 41590 Sequence length= 26 Nucleotides=1094434 Then following your explanation above I would say that the coverage per RAD site and indivual would be calculated as Aligned Reads/RAD tags x Individuals, thus resulting in 6497736/ (41590 x 16) = 9.76 X coverage Second: I think your explanation applies to single-end reads. What happen if you use pair-end reads? (barcoding and pooling individuals on a lane) Cheers, Fernando
 01-30-2012, 02:49 PM #4 Cofactor Genomics Registered Vendor   Location: St. Louis Join Date: Jan 2010 Posts: 52 fcr, This is correct. One thing that can be confusing to some is the difference between physical coverage and sequence coverage. Physical coverage is what I used for the math above, and in general, can be used to figure the coverage on the ends. Sequence coverage is mainly used when DNA (or RNA) is randomly fragmented and sequenced to a certain depth of coverage against a reference. Sequence coverage is useful to consider when sequence reads are generated from fragment ends that are assumed to be randomly distributed across a genome. Therefore, one can figure the average sequence coverage by taking the number of aligned bases divided by the total number of reference bases. RAD-seq sequences ends that are generated in a very non-random way and thus can be modeled by the physical coverage, since all of the reads should have the same start site from the P1 adapter. After writing this I realized that you gain nothing from sequencing these as paired-ends since the 2 read will only be randomly distributed through the fragment, and not very useful for comparative SNPs. Hope this helps. jon
 01-30-2012, 10:23 PM #5 fcr Member   Location: Seville Join Date: Jan 2012 Posts: 19 Hi Jon, Actually, I thought that the main purpose of using pair-end seq. with RADs was helping to remove PCR duplicates. As fragments of equal size with identical sequence at both ends are unlikely to occur with random shearing. Thanks a lot! Fernando
 01-31-2012, 08:22 AM #6 Cofactor Genomics Registered Vendor   Location: St. Louis Join Date: Jan 2010 Posts: 52 Dear Fernando, Good point! I was only considering the utility of comparative SNPs, which of course would need to have deduplication. Great point. Paired-end it is. jon
 02-01-2012, 03:09 AM #7 fcr Member   Location: Seville Join Date: Jan 2012 Posts: 19 Hi Jon, Using pair-end reads in Illumina platform will also double the coverage (~70 million pair end-reads instead of 35 million single-reads for a GAIIx Lane). Do you know if using pair-end reads is far more expensive? I was discussing with my colleagues about using RADs through a technical service. However, we think is worthy to perform a pilot experiment in order to fine-tune the sequencing effort. Do you think to digest DNA for one individual with SbfI and estimate the number of RAD fragments obtained after size selection? I guess it can't be be done without sequencing because is necessary to identify he sequences containing the adapters. I was checking your website but it seems that your company don't do RADs. Do you know any company offering this service? If they have an European headquarter will be better for us... Cheers, Fernando
 02-08-2012, 12:53 AM #9 fcr Member   Location: Seville Join Date: Jan 2012 Posts: 19 Hi Jon, Thanks a lot for your answer. No worries for the delay In mean time, we already have contacted Floragenex and we contemplate GBS instead of RADs. But I guess we'll obtain lower number of fragments than with RAD for same enzyme... I'll definitely come back to you for advice. Cheers, FCR
 02-08-2012, 01:03 AM #10 Cofactor Genomics Registered Vendor   Location: St. Louis Join Date: Jan 2010 Posts: 52 Dear fcr, With the current cost per base of the high output machines such as the HiSeq2000, it may make sense to perform GBS as opposed to RAD depending on your goals as RAD was originally developed for arrays and then ported over to sequencing. RRS (reduced representation sequencing) can be an excellent strategy if hardware changes are in place to compensate for the low-complexity cut sites that can cause problems on the ILMN machines. Please feel free to contact me off the board if you would like some suggestion as to coverage and SNP/indel cutoffs as well as strategies to reduce low quality due to unequal base distribution at the cut sites. Best of luck! jon
 02-08-2012, 01:07 AM #11 Cofactor Genomics Registered Vendor   Location: St. Louis Join Date: Jan 2010 Posts: 52 Dear fcr, One question that I did not address above is the cost of paired-end versus single-ends. Of course, it depends on the platform, however on the ILMN, it is ~1.5x cost of single-ends. jon
 02-08-2012, 01:58 AM #12 fcr Member   Location: Seville Join Date: Jan 2012 Posts: 19 Hi Jon, Actually with current costs we discussed doing GBS using PstI 48 individuals. It brings associated sequencing on a HiSeq Lane (~100 million single reads). So assuming we'll get 10,000 RAD fragments for our 350Mb genome, the coverage per RAD site will be ~195 X. I don't know in the end too much coverage will complicate things, such as identifying individual locus and calling SNPs...what do you think? On top of that, GBS will produce less fragments than RADs, I guess half of them, doubling the coverage. making it even worse, right? I don't really know what you actually meant with "if hardware changes are in place to compensate for the low-complexity cut sites that can cause problems on the ILMN machines." could you clarify this? I would really appreciate suggestions to coverage and SNP/indel cutoffs as well as strategies to reduce low quality due to unequal base distribution at the cut sites. Thanks again, Fernando

 Tags coverage, de novo snp calling, radtags, sequencing-effort

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules

All times are GMT -8. The time now is 10:20 AM.