SEQanswers

Go Back   SEQanswers > Applications Forums > De novo discovery



Similar Threads
Thread Thread Starter Forum Replies Last Post
How can estimate the required coverage adrian Bioinformatics 0 02-16-2012 01:36 PM
How can I estimate overall coverage against a reference database? dacotahm Bioinformatics 1 11-22-2011 04:01 PM
Average Read Coverage for 454 paired end read data lisa1102 Core Facilities 8 10-18-2011 08:40 AM
SOLID 5500 - average coverage - rare events nork SOLiD 0 07-21-2011 11:54 PM
Looking for access to Illumina for RAD sequencing project tlking Illumina/Solexa 0 11-02-2009 08:05 AM

Reply
 
Thread Tools
Old 01-26-2012, 03:19 AM   #1
fcr
Member
 
Location: Seville

Join Date: Jan 2012
Posts: 19
Lightbulb How to Estimate the Average Coverage per RAD Locus

Hi all,

I am trying to estimate the sequencing effort needed for a valid SNP call using RADtags on a non-model organism.

To have an idea of the average cover per locus I used the following equation:

[No. Illumina Reads x read length] / [No. Individuals x Expected RAD sites across the genome x 2 RADtags per RAD site x average length of each RAD tag]


For example, let's say we have a non-model species with a genome size of 350 Mb. We digest the DNA with and an 8-cutter RE such as SbfI. Assuming equal proportion of each base across (A=T=G=C=0.25) the genome we'll expect:

(0.25^8) x 350,000,000 = 5,340 RAD sites


After the size selection step the average RAD tag size will be 500 bp. If we use one Lane of an Illumina GAIIx (35 million reads of 75bp) to sequence 100 pooled individuals then the coverage per RAD locus would be:

(35,000,000 x 75) / (500 x 2 x 5340 x 100) = 4.9 X

As for de novo species is recommended at least 60 X (Davey et al. 2011 Nature Genetics) I think it will be necessary to use 8 individuals per lane.

Are my calculations right? what do you think about the final decision?
fcr is offline   Reply With Quote
Old 01-30-2012, 06:12 AM   #2
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

Hello fcr,
Your calculation of the RAD sites is correct, however instead of using the size of the RAD tags, since you won't be sequencing the tags, only the ends, this is how the math would look on coverage:

5,340 RAD fragments
10,680 RAD Sites (cut ends of fragments)


If you wanted 60x coverage of these, you would need a minimum of:

10,680 x 60 = 640,800 sequences needed for each sample
(At 2x75 bp reads, this would be 48 Mb of sequence needed per sample)

At this level of sequencing you would be able to multiplex approximately:

35,000,000 / 640,800 = 54 samples per lane

You can calculate all of the the math in sequence space. You do not need to go into base sequence to make these calculations, which is what you were doing above by including the length of the fragment. The sequence read length does not make a difference on how many RAD ends you will sample (sample = sequence), this is determined by the output of the machine. The read length will define how far you read into the fragments for SNP detection. Also, I would definitely recommend sequencing these as paired-ends.

Please let me know if you have any other questions.

jon
Cofactor Genomics is offline   Reply With Quote
Old 01-30-2012, 09:15 AM   #3
fcr
Member
 
Location: Seville

Join Date: Jan 2012
Posts: 19
Default

Thanks for your answer Jon,

I have two more questions ...

First-To make sure I got it right. I calculated the coverage after sequencing in the next example using Holenhole's data for Stickleback. In their case 8 individuals were barcoded and sequenced in one lane. Holenlohe et al. provided a supplementary table, where for a single run they provide the following information:

Individuals= 16
Total # reads= 8895289
Barcoded reads = 8269024
Aligned reads= 6497736
RAD tags= 41590
Sequence length= 26
Nucleotides=1094434

Then following your explanation above I would say that the coverage per RAD site and indivual would be calculated as Aligned Reads/RAD tags x Individuals, thus resulting in 6497736/ (41590 x 16) = 9.76 X coverage

Second: I think your explanation applies to single-end reads. What happen if you use pair-end reads? (barcoding and pooling individuals on a lane)


Cheers,
Fernando
fcr is offline   Reply With Quote
Old 01-30-2012, 02:49 PM   #4
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

fcr,
This is correct. One thing that can be confusing to some is the difference between physical coverage and sequence coverage. Physical coverage is what I used for the math above, and in general, can be used to figure the coverage on the ends. Sequence coverage is mainly used when DNA (or RNA) is randomly fragmented and sequenced to a certain depth of coverage against a reference. Sequence coverage is useful to consider when sequence reads are generated from fragment ends that are assumed to be randomly distributed across a genome. Therefore, one can figure the average sequence coverage by taking the number of aligned bases divided by the total number of reference bases. RAD-seq sequences ends that are generated in a very non-random way and thus can be modeled by the physical coverage, since all of the reads should have the same start site from the P1 adapter. After writing this I realized that you gain nothing from sequencing these as paired-ends since the 2 read will only be randomly distributed through the fragment, and not very useful for comparative SNPs.

Hope this helps.

jon
Cofactor Genomics is offline   Reply With Quote
Old 01-30-2012, 10:23 PM   #5
fcr
Member
 
Location: Seville

Join Date: Jan 2012
Posts: 19
Default

Hi Jon,

Actually, I thought that the main purpose of using pair-end seq. with RADs was helping to remove PCR duplicates. As fragments of equal size with identical sequence at both ends are unlikely to occur with random shearing.


Thanks a lot!
Fernando
fcr is offline   Reply With Quote
Old 01-31-2012, 08:22 AM   #6
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

Dear Fernando,
Good point! I was only considering the utility of comparative SNPs, which of course would need to have deduplication. Great point. Paired-end it is.

jon
Cofactor Genomics is offline   Reply With Quote
Old 02-01-2012, 03:09 AM   #7
fcr
Member
 
Location: Seville

Join Date: Jan 2012
Posts: 19
Default

Hi Jon,

Using pair-end reads in Illumina platform will also double the coverage (~70 million pair end-reads instead of 35 million single-reads for a GAIIx Lane). Do you know if using pair-end reads is far more expensive?

I was discussing with my colleagues about using RADs through a technical service. However, we think is worthy to perform a pilot experiment in order to fine-tune the sequencing effort. Do you think to digest DNA for one individual with SbfI and estimate the number of RAD fragments obtained after size selection? I guess it can't be be done without sequencing because is necessary to identify he sequences containing the adapters.

I was checking your website but it seems that your company don't do RADs. Do you know any company offering this service? If they have an European headquarter will be better for us...

Cheers,
Fernando
fcr is offline   Reply With Quote
Old 02-07-2012, 08:25 AM   #8
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

fcr,
sorry for my silence, I was traveling.

Using pair-end reads in Illumina platform will also double the coverage (~70 million pair end-reads instead of 35 million single-reads for a GAIIx Lane).

It will double the coverage, however it does not give you any additional RAD SNP power. The P1 adapter end is the one with the RAD flanking region and with single end sequencing you would generate ~30 million reads from those ends, however when you add paired-end reads, you then generate ~60 million reads, split across both ends of the fragment (P1 and P2 adapter), which still gives you ~30 million reads at the P1 end.

I was discussing with my colleagues about using RADs through a technical service. However, we think is worthy to perform a pilot experiment in order to fine-tune the sequencing effort. Do you think to digest DNA for one individual with SbfI and estimate the number of RAD fragments obtained after size selection? I guess it can't be be done without sequencing because is necessary to identify he sequences containing the adapters.

My initial thought would be to go ahead and use 3-4 RE to cut the DNA, if DNA quantity is not a concern. That part will take the longest time anyway, so you might as well run several in parallel. You can then barcode the libraries and sequence them in one lane and gain more than n=1.

I was checking your website but it seems that your company don't do RADs. Do you know any company offering this service? If they have an European headquarter will be better for us..

Floragenex in the US has IP wrapped around the original protocol, thus they are one of the few suppliers. We are in the process of developing a novel and more cost effective protocol, however it is not prime-time yet.

Please let me know if you have additional questions. I am glad to help.

Jon

Last edited by Cofactor Genomics; 02-07-2012 at 11:25 PM.
Cofactor Genomics is offline   Reply With Quote
Old 02-08-2012, 12:53 AM   #9
fcr
Member
 
Location: Seville

Join Date: Jan 2012
Posts: 19
Default

Hi Jon,

Thanks a lot for your answer. No worries for the delay

In mean time, we already have contacted Floragenex and we contemplate GBS instead of RADs. But I guess we'll obtain lower number of fragments than with RAD for same enzyme...

I'll definitely come back to you for advice.

Cheers,
FCR
fcr is offline   Reply With Quote
Old 02-08-2012, 01:03 AM   #10
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

Dear fcr,
With the current cost per base of the high output machines such as the HiSeq2000, it may make sense to perform GBS as opposed to RAD depending on your goals as RAD was originally developed for arrays and then ported over to sequencing. RRS (reduced representation sequencing) can be an excellent strategy if hardware changes are in place to compensate for the low-complexity cut sites that can cause problems on the ILMN machines. Please feel free to contact me off the board if you would like some suggestion as to coverage and SNP/indel cutoffs as well as strategies to reduce low quality due to unequal base distribution at the cut sites. Best of luck!
jon
Cofactor Genomics is offline   Reply With Quote
Old 02-08-2012, 01:07 AM   #11
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

Dear fcr,
One question that I did not address above is the cost of paired-end versus single-ends. Of course, it depends on the platform, however on the ILMN, it is ~1.5x cost of single-ends.

jon
Cofactor Genomics is offline   Reply With Quote
Old 02-08-2012, 01:58 AM   #12
fcr
Member
 
Location: Seville

Join Date: Jan 2012
Posts: 19
Default

Hi Jon,

Actually with current costs we discussed doing GBS using PstI 48 individuals. It brings associated sequencing on a HiSeq Lane (~100 million single reads). So assuming we'll get 10,000 RAD fragments for our 350Mb genome, the coverage per RAD site will be ~195 X. I don't know in the end too much coverage will complicate things, such as identifying individual locus and calling SNPs...what do you think?

On top of that, GBS will produce less fragments than RADs, I guess half of them, doubling the coverage. making it even worse, right?

I don't really know what you actually meant with "if hardware changes are in place to compensate for the low-complexity cut sites that can cause problems on the ILMN machines." could you clarify this?

I would really appreciate suggestions to coverage and SNP/indel cutoffs as well as strategies to reduce low quality due to unequal base distribution at the cut sites.


Thanks again,
Fernando
fcr is offline   Reply With Quote
Reply

Tags
coverage, de novo snp calling, radtags, sequencing-effort

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:20 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO