**swbarnes2** · 11-08-2011, 12:37 PM

Why are you multiplying 120 million reads by 200, if each read is 100 bases long? A read is one end, a cluster has two reads.

It's 120x by those calculations, but obviously not every read will fall on target, so it will be lower than that.

**arvi8689** · 11-08-2011, 12:59 PM

It can read 120 million fragments, and each fragment will be read twice at 100 bp length. So I thought I would get twice that.

**swbarnes2** · 11-09-2011, 01:17 PM

I think you are conflating fragments and clusters and reads.

One read is just one read. One fragment generates one cluster on the Illumina flow cell, and two reads come from that one cluster.

If you were told 120 million reads, like you write in your first post, then you don't double that again. If you were told 120 million clusters, that's 240 million reads at 100 bp each.
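To make the reads-versus-clusters arithmetic concrete, here is a minimal sketch. The function names are made up for illustration, and the 100 Mb target size is an assumption: it is simply the size at which 120 million 100 bp reads work out to the 120x mentioned above.

```python
def total_bases(count, read_length, count_is_clusters=False):
    """Total sequenced bases for a paired-end run.

    If `count` is a number of clusters (fragments), each one yields two reads.
    If `count` is already a number of reads, it must not be doubled again.
    """
    reads = count * 2 if count_is_clusters else count
    return reads * read_length


def raw_coverage(bases, target_size):
    """Upper-bound depth, assuming every base lands on target."""
    return bases / target_size


TARGET_SIZE = 100_000_000  # assumption: 100 Mb target, implied by the 120x figure

as_reads = raw_coverage(total_bases(120_000_000, 100), TARGET_SIZE)
as_clusters = raw_coverage(
    total_bases(120_000_000, 100, count_is_clusters=True), TARGET_SIZE
)
print(as_reads, as_clusters)  # 120.0 240.0
```

Either number is a ceiling; off-target reads will pull the real depth below it.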

**simonandrews** · 11-10-2011, 12:53 AM

It's worth remembering that with 100bp reads you'll get a reasonable proportion of your library where there will be an overlap between the ends of reads 1 and 2 so this will reduce your effective coverage. There will even be plenty of sequences where read 2 provides no additional coverage (where read1 reads right through the insert into the other end adapter).
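A rough sketch of that overlap effect for a single read pair, assuming adapter sequence is trimmed away and both reads are the same length. The function name and the insert sizes used are illustrative only.

```python
def pair_bases(insert_size, read_length=100):
    """Return (sequenced, unique, overlap) insert bases for one read pair.

    sequenced: insert bases read by read 1 plus read 2 (adapter excluded)
    unique:    distinct insert positions covered by at least one read
    overlap:   insert positions covered by both reads
    """
    on_insert = min(read_length, insert_size)  # the rest is adapter read-through
    sequenced = 2 * on_insert
    unique = min(insert_size, sequenced)
    overlap = sequenced - unique
    return sequenced, unique, overlap


for insert in (300, 150, 100, 80):
    print(insert, pair_bases(insert))
# 300 (200, 200, 0)
# 150 (200, 150, 50)
# 100 (200, 100, 100)
# 80 (160, 80, 80)
```

At an insert size of 100 bp or less, read 2 covers exactly the positions read 1 already covered, so it adds depth but no new positions.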

**pmiguel** · 11-10-2011, 11:13 AM

Originally posted by simonandrews:

It's worth remembering that with 100bp reads you'll get a reasonable proportion of your library where there will be an overlap between the ends of reads 1 and 2 so this will reduce your effective coverage. There will even be plenty of sequences where read 2 provides no additional coverage (where read1 reads right through the insert into the other end adapter).

"coverage", to me, means average read depth. Like "my 1.5 billion bases of reads gives me 10x coverage of the arabidopsis genome." By this definition, two 100 nt reads from a 100 bp insert would provide double the effective coverage of just one read.

You seem to be referring to what I would call "% of genome covered".

--
Phillip

**simonandrews** · 11-11-2011, 03:36 AM

Originally posted by pmiguel:

"coverage", to me, means average read depth. Like "my 1.5 billion bases of reads gives me 10x coverage of the arabidopsis genome." By this definition, two 100 nt reads from a 100 bp insert would provide double the effective coverage of just one read.

I suppose this comes down to where you think your errors will occur. Resequencing the same fragment multiple times will help to correct sequencing errors, but won't help if the fragment picked up a PCR error during library preparation.

I guess I tend to think in terms of epigenetics where there isn't a single fixed epigenome to measure, so the distinction between two reads from the same fragment and two reads from different fragments actually matters. If you're only concerned with sequencing errors then I guess you count overlapping reads equally.

**csquared** · 11-11-2011, 03:53 PM

A quick and dirty estimation of final coverage in a sequence capture experiment using a hybridization based method is to assume about 50% efficiency.

Looking at the summary data over a few dozen different custom captures and a few thousand exome captures from Agilent and Nimblegen, a reasonable estimation of depth of coverage from total sequence data is to assume about a 50% efficiency in the entire process.

For example, if your capture region is 100Mb and your total sequence yield is 5Gb, your coverage would be 50x if every sequence read aligned within the capture region and everything was 100% efficient and evenly distributed. In reality, you will see median coverages in the 25x range once all of the inefficiencies are accounted for.
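The same example as a quick calculation. This is only a sketch of the rule of thumb; the function name is hypothetical and the 0.5 default is the efficiency figure quoted above.

```python
def expected_median_coverage(total_yield_bp, capture_region_bp, efficiency=0.5):
    """Ideal depth (yield / region) scaled by an overall process efficiency."""
    return total_yield_bp / capture_region_bp * efficiency


ideal = expected_median_coverage(5_000_000_000, 100_000_000, efficiency=1.0)
realistic = expected_median_coverage(5_000_000_000, 100_000_000)
print(ideal, realistic)  # 50.0 25.0
```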

If you want to calculate the amount of sequence needed for a particular scenario, say to cover at least 80% of the capture region to at least 20x, the relationship is not linear: the required median coverage climbs steeply as the target fraction increases. It can be approximated by:

To have at least 70% of the capture region covered at 'Y' coverage, multiply 'Y' by 2 to estimate the median coverage needed.
To have at least 80% of the capture region covered at 'Y' coverage, multiply 'Y' by 4 to estimate the median coverage needed.
To have at least 90% of the capture region covered at 'Y' coverage, multiply 'Y' by 7 to estimate the median coverage needed.
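The multipliers above can be turned into a small lookup. In this sketch, converting the required median coverage back into a total yield by dividing by the 50% efficiency is an assumption that combines the two rules of thumb in this post; the names and the 100 Mb region are illustrative.

```python
# Multipliers from the rules of thumb above: percent of region -> factor on 'Y'
MEDIAN_MULTIPLIER = {70: 2, 80: 4, 90: 7}


def required_median_coverage(min_depth, percent_covered):
    """Median coverage needed so that `percent_covered` of the region reaches `min_depth`."""
    return min_depth * MEDIAN_MULTIPLIER[percent_covered]


def required_yield_bp(min_depth, percent_covered, capture_region_bp, efficiency=0.5):
    """Total sequence needed, assuming median coverage = yield / region * efficiency."""
    median = required_median_coverage(min_depth, percent_covered)
    return median * capture_region_bp / efficiency


median = required_median_coverage(20, 80)
yield_bp = required_yield_bp(20, 80, 100_000_000)
print(median, yield_bp / 1e9)  # 80 16.0
```

So 80% of a 100 Mb region at 20x or better would need roughly 80x median coverage, or about 16 Gb of raw sequence under these assumptions.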

All of the above are based on human exome capture. YRMV.

A number of factors influence the final numbers including sequencing read length, insert size, specificity of the capture reagent/region, etc. The 50% is a very good estimation for mammalian species. Really don't know how well it would apply to other organisms, but suspect it would be close.

Similar to Simon, we have found mostly minor issues introduced in variant calling when the same physical fragment is sequenced twice, resulting in over-statement of variant quality scores. The effect of sequencing the same fragment twice on data produced for sequence census methods (ChIPseq, RNAseq, Methylseq) is substantially more pronounced, in that you double count short fragments and introduce an insert-length-dependent bias in the data.

If the paired reads overlap following duplicate removal, we trim them back at the BAM stage to allow the reads to meet end to end. During the trim, the exact proportion of overlapping bases can be tracked to provide a summary report of the total bases removed.
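A minimal sketch of that kind of trim, working on plain reference coordinates rather than real BAM records. How the overlap is divided between the two reads is not stated above; splitting it at the midpoint is an assumption made here for illustration, as are the function names and the example intervals.

```python
def trim_pair(read1, read2):
    """Trim an overlapping pair so the reads meet end to end.

    read1, read2: (start, end) half-open reference intervals, with read1 the
    leftmost read of the pair. Returns (new_read1, new_read2, bases_removed).
    """
    s1, e1 = read1
    s2, e2 = read2
    if s2 < s1 or e2 < e1:
        raise ValueError("expected read1 to be the leftmost read of the pair")
    overlap = max(0, e1 - s2)
    if overlap == 0:
        return read1, read2, 0
    # Assumption: cut at the midpoint of the overlap, trimming both reads.
    cut = s2 + overlap // 2
    return (s1, cut), (cut, e2), overlap


def trim_summary(pairs):
    """Return (total_bases, bases_removed, fraction_removed) over all pairs."""
    total = removed = 0
    for read1, read2 in pairs:
        total += (read1[1] - read1[0]) + (read2[1] - read2[0])
        removed += trim_pair(read1, read2)[2]
    return total, removed, removed / total


pairs = [
    ((0, 100), (50, 150)),     # 50 bp overlap
    ((200, 300), (320, 420)),  # no overlap
    ((500, 600), (500, 600)),  # read 2 fully overlaps read 1
]
print(trim_pair(*pairs[0]))  # ((0, 75), (75, 150), 50)
print(trim_summary(pairs))   # (600, 150, 0.25)
```

After trimming, each reference position within a fragment is counted once, and the running total of removed bases gives the summary figure.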


coverage calculation
