  • coverage calculation

    Hey

    I am trying to sequence the exome, and the capture kit targets 100 Mb.

    The sequencing core promised 120 million reads per lane. We are using paired-end 100 bp reads, and our fragment size is 250 bp.

    My calculation was: 120 million reads * 200 bp = 24 billion bases read

    so coverage = 24 billion bases / 100 Mb = 240x coverage (average)

    But some people say I will get a coverage of only 120x. What could be the reason? Or is the coverage actually 240x?

  • #2
    Why are you multiplying 120 million reads by 200, if each read is 100 bases long? A read is one end, a cluster has two reads.

    It's 120x by those calculations, but obviously not every read will fall on target, so it will be lower than that.


    • #3
      It can read 120 million fragments, and each fragment will be read twice at 100 bp length. So I thought I would get twice that.


      • #4
        I think you are conflating fragments and clusters and reads.

        One read is just one read. One fragment generates one cluster on the Illumina flow cell, and two reads come from that one cluster.

        If you were told 120 million reads, like you write in your first post, then you don't double that again. If you were told 120 million clusters, that is 240 million reads at 100 bp each.
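The two readings of "120 million" can be compared with a short calculation. This is only a sketch: the function name `mean_coverage` and its parameters are made up for illustration, and it assumes every sequenced base lands on the 100 Mb target, which will not hold in practice.

```python
def mean_coverage(count_millions, unit, read_length_bp, target_mb):
    """Mean depth over the target, assuming every base is on target.

    unit is "reads" (each counted item is one read) or "clusters"
    (each counted item yields two reads in a paired-end run).
    """
    reads_per_unit = {"reads": 1, "clusters": 2}[unit]
    total_bases = count_millions * 1_000_000 * reads_per_unit * read_length_bp
    return total_bases / (target_mb * 1_000_000)


# 120 million reads of 100 bp over a 100 Mb target
print(mean_coverage(120, "reads", 100, 100))     # 120.0
# 120 million clusters, i.e. 240 million reads of 100 bp
print(mean_coverage(120, "clusters", 100, 100))  # 240.0
```

So the 120x versus 240x disagreement comes down to whether the core quoted reads or clusters.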


        • #5
          It's worth remembering that with 100bp reads you'll get a reasonable proportion of your library where there will be an overlap between the ends of reads 1 and 2 so this will reduce your effective coverage. There will even be plenty of sequences where read 2 provides no additional coverage (where read1 reads right through the insert into the other end adapter).
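A rough way to see how much a read pair overlaps for a given insert length is sketched below. The function names are hypothetical, and the sketch assumes a single fixed insert length that excludes adapters; a real library has a distribution of insert sizes.

```python
def overlap_bases(insert_bp, read_length_bp):
    """Insert bases sequenced by both read 1 and read 2."""
    # A read cannot cover more insert than exists; the rest is adapter.
    usable = min(insert_bp, read_length_bp)
    return max(0, 2 * usable - insert_bp)


def unique_bases(insert_bp, read_length_bp):
    """Insert bases covered at least once by the pair."""
    return min(insert_bp, 2 * read_length_bp)


for insert in (250, 150, 80):
    print(insert, overlap_bases(insert, 100), unique_bases(insert, 100))
# 250 0 200   no overlap
# 150 50 150  the reads share 50 bases
# 80 80 80    read 2 adds no new bases
```

With the 250 bp fragments from the first post and 100 bp reads there would be no overlap, provided the 250 bp refers to the insert alone and not insert plus adapters.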


          • #6
            Originally posted by simonandrews:
            It's worth remembering that with 100bp reads you'll get a reasonable proportion of your library where there will be an overlap between the ends of reads 1 and 2 so this will reduce your effective coverage. There will even be plenty of sequences where read 2 provides no additional coverage (where read1 reads right through the insert into the other end adapter).
            "Coverage", to me, means average read depth. Like "my 1.5 billion bases of reads gives me 10x coverage of the Arabidopsis genome." By this definition, two 100 nt reads from a 100 bp insert would provide double the effective coverage of just one read.

            You seem to be referring to what I would call "% of genome covered".

            --
            Phillip


            • #7
              Originally posted by pmiguel:
              "coverage", to me, means average read depth. Like "my 1.5 billion bases of reads gives me 10x coverage of the arabidopsis genome." By this definition, two 100 nt reads from a 100 bp insert would provide double the effective coverage of just one read.
              I suppose this comes down to where you think your errors will occur. Resequencing the same fragment multiple times will help to correct sequencing errors, but won't help if the fragment picked up a PCR error during library preparation.

              I guess I tend to think in terms of epigenetics where there isn't a single fixed epigenome to measure, so the distinction between two reads from the same fragment and two reads from different fragments actually matters. If you're only concerned with sequencing errors then I guess you count overlapping reads equally.


              • #8
                A quick and dirty estimation of final coverage in a sequence capture experiment using a hybridization based method is to assume about 50% efficiency.

                Looking at the summary data over a few dozen different custom captures and a few thousand exome captures from Agilent and Nimblegen, a reasonable estimation of depth of coverage from total sequence data is to assume about a 50% efficiency in the entire process.

                For example, if your capture region is 100Mb and your total sequence yield is 5Gb, your coverage would be 50x if every sequence read aligned within the capture region and everything was 100% efficient and evenly distributed. In reality, you will see median coverages in the 25x range once all of the inefficiencies are accounted for.
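The rule of thumb above can be written as a one-line estimate. This is a sketch only; `expected_coverage` and its parameter names are illustrative, and the 0.5 default is simply the roughly 50% efficiency quoted in this post.

```python
def expected_coverage(yield_gb, target_mb, efficiency=0.5):
    """Rough median depth: raw yield over target size, scaled by efficiency."""
    ideal = yield_gb * 1000 / target_mb  # depth if everything were on target
    return ideal * efficiency


print(expected_coverage(5, 100, efficiency=1.0))  # 50.0, the idealized case
print(expected_coverage(5, 100))                  # 25.0, with ~50% efficiency
print(expected_coverage(12, 100))                 # 60.0
```

The last line applies the same estimate to the first post: 120 million reads of 100 bp is 12 Gb, so roughly 60x rather than 120x.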

                If you want to calculate the amount of sequence needed for a particular scenario, say to cover at least 80% of the capture region to at least 20x, the relationship is not linear but more exponential and can be approximated by:

                To have at least 70% of the capture region covered at 'Y' coverage, multiply 'Y' by 2 to estimate the median coverage needed.
                To have at least 80% of the capture region covered at 'Y' coverage, multiply 'Y' by 4 to estimate the median coverage needed.
                To have at least 90% of the capture region covered at 'Y' coverage, multiply 'Y' by 7 to estimate the median coverage needed.

                All of the above are based on human exome capture. YRMV.
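The multipliers above can be combined with the 50% efficiency estimate to work backwards to a sequencing yield. This is a hedged sketch: the table holds only the three values given in this post, the function names are invented, and it assumes the 50% efficiency is what links raw yield to median coverage.

```python
# Percent of capture region -> multiplier on the desired minimum depth,
# taken from the three rules of thumb above (human exome capture).
MEDIAN_MULTIPLIER = {70: 2, 80: 4, 90: 7}


def median_coverage_needed(min_depth, percent_covered):
    """Median depth needed so percent_covered of the target reaches min_depth."""
    return min_depth * MEDIAN_MULTIPLIER[percent_covered]


def yield_needed_gb(min_depth, percent_covered, target_mb, efficiency=0.5):
    """Raw sequence yield (Gb) needed to reach that median depth."""
    median = median_coverage_needed(min_depth, percent_covered)
    return median * target_mb / efficiency / 1000


print(median_coverage_needed(20, 80))  # 80
print(yield_needed_gb(20, 80, 100))    # 16.0
```

For the scenario mentioned above (at least 80% of a 100 Mb region at 20x or better), this gives a median of about 80x, or roughly 16 Gb of raw sequence.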

                A number of factors influence the final numbers, including sequencing read length, insert size, specificity of the capture reagent/region, etc. The 50% is a very good estimate for mammalian species. I really don't know how well it would apply to other organisms, but I suspect it would be close.


                Similar to Simon, we have found mostly minor issues introduced in variant calling when the same physical fragment is sequenced twice, resulting in overstatement of variant quality scores. The effect of sequencing the same fragment on data produced for sequencing census methods (ChIPseq, RNAseq, Methylseq) is substantially more pronounced, in that you double count short fragments and introduce an insert-length-dependent bias in the data.

                If the paired reads overlap following duplicate removal, we trim them back at the BAM stage to allow the reads to meet end to end. During the trim, the exact proportion of overlapping bases can be tracked to provide a summary report of the total bases removed.
                HudsonAlpha Institute for Biotechnology
                http://www.hudsonalpha.org/gsl
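The overlap trimming described in the post above can be illustrated on alignment coordinates alone. This is not the poster's pipeline, just a minimal sketch under stated assumptions: ungapped alignments given as half-open reference intervals, the forward read starting and ending no later than the reverse read, and the overlap split evenly between the two reads (the post does not say how the trim is divided). The name `trim_pair` is hypothetical.

```python
def trim_pair(fwd, rev):
    """Trim an overlapping pair so the two reads meet end to end.

    fwd and rev are (start, end) half-open reference intervals.
    Assumes fwd starts and ends no later than rev (no read-through).
    Returns the trimmed intervals and the number of overlapping bases.
    """
    fwd_start, fwd_end = fwd
    rev_start, rev_end = rev
    overlap = max(0, fwd_end - rev_start)
    fwd_trim = overlap // 2        # taken from the 3' end of the forward read
    rev_trim = overlap - fwd_trim  # taken from the 3' end of the reverse read
    return (fwd_start, fwd_end - fwd_trim), (rev_start + rev_trim, rev_end), overlap


pairs = [((1000, 1100), (1050, 1150)), ((2000, 2100), (2150, 2250))]
total_removed = 0
for fwd, rev in pairs:
    new_fwd, new_rev, removed = trim_pair(fwd, rev)
    total_removed += removed
    print(new_fwd, new_rev, removed)
print("bases removed:", total_removed)
```

Summing the per-pair overlap, as in the loop, gives the kind of "total bases removed" figure the post mentions.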
