  • Correlation between sequencing depth and false positives

    Hello,

    I am looking for documentation about the correlation between sequencing depth and false discovery rate.
    If there is a SNP at a position and the coverage at this position is low, is the probability of detecting this SNP lower than if the coverage were high?
    Or, if a differentially expressed gene has low average coverage, is the probability of detecting it as differentially expressed lower than if the coverage were high?

    Do you know if there are studies or papers about this point?
    Thanks in advance,
    Jane

  • #2
    This article was reviewed by Rohan Williams (nominated by Gavin Huttley), Nicole Cloonan (nominated by Mark Ragan) and James Bullard (nominated by Sandrine Dudoit).



    • #3
      Thanks for the paper.
      I am particularly interested in the effect of sequencing depth on SNP and indel detection. Are there papers on this topic?

      If there is a SNP at a position, do we have the same probability of detecting it at low coverage as at high coverage, assuming the same sequencing error rate?

      Thanks,
      Jane



      • #4
        Originally posted by Jane M
        If there is a SNP at a position, do we have the same probability of detecting it at low coverage as at high coverage, assuming the same sequencing error rate?
        All other things equal, the probability of detecting a SNP at high coverage will be higher than the probability of detecting a SNP at low coverage. The error rate at a particular location will decrease with repeated sampling, increasing the reliability of measurement.

        This is not a particularly meaningful statement. Of course there will be some point where an increase in quality won't significantly increase the reliability of measurement (e.g. phred score of 40 or so, considering repeated sampling). However, in almost all cases the actual SNP frequency will play a greater role in detection, and the difference in detection probability will be insignificant for high-frequency SNPs and for very low-frequency SNPs.

        If the chance of a polymorphism is near 50%, then you'd need a coverage of less than 6 or so over a region (my ball-park guess) to miss repeated observations of both variants of a dimorphic SNP. Conversely, for a SNP (depending on the definition of SNP) with frequency less than 1%, you'd have to be quite lucky to get any sample that has the variant of interest.
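The ball-park figure above can be checked with a minimal sketch, assuming reads at a site are independent binomial draws and ignoring sequencing error. The function name and the `min_reads=2` threshold (standing in for "repeated observations") are illustrative choices, not something taken from a particular variant caller.

```python
from math import comb

def prob_both_alleles_seen(depth, min_reads=2, allele_fraction=0.5):
    """P(each allele is observed at least min_reads times) at a given depth.

    Assumes independent reads and no sequencing error (a simplification).
    """
    p = allele_fraction
    total = 0.0
    # k = number of reads carrying the variant allele; the remaining
    # depth - k reads carry the other allele.
    for k in range(min_reads, depth - min_reads + 1):
        total += comb(depth, k) * p**k * (1 - p) ** (depth - k)
    return total

if __name__ == "__main__":
    for depth in (4, 6, 10, 20, 30):
        print(depth, round(prob_both_alleles_seen(depth), 4))
    # A rare variant (1% of reads) at depth 100, for comparison.
    print(round(prob_both_alleles_seen(100, allele_fraction=0.01), 4))
```

Under these assumptions, a 50% variant is seen at least twice alongside the other allele in roughly three quarters of sites at depth 6, and almost always by depth 20 to 30, while a 1% variant is still missed most of the time at depth 100.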



        • #5
          Or, if a differentially expressed gene has low average coverage, is the probability of detecting it as differentially expressed lower than if the coverage were high?
          This is quite a different question from the SNP question, because there are two dimensions of measurement that influence the probability that differential expression is significant even when just considering the read counts at a single base-pair location (number of raw reads, and fold-change difference). A low number of raw reads increases the measurement error, increasing the fold-change difference that would need to be observed for a differential expression to be considered significant (note: raw read counts, not normalised read counts).
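To illustrate how the raw read count drives the fold change needed for significance, here is a crude sketch. It assumes raw counts are Poisson distributed with no biological variation between replicates (real RNA-seq data are overdispersed, so this is optimistic), and it uses the delta-method approximation for the standard error of a log ratio. The function names and the `z=1.96` cutoff are illustrative.

```python
from math import sqrt, log

def log2fc_standard_error(count_a, count_b):
    """Approximate SE of log2(count_b / count_a) for two Poisson counts."""
    return sqrt(1.0 / count_a + 1.0 / count_b) / log(2)

def min_detectable_fold_change(baseline_count, z=1.96):
    """Rough fold change whose log2 value sits z standard errors from zero.

    Simplification: both conditions are treated as having about
    baseline_count raw reads when the standard error is computed.
    """
    se = log2fc_standard_error(baseline_count, baseline_count)
    return 2 ** (z * se)

if __name__ == "__main__":
    for n in (5, 10, 100, 1000, 10000):
        print(n, round(min_detectable_fold_change(n), 2))
```

With about 10 raw reads per condition, the sketch suggests a fold change of well over 2 is needed before counting noise alone is ruled out; with 1000 reads, a change of around 10% is already outside the noise.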

          Again, with all other things equal, a high coverage will increase the reliability of the result, but this time it has a much greater role to play in determining whether the expression difference is significant.

          Unfortunately, there are plenty of other confounding factors, such that differential expression analysis by NGS can really only be used for fishing / hypothesis generation. Off the top of my head, there are multiply-mapped reads, multiple isoforms / splice variants, incomplete coverage of the gene / transcript, PCR duplicates, and incorrect gene annotation. Some of these situations can be identified by looking at coverage plots at a transcript level, but that requires too much effort and human intervention to work at a genome-wide scale.

          If you really want to doubt the reliability of your results, look at the coefficient of variation for coverage in all transcripts (SD of coverage divided by mean coverage). The last time I looked at that, I think about 70~125% described a "good" coverage, and most transcripts were over something like 300%. I'd be interested to know other people's experience regarding this matter.
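For reference, the coefficient of variation described above could be computed per transcript along these lines. This is a sketch that assumes per-base coverage values are already available as a list of numbers; the post does not say whether the population or sample SD was used, so the population SD here is an assumption.

```python
from statistics import mean, pstdev

def coverage_cv_percent(per_base_coverage):
    """Coefficient of variation (SD / mean) of coverage, as a percentage."""
    mu = mean(per_base_coverage)
    if mu == 0:
        return float("nan")  # CV is undefined for an uncovered transcript
    return 100.0 * pstdev(per_base_coverage) / mu

if __name__ == "__main__":
    even = [10, 12, 9, 11, 10, 8]     # fairly uniform coverage
    spiky = [0, 0, 0, 40, 0, 0, 2]    # most reads piled on one position
    print(round(coverage_cv_percent(even), 1))
    print(round(coverage_cv_percent(spiky), 1))
```

Perfectly flat coverage gives 0%, and coverage concentrated on a few positions pushes the value well above 100%.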



          • #6
            Thanks a lot for your answer gringer!

            I must admit that currently, I'm particularly interested in the detection of SNPs. So I would like to have an idea about the reliability of my results when having low coverage.
            I detect variants in these two extreme cases:
            - 3 reads for the reference and 3 reads for the variant
            - 100 reads for the reference and 100 reads for the variant

            Has anyone estimated the reliability of results as a function of sequencing depth? Gringer, can you suggest publications about it?

            Jane



            • #7
              I detect variants in these two extreme cases:
              - 3 reads for the reference and 3 reads for the variant
              - 100 reads for the reference and 100 reads for the variant
              That's not a particularly extreme case. It suggests SNP frequencies of 50%, which means coverage is not going to matter. Of course for a heterozygous sample, this is expected. Are these reads for a single sample (i.e. you're looking at a heterozygous sample), or for multiple samples? You should be doing your SNP detection using pooled reads for all samples, and then type according to this. A more interesting case (for a single sample) would be something like the following:

              SNP 1: 1 read for the reference and 5 reads for the variant [probably homozygous variant, but small possibility of heterozygote]
              SNP 2: 20 reads for the reference and 80 reads for the variant [small possibility of heterozygote, but the imbalance of counts suggests there might be multiple read hits in the genome]
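One way to put numbers on the two cases above is a simple diploid genotype-likelihood sketch. It assumes independent reads, a flat prior over the three genotypes, and a single per-read probability `error_rate` that a read shows the wrong allele; the 1% default is an arbitrary illustrative value. Real callers use per-base qualities, priors and mapping quality, none of which are modelled here.

```python
from math import comb

def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01):
    """Posterior over hom_ref / het / hom_alt under a flat prior.

    The probability that a read shows the variant allele is assumed to be
    error_rate, 0.5, or 1 - error_rate for the three genotypes.
    """
    n = ref_reads + alt_reads
    alt_prob = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    likelihood = {
        genotype: comb(n, alt_reads) * p**alt_reads * (1 - p) ** ref_reads
        for genotype, p in alt_prob.items()
    }
    total = sum(likelihood.values())
    return {genotype: value / total for genotype, value in likelihood.items()}

if __name__ == "__main__":
    print(genotype_posteriors(1, 5))    # SNP 1
    print(genotype_posteriors(20, 80))  # SNP 2
```

The result is sensitive to the assumptions. With a 1% error rate and a flat prior, this toy model actually leans towards the heterozygote in both cases, because even one reference read is far more probable from a heterozygote than from an error. A higher error rate, a prior favouring homozygotes, or mis-mapped reads (which this model cannot represent at all) would shift that.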

              With Sanger sequencing, two observations of a variant (in a population) are typically enough to consider the variant as being present, bearing in mind that a typical definition of a SNP is for a frequency greater than 1% (or possibly 5%). I expect it would be similar for NGS. I think the SNP microarrays use a few replicate sequences per variant (e.g. see here), just to be safe.

              Edit:

              can you suggest publications about it?
              I'm not aware of any NGS publications relating to SNP discovery (because I haven't looked), but for "classical" SNP detection I guess you could look at the Wikipedia references:
              Last edited by gringer; 02-28-2012, 03:36 AM. Reason: added wikipedia link, affy reference



              • #8
                Originally posted by gringer
                That's not a particularly extreme case. It suggests SNP frequencies of 50%, which means coverage is not going to matter. Of course for a heterozygous sample, this is expected. Are these reads for a single sample (i.e. you're looking at a heterozygous sample), or for multiple samples? You should be doing your SNP detection using pooled reads for all samples, and then type according to this. A more interesting case (for a single sample) would be something like the following:

                SNP 1: 1 read for the reference and 5 reads for the variant [probably homozygous variant, but small possibility of heterozygote]
                SNP 2: 20 reads for the reference and 80 reads for the variant [small possibility of heterozygote, but the imbalance of counts suggests there might be multiple read hits in the genome]
                The examples I gave are not necessarily cases I actually have; I may have them, since I have hundreds of variants...

                My questions relate both to the examples I gave and to the ones you gave. It's easier to start with my cases.
                From what you said, I understand that I can trust my two cases equally.
                That was my question: I thought I could be more confident with (100 reads for the reference and 100 reads for the variant) than with (3 reads for the reference and 3 reads for the variant), all other things equal, because 3 errors are more likely than 100.
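The intuition that 3 erroneous reads are more plausible than 100 can be quantified with a binomial tail probability. This is a sketch assuming independent reads and a fixed per-read probability `error_rate` that an error produces this specific variant base; the 1% value is only illustrative.

```python
from math import comb

def prob_at_least_k_errors(depth, k, error_rate=0.01):
    """P(at least k of depth reads show the variant purely through error)."""
    e = error_rate
    return sum(
        comb(depth, i) * e**i * (1 - e) ** (depth - i)
        for i in range(k, depth + 1)
    )

if __name__ == "__main__":
    print(prob_at_least_k_errors(6, 3))      # 3 variant reads out of 6
    print(prob_at_least_k_errors(200, 100))  # 100 variant reads out of 200
```

Under these assumptions, both events are unlikely to be pure error, but 3 of 6 is on the order of 1 in 50,000 per site, which matters when millions of positions are tested, whereas 100 of 200 is vanishingly improbable.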

                For the cases you mentioned, it's more complicated, but the idea is the same: we calculate a proportion of variant reads, and this proportion is probably more reliable if it has been estimated from a large sample, all other things equal.
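The point about sample size can be made concrete with a confidence interval on the variant read fraction. The sketch below uses the Wilson score interval, a standard choice for binomial proportions at small counts; treating reads as independent draws is an assumption, and the function name is illustrative.

```python
from math import sqrt

def wilson_interval(variant_reads, depth, z=1.96):
    """Wilson score interval for the variant read fraction (about 95% for z=1.96)."""
    p_hat = variant_reads / depth
    z2 = z * z
    denom = 1.0 + z2 / depth
    centre = (p_hat + z2 / (2 * depth)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / depth + z2 / (4 * depth**2)) / denom
    return centre - half_width, centre + half_width

if __name__ == "__main__":
    for variant, depth in ((3, 6), (100, 200)):
        low, high = wilson_interval(variant, depth)
        print(f"{variant}/{depth}: {low:.3f} to {high:.3f}")
```

Both cases estimate a 50% variant fraction, but the interval for 3 of 6 reads spans roughly 0.19 to 0.81, while for 100 of 200 it is roughly 0.43 to 0.57. For samples where the expected fraction is not 0, 50 or 100%, that difference in precision is what the depth buys.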

                I'm studying the mutations occurring in cells of patients suffering from leukaemia. As a first study, I am looking for somatic mutations that occur at homozygous positions.
                I'm using tools like VarScan 2 and JointSNVMix for detection.
                I know that my samples have a purity of 1 (or very close to 1), but I shouldn't expect exactly 0, 50 or 100% variant reads, because not all of my cells will be mutated...

                So, to filter my (big) list of variants, I use quality criteria, and that is why I'm looking for publications about this.



                • #9
                  > I'm using tools like VarScan 2 and JointSNVMix for detection.

                  Hello Jane M,

                  This might not be the right thread, but I was wondering if you would like to share your experience with VarScan 2, JointSNVMix, Strelka, and any others you might have tried (e.g. SomaticSniper, MuTect).

                  Many thanks,

                  Rene L
