Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing Converage in Exome seq

    Hi guys,

    I am working on some exome datasets and ran into a problem:
    Is there already a published way of looking at sites which should have been captured but didn't be sequenced?
    E.g. we did exome sequencing of a patient and I did alignment, SNP calling (GATK) and afterwards variant annotation using Annovar. We are now down to 3 SNPs. While we are now trying to validate that experimentally I was wondering how to determine parts of the exome which are not covered by let's assume 20+ reads. Moreover, how could one determine homozygote exon deletions (they should not be covered by sequences at all, but that can happen due to chance as well)?

    Any help appreciated,
    Peter

  • #2
    Hi ulz_peter,

    We are also interested in this question. I'm surprised others have not asked this and are only looking for SNPs/indels in exome data

    I have not found a published package yet that does this.

    Hard Cutoffs for reads/exon are not safe since you'll get too many False positives for deletions if the capture did not go well or when going from one sample to another (unless you normalize reads across samples).

    However, I am in collaboration with someone who is working on this problem using a statistical approach
    Last edited by NGSfan; 01-31-2011, 06:30 AM.

    Comment


    • #3
      I haven't specifically looked into this problem but I have a feeling that BEDTools would contain functions (for example genomeCoverage) for calculating what you need.

      So something like calculate base-wise genome coverage of the exome data -> filter through the capture regions by discarding bases not in those regions -> filter out bases having coverage <20x.

      Comment


      • #4
        Originally posted by ulz_peter View Post
        Is there already a published way of looking at sites which should have been captured but didn't be sequenced?
        I'm not sure about a published way. I am working on a paper that will include this type of analysis (among many other things), so I can't go into detail yet, but I can suggest some things more generally.

        I personally wrote a perl script to do this myself. It basically calculates mean coverage across each targeted interval and generates a set of "% targets above mean coverage" metrics. It will be easy to tweak this script to instead output all the intervals below a specific coverage value (whatever you think equates to "too low" for variant calling--4X coverage or whatever).

        Just for your information, I compared multiple different exome pull-down platforms and saw various amounts of un-covered targets in each (ranging from ~1%-~7% missed).

        It appeared that even with additional sequencing, there are 1-2% of the targets in each platform that simply didn't pull down adequately and thereby failed.
        Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
        Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
        Projects: U87MG whole genome sequence [Website] [Paper]

        Comment


        • #5
          BEDTools does indeed have something. Below an excerpt from their histogram output. Column 5 is the coverage, and column 6 is how many bases were covered that well. Column 7 is the total length of the feature, column 8 is the % of the feature covered at that exact depth.

          So for the exon below, 8 bases were totally uncovered, and 122 bases were covered at 5x or greater.

          I made it with this command line, courtesy of another poster on Seqanswers

          coverageBed -abam reads.bam b exons.bed -hist >result.txt


          chr1 176160618 176161057 Spna1-41 0 8 439 0.0182232
          chr1 176160618 176161057 Spna1-41 1 59 439 0.1343964
          chr1 176160618 176161057 Spna1-41 2 92 439 0.2095672
          chr1 176160618 176161057 Spna1-41 3 103 439 0.2346241
          chr1 176160618 176161057 Spna1-41 4 55 439 0.1252847
          chr1 176160618 176161057 Spna1-41 5 21 439 0.0478360
          chr1 176160618 176161057 Spna1-41 6 49 439 0.1116173
          chr1 176160618 176161057 Spna1-41 7 27 439 0.0615034
          chr1 176160618 176161057 Spna1-41 8 25 439 0.0569476

          Comment


          • #6
            Originally posted by Michael.James.Clark View Post
            I'm not sure about a published way. I am working on a paper that will include this type of analysis (among many other things), so I can't go into detail yet, but I can suggest some things more generally.

            I personally wrote a perl script to do this myself. It basically calculates mean coverage across each targeted interval and generates a set of "% targets above mean coverage" metrics. It will be easy to tweak this script to instead output all the intervals below a specific coverage value (whatever you think equates to "too low" for variant calling--4X coverage or whatever).

            Just for your information, I compared multiple different exome pull-down platforms and saw various amounts of un-covered targets in each (ranging from ~1%-~7% missed).

            It appeared that even with additional sequencing, there are 1-2% of the targets in each platform that simply didn't pull down adequately and thereby failed.

            Yes, this is precisely the problem - you will always have 1-2% of target regions (eg. exons) that are not captured. If you have , say, 15000 target regions, then up to 300 regions you would false call as deletions. So you would have to have a way to figure out the "bad baits" and remove them from the analysis. The question is - across how many samples does a bait not have to work to call it bad? 5? 10 ? 20?

            This is a trickier problem in that I don't think thresholds like #reads/exon would work robustly.

            Variability in capture rates, sampling depths, etc would make it hard to use one cutoff consistently.. no?

            Just brainstorming here...
            Last edited by NGSfan; 02-03-2011, 10:16 AM.

            Comment


            • #7
              Just to give you a taste on what we have working (I cannot give the details since we have not published yet):

              We've got ~60 cell lines sequenced with targeted exonic capture.

              Our methods takes the bait locations, calculates the coverage and determines deleted regions. It does not use simple cutoffs, but looks at the distribution of normalized reads across all cell lines for each bait and determines what read count represents "copy number 2" (diploid) and what is less than (deletion) or greater than diploid (amplification or copy number). For now we can get the deletions reliably - but CNV counts are trickier. It's very preliminary though !

              So far CNV/deletion algorithms I know are only for whole genome sequencing.

              IGV tracks:

              Copy number/deletion prediction:
              Last edited by NGSfan; 02-16-2011, 09:25 AM.

              Comment


              • #8
                Is it the NCI 60 set?

                Comment


                • #9
                  No, it's our own sequenced exome capture from 60-80 different cell lines - it probably has considerable overlap though.

                  Although I didn't know there is exome capture data available for the NCI 60? Where would I find this?

                  Would be interesting to run it on that data.

                  Comment


                  • #10
                    I don't know if others have run into problems of not being able to get a hand on commercialized exome baits bed files. I spoke with a sales rep and he told me it is proprietary info. yikes.

                    Comment


                    • #11
                      Originally posted by csoong View Post
                      I don't know if others have run into problems of not being able to get a hand on commercialized exome baits bed files. I spoke with a sales rep and he told me it is proprietary info. yikes.
                      That's terrible. How can you validate if its working up to specs if you don't know what you are supposed to be capturing?

                      We used the Agilient mouse exome set, and they gave us a .bed file

                      Comment


                      • #12
                        Originally posted by csoong View Post
                        I don't know if others have run into problems of not being able to get a hand on commercialized exome baits bed files. I spoke with a sales rep and he told me it is proprietary info. yikes.
                        What company was this? Nimblegen? Illumina? Agilent?

                        That doesn't sound good - it is crucial to have the bait bed files - how else will you know what was probed and not probed?

                        Maybe the sales rep meant it was only for paying customers?

                        Comment


                        • #13
                          thats good to hear, at least now they can't say they arent giving out the bed files anymore

                          im using human sureselect all exon kit 50 mb

                          Comment


                          • #14
                            thanks to all you people, i finally will have my hands on the files after speaking with the company

                            Comment


                            • #15
                              Hooray, happy ending.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X