  • too high number of variants for human whole exome sequencing

    Dear all,
    my lab performed whole exome sequencing of 15 human samples and I performed the variant calling. However, even after stringent filtering, I ended up with about 75.800 variants, (present in at least 1/15 samples, unique variants per sample: 35.000-42.000), Ti/Tv ratio=2.6. Compared to other studies, this number is too high. For example O'Rawe at al. http://genomemedicine.com/content/5/3/28 performed per-sample variant calling (not multi-sample variant calling like me) and checked the concordance between 5 variant calling pipelines. Considering only GATK and samtools, they found 26.323 variants in the call intersection (average over all samples, I think). I ask you all for your opinon on why I have this high number of variants with a low Ti/Tv ratio and ask for proposals what I can do about it. I realize that people often apply additional filters to extract a subset of interesting variants (e.g. only variants with AF < 0.05), but I think some of my variants must be false positives and I want to eliminate those before I do downstream analysis. Here are the necessary information, thank you very much in advance.
    • Samples: 4 samples are affected by familial PD, 6 samples are healthy (or as yet undiagnosed) family members, 5 samples are unrelated healthy people
    • Exome Enrichment Kit: Illumina TruSeq (~62 Mb target region)
    • Sequenced on an Illumina MiSeq over multiple runs; for each run and each sample an alignment was created with BWA, and the per-run alignments were then merged per sample with Picard
    • Coverage across the 15 samples (not uniform between samples): median coverage 20X-40X, with about 40-60% of the target region (TruSeq) covered at 30X or more
    • 1st processing: BWA + GATK best practice (MD, IR, BQSR, multi-sample variant calling with UnifiedGenotyper only in the target region + 20 bp offset, VQSR) --> ~124,000 variants
      then a manual filter based on alternative allele count: only variants where at least one sample had an alt allele frequency >= 0.35 and at least 10 reads supporting the alt allele were accepted --> 98,000 variants remaining
    • 2nd processing: BWA + GATK MD, IR, BQSR + variant calling with samtools/bcftools only in the target region + bcftools varfilter (total min cov 60, max cov 1500) --> 109,000 variants
      then the same manual filter on alternative allele count as in the 1st processing --> 81,000 variants remaining
    • intersection of the 98,000 GATK variants and the 81,000 samtools variants --> ~75,800 variants present in at least one sample; unique variants per sample: 35,000-42,000
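    The manual alt-allele filter described above can be illustrated on toy counts. The three-column layout here (site ID, alt-supporting reads, total reads, one line per sample per site) is invented for the example and is not an actual VCF format; a site is kept if any sample passes both thresholds:

    ```shell
    # Keep a site if ANY sample has alt-allele fraction >= 0.35
    # AND at least 10 reads supporting the alt allele.
    printf 'rs1 12 30\nrs1 2 40\nrs2 3 50\n' |
    awk '$3 > 0 && $2 >= 10 && $2/$3 >= 0.35 { print $1 }' | sort -u
    # rs1 passes (12/30 = 0.40 with 12 alt reads); rs2 fails both thresholds.
    ```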

  • #2
    Hi evakoe,

    One explanation may be that the Illumina TruSeq exome (62Mb) includes untranslated regions, both 5' and 3', in the target regions - it has one of the largest target regions of all the exome enrichment kits. In the paper you mentioned, they used the Agilent SureSelect version 2 capture kit, which does not include these regions (although there is an extended version that does).

    Also, VQSR may not be the best approach for genotyping such a small cohort of samples. GATK recommends at least 30 exomes for it to work well. Perhaps try the alternative hard-filter approach, also detailed in the best practice documents. There are many exome datasets from the 1000 Genomes Project that can be downloaded to increase your sample size, and to check the genotypes you get against published data.
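    As a sketch of that hard-filter alternative, a GATK 2.x-era VariantFiltration command might look like the following; the file names are placeholders, and the thresholds shown are the best-practice SNP hard-filter defaults documented at the time:

    ```shell
    # Hard-filter SNPs instead of VQSR (GATK 2.x-era syntax; file names are placeholders).
    # Sites failing the expression are tagged "snp_hard_filter" in the FILTER column.
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V raw_variants.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName "snp_hard_filter" \
        -o hard_filtered.vcf
    ```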

    In general, I have found that each exome has 18-20,000 SNVs in the coding regions plus first/last 2bp of introns (i.e. synonymous, non-synonymous, nonsense and splice site).

    Also, does your group of samples have close family members that allow you to check the genotypes for mendelian consistency, i.e. parents and children? This could highlight if there are many false positives.
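    A trio-style consistency check of that kind can be sketched on toy genotype calls; the tabular layout here (site, father GT, mother GT, child GT) is made up for the example, not a real VCF:

    ```shell
    # Flag sites where the child carries an allele seen in neither parent.
    printf 'site1 0/0 0/0 0/1\nsite2 0/1 0/0 0/1\n' |
    awk '{
        split($2, f, "/"); split($3, m, "/"); split($4, c, "/")
        for (i = 1; i <= 2; i++)
            if (c[i] != f[1] && c[i] != f[2] && c[i] != m[1] && c[i] != m[2]) { print $1; next }
    }'
    # site1 is flagged: the child's "1" allele appears in neither 0/0 parent.
    ```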
    Last edited by rbagnall; 10-21-2013, 08:41 PM.



    • #3
      Hi rbagnall,

      thank you very much for your reply. I think you made a good point with the large enrichment region of the TruSeq kit. I will look into this further.

      I realize that 15 samples is too few for accurate genotyping, but I have always been a bit hesitant to include 1000 Genomes samples because of the technical differences in the datasets. I can imagine that this introduces different kinds of biases that skew the results, but I should probably still give it a shot.

      I did check for Mendelian consistency as far as my samples allowed (which wasn't very far, since I have only one father and two sons that are closely related), but no obvious problem showed up.

      Thanks again, I do appreciate your opinion.



      • #4
        Hello again rbagnall and others,

        maybe I should have formulated my last post more as a question.

        1. Do you think that adding 1000 Genomes samples to my cohort will introduce biases into the variant calling due to the differences in data generation?

        2. I don't only have non-uniform coverage between the samples, but also non-uniform coverage across the target regions in one sample. Do you think this could be one reason for the high number of variants observed?

        Thank you



        • #5
          Hi,

          Since you already have BAM files for your samples, it would only take a couple of hours to run the GATK UnifiedGenotyper with hard filters, following the best practice guidelines. If the variant you are looking for has reasonable coverage in the 4 affected family members (autosomal dominant?), then this approach should call it. Since you also have some other non-affected family members, you may strike lucky and find what you are looking for without much further variant calling.
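          For reference, a multi-sample UnifiedGenotyper run of the kind suggested above might look like this (GATK 2.x-era syntax; the file names are placeholders for your BAMs and target BED):

          ```shell
          # Multi-sample variant calling restricted to the capture targets
          # (GATK 2.x-era syntax; file names are placeholders).
          java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
              -R reference.fasta \
              -I sample1.bam -I sample2.bam \
              -L target_regions.bed \
              -glm BOTH \
              -o raw_variants.vcf
          ```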

          If you want to use VQSR and add in extra 1000 Genomes exome data there are a couple of things I would consider.

          1. The 1000 Genomes exome data was generated on Illumina GAIIx, Illumina HiSeq 2000, ABI SOLiD and 454 machines, and each has its own genotyping/error biases. Since you used a MiSeq, I would lean towards exome data generated on the HiSeq 2000 rather than SOLiD or 454.


          2. The exome capture kits used by the 1000 Genomes Project varied over the course of the project, reflecting the release of newer kits as they came out. I would try to use data captured with one of the kits with the larger target regions, such as Nimblegen SeqCap EZ Human Exome v2.0, as used by Baylor College of Medicine (BCM). As far as I am aware, the 1000 Genomes Project did not use the TruSeq kit that you used. You should make a new target interval list containing the overlap of your TruSeq targets and the Nimblegen SeqCap EZ Human Exome v2.0 targets (e.g. use Bedtools' intersectBed to do this). This ensures that all of your remaining target regions are present in both your data and the 1000 Genomes data.
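          Building that combined interval list could be sketched as follows; the BED file names are placeholders for your TruSeq targets and the Nimblegen v2.0 targets:

          ```shell
          # Keep only the regions present in BOTH capture designs
          # (file names are placeholders).
          intersectBed -a truseq_targets.bed -b nimblegen_v2_targets.bed > shared_targets.bed
          ```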


          See the supplementary information from the 1000 Genomes paper for more on this, particularly from page 14.
          3. You might also want to select samples that have the same ethnicity as your samples, and you can find this info from the 1000 Genomes website

          ftp://ftp.1000genomes.ebi.ac.uk/vol1...es_samples.xls

          If you are looking for European samples (CEPH, GBR), you can download the data from the short read archives.
          But really, the UnifiedGenotyper with hard filters may be a lot quicker; then look for rare, damaging variants shared by all affected family members, possibly in reasonable candidate genes for the disease. Then look at the variants (using IGV or your favourite viewer) to see if they seem real, before genotyping to show co-segregation with disease, and publishing before home time.
          Last edited by rbagnall; 10-23-2013, 09:45 PM.



          • #6
            Dear rbagnall, thank you very much for your great suggestions.
            Eva

