  • exome/vcf merge question

    I research rare Mendelian diseases and generally look for variants shared between my related affected samples. Because exome quality and coverage vary, I want to be able to look at a merged variant file for all of my affected cases.

    Merging is relatively straightforward with VCFtools. However, when I look at the shared variants of two individuals, a variant may appear not to be shared for any of three reasons: 1) one individual has the variant allele and the other is wild-type, i.e. it is genuinely not shared; 2) the variant position is not covered in the second exome; 3) the position is covered in the second exome, but the variant allele frequency is below the cutoff for a variant call.

    Can anyone help with a method to annotate the merged variant list with the depth and allele call for each sample, including calling the wild-type allele where that is the true genotype?

    P.S. I had thought of using bedtools to annotate the read depth at each start position, but this wouldn't help me annotate wild-type calls.

    Thanks

    Josh

  • #2
    I have a similar problem. I get my sequence data in batches (not all samples at once) and would like to have a running list of variants called on samples thus far.

    As you said, the biggest issue with straightforward merging of VCFs is that we need to differentiate between
    • evidence of absence ("there is sufficient depth at this locus and this sample is reference homozygous") and
    • absence of evidence ("this sample does not have enough coverage to infer whether there is a variant at this locus").
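    The distinction above can be sketched in Python as a small classifier for one sample at one locus. This is only an illustration: the function name `classify_call`, the category labels, and the `min_depth` cutoff of 10 are assumptions, not part of any tool mentioned in this thread.

```python
def classify_call(genotype, depth, min_depth=10):
    """Classify one sample's call at one locus.

    genotype:  VCF GT string such as "0/0", "0/1" or "./."
    depth:     read depth at the locus (VCF DP), or None if missing
    min_depth: hypothetical cutoff; pick one that suits your data
    """
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "no_evidence"   # no genotype call at all
    if any(allele != "0" for allele in alleles):
        return "variant"       # at least one non-reference allele called
    if depth is not None and depth >= min_depth:
        return "hom_ref"       # evidence of absence
    return "no_evidence"       # absence of evidence: too shallow to trust 0/0


print(classify_call("0/0", 25))   # hom_ref
print(classify_call("0/0", 3))    # no_evidence
print(classify_call("./.", 0))    # no_evidence
```

    A reference-homozygous call only counts as "evidence of absence" when the depth passes the cutoff; everything else that is not a variant call is treated as "absence of evidence".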

    I am still searching for solutions and will post if I find one.
    Kamalakar Gulukota,
    Director,
    Center for Bioinformatics and Computational Biology
    NorthShore University Health System, [email protected]

    • #3
      Re: Create a VCF with your first bam file, say 1.vcf

      OK. There is a 3-step procedure that can accomplish what you want (I think).

      Step 1. Create VCFs from your first and second BAM files separately, say old.vcf and new.vcf.

      Step 2. Next, create a combined VCF from the two. I used the CombineVariants walker in GATK like so:
      Code:
      java -jar GenomeAnalysisTK.jar -T CombineVariants -R GRCh37.fa --variant old.vcf --variant new.vcf -o joined.vcf -genotypeMergeOptions UNIQUIFY
      But presumably you can do something similar with bedtools.

      Step 3. Finally, run the GATK UnifiedGenotyper on both BAM files, using the joined VCF as the target intervals, i.e. with the -L option, like so:

      Code:
      java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R GRCh37.fa -L joined.vcf -I old.bam -I new.bam -o final.vcf
      I have combined 30 old BAMs with 50 new BAMs using this method and it seems to work well.
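      With dozens of BAM files, typing the Step 3 command by hand gets error-prone, because each BAM needs its own -I flag. A minimal Python sketch that assembles the argument list is shown here; the helper name `unified_genotyper_cmd` is hypothetical, and the file names are simply the ones used in the steps above.

```python
def unified_genotyper_cmd(reference, sites_vcf, bams, out_vcf,
                          jar="GenomeAnalysisTK.jar"):
    """Build the Step 3 command as an argument list (e.g. for subprocess.run)."""
    cmd = ["java", "-jar", jar, "-T", "UnifiedGenotyper",
           "-R", reference, "-L", sites_vcf]
    for bam in bams:
        cmd += ["-I", bam]   # one -I flag per BAM file
    cmd += ["-o", out_vcf]
    return cmd


cmd = unified_genotyper_cmd("GRCh37.fa", "joined.vcf",
                            ["old.bam", "new.bam"], "final.vcf")
print(" ".join(cmd))
```

      The list can be passed straight to subprocess.run(cmd, check=True) once GATK and the input files are in place.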

      However, I hasten to add that best practice is to run variant calling on all samples together. The above procedure is a quick-and-dirty shortcut. I think it will be mostly accurate, but there will be differences between its results and those from redoing the whole shebang.
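      Once a multi-sample VCF such as final.vcf exists, the per-sample allele call and depth asked about in the first post can be read from the GT and DP entries of each sample column. The following is a rough sketch in plain Python, assuming the file carries GT and DP in its FORMAT column; the function name `per_sample_calls` and the demo records are made up for illustration and are not real data.

```python
def per_sample_calls(vcf_lines):
    """Yield (chrom, pos, ref, alt, {sample: (GT, DP)}) for each VCF record."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                      # skip blank and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]          # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        keys = fields[8].split(":")       # FORMAT keys, e.g. GT:DP
        calls = {}
        for sample, value in zip(samples, fields[9:]):
            entry = dict(zip(keys, value.split(":")))
            dp = entry.get("DP", ".")
            calls[sample] = (entry.get("GT", "./."),
                             int(dp) if dp.isdigit() else None)
        yield chrom, int(pos), ref, alt, calls


# Made-up demonstration records, not real data.
demo = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\told\tnew",
    "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:32\t0/0:28",
    "1\t67890\t.\tC\tT\t40\tPASS\t.\tGT:DP\t1/1:15\t./.",
]
for record in per_sample_calls(demo):
    print(record)
```

      For a real file, pass an open file handle instead of the demo list, e.g. per_sample_calls(open("final.vcf")).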
      Kamalakar Gulukota,
      Director,
      Center for Bioinformatics and Computational Biology
      NorthShore University Health System, [email protected]
