  • Strangely high proportion of variants are INDELs

    I am analysing sequencing data (exome and region capture) of ENU-mutagenised mice. The most recent sample is a capture of a ~28Mb region to which the mutation was previously mapped. After running the sample through our pipeline I found that approx. 60% of variants called within the target region were INDELs, compared to approx. 30% for similar-ish exome sequencing projects. All samples were sequenced on the Illumina GAII or HiSeq.

    Our pipeline is roughly as follows:

    Alignment with quality score recalibration with Novoalign -> remove multimapping reads -> duplicate removal with Picard's MarkDuplicates -> Variant calling with mpileup and bcftools

    with mpileup and bcftools parameters as follows:

    samtools mpileup -q1 -C50 -d10000 -L10000 -ugf $reference $bamfile | bcftools view -bvcg - > $bcffile
    bcftools view $bcffile | vcfutils.pl varFilter -D10000 -w0 -W0


    I realise there are considerable differences between my region capture sample and my comparison exome samples (e.g. different capture platforms, generally higher coverage for region capture), however this result still has me a little concerned.

    Would local realignment (such as that offered in GATK) be advisable to resolve the status of these putative INDELs?
    Does this INDEL proportion strike others as too high?

    Any suggestions would be welcome.
    Pete

  • #2
    Try redoing the mpileup with -B or -E.

    What can happen without those options is that the BAQ calculation, which is on by default, sometimes sees a real SNP and decides that it's not a SNP but an error due to an undiagnosed indel. mpileup then drops the base quality of each of those bases down to nothing, and the SNP caller won't call it because of the low quality. I've seen this happen with a few Sanger-verified SNPs. But it won't do this for indels, so this can lead to a huge number of indels compared to simple SNPs.

    If you turn off those calculations with -B, or modify them with -E, more real simple SNPs may surface, which may give you a more believable ratio.
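
    To see whether the ratio actually shifts after rerunning with -B or -E, one could tally SNPs and indels in each resulting VCF. The following is only a minimal sketch: the function name `indel_fraction` and the demo records are made up for illustration, and it assumes that indel records either carry the INDEL flag in the INFO column or have REF and ALT alleles of different lengths.

```python
def indel_fraction(vcf_lines):
    """Return (n_snp, n_indel, fraction_indel) for an iterable of VCF lines.

    Assumption: a record is an indel if its INFO column contains the
    INDEL flag, or if any ALT allele differs in length from REF.
    """
    n_snp = n_indel = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        ref, alt, info = fields[3], fields[4], fields[7]
        # Ignore missing ('.') and symbolic ('<...>') alternate alleles.
        alts = [a for a in alt.split(",") if a != "." and not a.startswith("<")]
        if not alts:
            continue  # no usable alternate allele on this record
        flagged = "INDEL" in info.split(";")
        if flagged or any(len(a) != len(ref) for a in alts):
            n_indel += 1
        else:
            n_snp += 1
    total = n_snp + n_indel
    return n_snp, n_indel, (n_indel / total if total else 0.0)


# Hypothetical records, for illustration only (not real data).
DEMO = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=20",
    "chr1\t200\t.\tCT\tC\t60\t.\tINDEL;DP=18",
    "chr1\t300\t.\tG\tGA\t40\t.\tINDEL;DP=25",
]

if __name__ == "__main__":
    snps, indels, frac = indel_fraction(DEMO)
    print(f"SNPs={snps} INDELs={indels} INDEL fraction={frac:.2f}")
```

    Running it on the VCF from the default run and on the VCF from the -B (or -E) run, by passing an open file handle instead of DEMO, would show directly whether more SNPs surface and the INDEL proportion drops.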

    • #3
      indel analysis

      Swbarnes2, regarding your comment: I am analysing indels using mpileup/bcftools. I had problems running all chromosomes together, and even a single chromosome (the analysis stopped partway through a chromosome), and someone advised me to use the -B option.
      But in that case, do you think I could get many false-positive indels? I am not using the -E option either, only some filtering based on quality scores (-q and -Q).

      My command line is
      samtools mpileup -q20 -Q20 -AB -ugf reference.fa file.bam | bcftools view -bvcg - > var.raw.bcf
      bcftools view var.raw.bcf | vcfutils.pl varFilter -D99999 > var.flt.vcf

      But if I don't use -B, I cannot run my analysis, and I have many individuals to process.

      Thanks
