Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • missing and spurious varinats calling by bwa-samtools-vcftools

    Hello:
    I am trying to discover SNPs among 24 samples. From all sequence reads combined I have constructed a reference genome (about 205 Mb) and then aligned the individual reads back to the reference genome with bwa-0.6.1. Thus for sample "S3" the commands looked like this.

    bwa aln -t 2 Reference02_singleChromosome_index S03FR-allCT.fastq > aln_S03FR-CT_Ref02.sai
    bwa samse Reference02_singleChromosome_index aln_S03FR-CT_Eg-Ref02.sai S03FR-allCT.fastq > S03FR-CT_Ref02.sam

    After bwa, I used samtools-0.1.18 to call variants:
    samtools view -uS S03FR-CT_Ref02.sam > S03FR-CT_Ref02.bam
    samtools sort S03FR-CT_Ref02.bam S03FR-CT_Ref02-sorted
    samtools index S03FR-CT_Ref02-sorted.bam

    So far so good , except that about 40% of the reads could not be aligned... I wonder where they went .

    I did so for all 24 samples. Then with mpileup
    samtools mpileup -Buf Reference02_singleChromosome.fasta S03FR-CT_Eg-Ref02-sorted.bam S04...all 24 samples | bcftools view -bvcg - > 24_samples.raw.bcf

    bcftools view 24_samples.raw.bcf | perl vcfutils.pl varFilter -D100 > 24_samples.flt.vcf

    I used IGV-2.0.34 to view the vcf file and the bam files.

    What I noticed is that many genuine variants that were called by bwa-samtools and are in the individual -sorted.bam files have not been called by vcftools. The attached file igv_snapshot1.png shows an area where some SNPs are called by both methods... but then next to it are 4 SNPs that are correctly called in the individual alignments, but they have been dropped from the .vcf output. File igv_snapshot2.png shows a SNP that is called heterozygous for all 24 samples in the .vcf output... but actually is based on a very small number of sequence reads, only 4 out of 24 samples are truly heterozygous, for all the other samples there is really no data at all.

    What can I improve so that the missing SNPs will show up?
    Is there an option in mpileup / vcftools to allow and indicate missing data? It seems to me that the program imputes genotypes for many samples where it actually should not, but should indicate missing data.

    Thanks for any help

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 11:49 AM
0 responses
15 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-24-2024, 08:47 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
61 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Working...
X