Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mapping to de novo scaffolds: good BAMs but empty VCFs. Why?

    I am unable to recover variants that I know exist when mapping to de novo assembled scaffolds, even though the BAMs (when checked in a genome browser (Artemis)) look great.

    I am mapping using bwa-0.6.1 and SAMtools for variants (indels and SNPs) to a non-standard reference, some assembled scaffolds I made myself in Velvet + SSPACE.

    Using standard methods, the VCFs end-up empty:

    Code:
    samtools mpileup -Euf scaffolds.fasta myfile_sorted.bam | bcftools view -bvcg -> myfile.bcf
    
    bcftools view myfile.bcf | vcfutils.pl varFilter > myfile.vcf
    Bypassing BCF/vcfultils and going right to a VCF finds SNPs, but the reference is always "N" (even though I know these regions have actual sequence, not NNNNN connectors).

    Example:

    Code:
    ##fileformat=VCFv4.1
    ##samtoolsVersion=0.1.18 (r982:295)
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	
    scaf.1	74	.	N	A,X	0	.	DP=111;I16=0,0,98,13,0,0,4267,165509,0,0,6133,352697,0,0,1284,21070;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	76	.	N	C,X	0	.	DP=119;I16=0,0,105,14,0,0,4564,176946,0,0,6582,378738,0,0,1252,19002;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	77	.	N	A,X	0	.	DP=129;I16=0,0,115,14,0,0,4881,186413,0,0,7182,414738,0,0,1242,18140;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	78	.	N	A,X	0	.	DP=129;I16=0,0,115,13,0,0,4844,184614,0,0,7153,413897,0,0,1242,17464;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	79	.	N	A,X	0	.	DP=138;I16=0,0,124,13,0,0,5152,195276,0,0,7662,443538,0,0,1246,16986;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	80	.	N	A,X	0	.	DP=144;I16=0,0,131,13,0,0,5409,205033,0,0,8113,471497,0,0,1260,16742;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	81	.	N	A,X	0	.	DP=143;I16=0,0,136,7,0,0,5411,206159,0,0,8239,484451,0,0,1236,16254;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	82	.	N	G,T,X	0	.	DP=144;I16=0,0,136,8,0,0,5495,210839,0,0,8299,488051,0,0,1271,16481;VDB=0.0808	PL	255,255,0,255,255,255,255,255,255,255
    scaf.1	83	.	N	A,X	0	.	DP=146;I16=0,0,138,8,0,0,5587,214799,0,0,8419,495251,0,0,1307,16845;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	84	.	N	A,X	0	.	DP=145;I16=0,0,137,8,0,0,5586,216040,0,0,8359,491651,0,0,1347,17347;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	85	.	N	G,X	0	.	DP=151;I16=0,0,141,10,0,0,5817,225869,0,0,8688,510492,0,0,1390,18038;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	86	.	N	G,X	0	.
    Looking at the BAM files with the scaffold reference in a browser and visualizing SNPs, I see good coverage and the SNPs are pretty obvious. Searching the no-BCF VCF files (example above) for positions in the reference where the sequence is good and I can see SNPs, the position is called as a SNP and the alternative allele is correct, but the reference is still N and Quality is 0.

    Using the same pipeline with a standard reference recovers SNPs in the VCF just as expected.
    Last edited by Genomics101; 11-27-2012, 10:30 AM. Reason: Clarity

  • #2
    Two quick things to check:

    1) remake the fasta index on your reference fasta with samtools faidx. Does it complete with no errors?

    2) double check that the names in your reference fasta match the names in your .bam. Maybe there are weird spaces or symbols that are being truncated in the .bam?

    Comment


    • #3
      @swbarnes2

      Thanks very much. faidx completes without errors. but I don't think my pipeline was making the .fai files as it should have been (I think they were empty). Making them separately this way makes nice big indices and I'll re-run the analyses with these. Any idea why samtools wasn't making the .fai properly? Thanks again!

      Later that same day....

      It worked! I still don't know why SAMtools was't making the .fais as it normally does automatically when I make the mpileup, but happy to have things working.
      Last edited by Genomics101; 11-27-2012, 12:19 PM.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin




        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
        04-22-2024, 07:01 AM
      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 11:49 AM
      0 responses
      15 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-24-2024, 08:47 AM
      0 responses
      16 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      61 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      60 views
      0 likes
      Last Post seqadmin  
      Working...
      X