Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mapping to de novo scaffolds: good BAMs but empty VCFs. Why?

    I am unable to recover variants that I know exist when mapping to de novo assembled scaffolds, even though the BAMs (when checked in a genome browser (Artemis)) look great.

    I am mapping using bwa-0.6.1 and SAMtools for variants (indels and SNPs) to a non-standard reference, some assembled scaffolds I made myself in Velvet + SSPACE.

    Using standard methods, the VCFs end-up empty:

    Code:
    samtools mpileup -Euf scaffolds.fasta myfile_sorted.bam | bcftools view -bvcg -> myfile.bcf
    
    bcftools view myfile.bcf | vcfutils.pl varFilter > myfile.vcf
    Bypassing BCF/vcfultils and going right to a VCF finds SNPs, but the reference is always "N" (even though I know these regions have actual sequence, not NNNNN connectors).

    Example:

    Code:
    ##fileformat=VCFv4.1
    ##samtoolsVersion=0.1.18 (r982:295)
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	
    scaf.1	74	.	N	A,X	0	.	DP=111;I16=0,0,98,13,0,0,4267,165509,0,0,6133,352697,0,0,1284,21070;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	76	.	N	C,X	0	.	DP=119;I16=0,0,105,14,0,0,4564,176946,0,0,6582,378738,0,0,1252,19002;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	77	.	N	A,X	0	.	DP=129;I16=0,0,115,14,0,0,4881,186413,0,0,7182,414738,0,0,1242,18140;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	78	.	N	A,X	0	.	DP=129;I16=0,0,115,13,0,0,4844,184614,0,0,7153,413897,0,0,1242,17464;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	79	.	N	A,X	0	.	DP=138;I16=0,0,124,13,0,0,5152,195276,0,0,7662,443538,0,0,1246,16986;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	80	.	N	A,X	0	.	DP=144;I16=0,0,131,13,0,0,5409,205033,0,0,8113,471497,0,0,1260,16742;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	81	.	N	A,X	0	.	DP=143;I16=0,0,136,7,0,0,5411,206159,0,0,8239,484451,0,0,1236,16254;VDB=0.0801	PL	255,255,0,255,255,255
    scaf.1	82	.	N	G,T,X	0	.	DP=144;I16=0,0,136,8,0,0,5495,210839,0,0,8299,488051,0,0,1271,16481;VDB=0.0808	PL	255,255,0,255,255,255,255,255,255,255
    scaf.1	83	.	N	A,X	0	.	DP=146;I16=0,0,138,8,0,0,5587,214799,0,0,8419,495251,0,0,1307,16845;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	84	.	N	A,X	0	.	DP=145;I16=0,0,137,8,0,0,5586,216040,0,0,8359,491651,0,0,1347,17347;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	85	.	N	G,X	0	.	DP=151;I16=0,0,141,10,0,0,5817,225869,0,0,8688,510492,0,0,1390,18038;VDB=0.0808	PL	255,255,0,255,255,255
    scaf.1	86	.	N	G,X	0	.
    Looking at the BAM files with the scaffold reference in a browser and visualizing SNPs, I see good coverage and the SNPs are pretty obvious. Searching the no-BCF VCF files (example above) for positions in the reference where the sequence is good and I can see SNPs, the position is called as a SNP and the alternative allele is correct, but the reference is still N and Quality is 0.

    Using the same pipeline with a standard reference recovers SNPs in the VCF just as expected.
    Last edited by Genomics101; 11-27-2012, 10:30 AM. Reason: Clarity

  • #2
    Two quick things to check:

    1) remake the fasta index on your reference fasta with samtools faidx. Does it complete with no errors?

    2) double check that the names in your reference fasta match the names in your .bam. Maybe there are weird spaces or symbols that are being truncated in the .bam?

    Comment


    • #3
      @swbarnes2

      Thanks very much. faidx completes without errors. but I don't think my pipeline was making the .fai files as it should have been (I think they were empty). Making them separately this way makes nice big indices and I'll re-run the analyses with these. Any idea why samtools wasn't making the .fai properly? Thanks again!

      Later that same day....

      It worked! I still don't know why SAMtools was't making the .fais as it normally does automatically when I make the mpileup, but happy to have things working.
      Last edited by Genomics101; 11-27-2012, 12:19 PM.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Advancing Precision Medicine for Rare Diseases in Children
        by seqadmin




        Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
        12-16-2024, 07:57 AM
      • seqadmin
        Recent Advances in Sequencing Technologies
        by seqadmin



        Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

        Long-Read Sequencing
        Long-read sequencing has seen remarkable advancements,...
        12-02-2024, 01:49 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 12-17-2024, 10:28 AM
      0 responses
      27 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 12-13-2024, 08:24 AM
      0 responses
      43 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 12-12-2024, 07:41 AM
      0 responses
      29 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 12-11-2024, 07:45 AM
      0 responses
      42 views
      0 likes
      Last Post seqadmin  
      Working...
      X