Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with samtools/bcftools

    Hi

    I am trying to call SNPs and indels from some bam files and I tried the following commands as given in the samtools/bcftools manual:

    samtools mpileup -uf reference.fa aln1.bam | bcftools view -bvcg - > var.raw.bcf
    bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

    However after running the commands I see that var.raw.bcf and var.flt.vcf files have results from only the first chromsome that is present in the bam file. I also get the following message:

    [mpileup] 1 samples in 1 input files
    <mpileup> Set max per-file depth to 8000
    [afs] 0:16349.158 1:6949.156 2:48149.686

    In my bam file Chr10 appears first followed by Chr11, 12, ...19, 1, 20, 21..., 2 etc. The reads, however, are sorted within the chromosomes. Just to make sure that the problem is not due to such arrangement of the chromsomes, I extracted Chr1, 2 and 3 and created a new bam file and sorted it using following commands.

    samtools view -bh aln1.bam Chr1 Chr2 Chr3 -o Chr1to3.bam
    samtools sort Chr1to3.bam Chr1to3sorted

    Then ran the same command as before for SNP/Indel analysis. But the problem exists.

    Can any of you please suggest what might be the possible problems/solutions?

    Thanks in advance

    ALM

  • #2
    Try running it without chromosome 1 and see what happens. Make sure the headers for all files used are complete and the reference sequence file is complete as well.

    Comment


    • #3
      Hi,

      I have generated a VCF file using the command

      samtools faidx input.reference.fa
      samtools index sortedhg19.bam
      samtools mpileup -d 8000 -uf input.reference.fa my.sortedhg19.bam | bcftools view -bvcg -> my.raw.bcf
      bcftools view my.raw.bcf | perl vcfutils.pl varFilter -D8000 > my.flt.vcf

      But when i validated my VCF file using vcf-validator of VCFTools and i am getting the warning message as given below

      The header tag 'reference' not present. (Not required but highly recommended.)
      The header tag 'contig' not present for CHROM=chr1. (Not required but highly recommended.)
      INFO field at chr1:10007 .. Could not validate the float [-nan]

      Can any one please help me to fix this.I think only the third one is critical.

      Regards,
      Amruta

      Comment


      • #4
        Could you post that entire line in the file?

        Comment


        • #5
          ##fileformat=VCFv4.1
          ##samtoolsVersion=0.1.18 (r982:295)
          ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
          ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
          ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
          ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
          ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
          ##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
          ##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
          ##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
          ##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
          ##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
          ##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
          ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
          ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
          ##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
          ##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
          ##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
          ##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
          ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
          ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
          ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
          ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
          ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
          ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
          ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
          #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 29dec/sorted.29dec.hg19.bam.bam
          chr1 10007 . T C 6.98 . DP=127;VDB=-nan;AF1=0.5001;AC1=1;DP4=1,3,0,7;MQ=20;FQ=5.64;PV4=0.36,1,1,1 GT:PL:GQ 0/1:36,0,33:34
          chr1 10186 . T A 6.98 . DP=140;VDB=-nan;AF1=0.4999;AC1=1;DP4=2,8,5,7;MQ=20;FQ=9.53;PV4=0.38,1,1,1 GT:PL:GQ 0/1:36,0,76:37
          chr1 10549 . T A 9.52 . DP=14;VDB=-nan;AF1=0.5;AC1=1;DP4=2,1,1,2;MQ=20;FQ=9.52;PV4=1,1,1,1 GT:PL:GQ 0/1:39,0,39:39
          chr1 12613 . G C 6.02 . DP=6;VDB=-nan;AF1=1;AC1=2;DP4=0,0,0,2;MQ=20;FQ=-33 GT:PL:GQ 1/1:36,6,0:6
          chr1 13275 . G A,C 3.41 . DP=9;VDB=-nan;AF1=1;AC1=2;DP4=0,0,1,2;MQ=20;FQ=-33 GT:PL:GQ 1/1:43,17,11,31,0,28:4
          chr1 13407 . T A 6.02 . DP=10;VDB=-nan;AF1=1;AC1=2;DP4=0,0,0,2;MQ=20;FQ=-33 GT:PL:GQ 1/1:36,6,0:6
          chr1 13411 . C A 5.29 . DP=10;VDB=-nan;AF1=1;AC1=2;DP4=0,0,1,2;MQ=20;FQ=-33 GT:PL:GQ 1/1:35,6,0:6
          chr1 16259 . T A 6.02 . DP=8;VDB=-nan;AF1=1;AC1=2;DP4=0,0,2,0;MQ=20;FQ=-33 GT:PL:GQ 1/1:36,6,0:6
          chr1 17216 . T G 3.64 . DP=6;VDB=-nan;AF1=0.5205;AC1=1;DP4=1,0,1,1;MQ=20;FQ=-16.1;PV4=1,0.03,1,0.44 GT:PL:GQ 0/1:31,0,11:18
          chr1 18712 . T A 3.54 . DP=15;VDB=-nan;AF1=0.4998;AC1=1;DP4=3,1,2,3;MQ=20;FQ=5.35;PV4=0.52,0.042,1,1 GT:PL:GQ 0/1:31,0,45:30
          chr1 20700 . C G 3.64 . DP=9;VDB=-nan;AF1=0.5205;AC1=1;DP4=1,0,1,1;MQ=20;FQ=-16.1;PV4=1,1,1,0.28 GT:PL:GQ 0/1:31,0,11:18
          chr1 20794 . C A 3.64 . DP=17;VDB=-nan;AF1=0.5205;AC1=1;DP4=1,0,1,1;MQ=20;FQ=-16.1;PV4=1,0.056,1,0.2 GT:PL:GQ 0/1:31,0,11:18
          chr1 21113 . T A 3.64 . DP=9;VDB=-nan;AF1=0.5205;AC1=1;DP4=0,1,1,1;MQ=20;FQ=-16.1;PV4=1,0.33,1,1 GT:PL:GQ 0/1:31,0,11:18
          chr1 21130 . C A 7.39 . DP=11;VDB=-nan;AF1=0.5718;AC1=1;DP4=0,1,2,3;MQ=20;FQ=-21;PV4=1,0.023,1,0.44 GT:PL:GQ 0/1:36,0,6:10
          chr1 21186 . G A 6.02 . DP=9;VDB=-nan;AF1=1;AC1=2;DP4=0,0,1,1;MQ=20;FQ=-33 GT:PL:GQ 1/1:36,6,0:6
          chr1 21253 . G A 8.64 . DP=24;VDB=-nan;AF1=0.5;AC1=1;DP4=5,2,3,1;MQ=20;FQ=11.3;PV4=1,0.1,1,0.1 GT:PL:GQ 0/1:38,0,76:40
          chr1 21300 . C A 3.54 . DP=19;VDB=-nan;AF1=0.4998;AC1=1;DP4=4,2,1,2;MQ=20;FQ=5.47;PV4=0.52,0.34,1,1 GT:PL:GQ 0/1:31,0,70:30
          chr1 21415 . C A 5.46 . DP=16;VDB=-nan;AF1=0.4999;AC1=1;DP4=3,2,1,2;MQ=20;FQ=7.77;PV4=1,0.41,1,0.26 GT:PL:GQ 0/1:34,0,55:34
          chr1 21564 . C A 5.29 . DP=4;VDB=-nan;AF1=1;AC1=2;DP4=0,0,2,1;MQ=20;FQ=-33 GT:P



          This is the VCF file.Hope you asked for vcf file .
          Is there any way to format reference fasta sequence?
          i have downloaded individual chromosomes from 1,2...22,X and y and joined into a single file in two orders like 1,11,12,13 ... and 1,2,3,4,.. .But in both cases this error is coming after validation

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          11 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          51 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X