  • Hkins552
    Member
    • Jun 2011
    • 18

    Samtools mpileup creates extra large file after local realignment

    Hi All,
    Typically when running mpileup, my variants.raw.vcf files run around 4-6 GB (whole-exome resequencing). I recently began using local realignment and base quality score recalibration with GATK, and now the .vcf files from mpileup are 25-45 GB. What could be causing this?
  • Richard Finney
    Senior Member
    • Feb 2009
    • 701

    #2
    Show the actual command you use.

    Check your bcftools flags. Do you want all locations? Or just variants?
    Last edited by Richard Finney; 07-21-2011, 10:39 AM. Reason: typo fixed

    • Hkins552
      Member
      • Jun 2011
      • 18

      #3
      $ samtools mpileup -uf /hg19.fasta input.sorted.rmdup.reordered.realigned.recalibrated.bam > input.variants.raw

      I would like to call just variants, not all loci.

      I do not pipe directly to bcftools as the manual does; instead, I later run these commands on the file from above:

      $ bcftools view -bvcg input.variants.raw > input.variants.raw.bcf

      $ bcftools view input.variants.raw.bcf | vcfutils.pl varFilter -d 3 -D 1000 -G 20 > input.variants.flt.vcf
      Last edited by Hkins552; 07-21-2011, 10:43 AM. Reason: added info
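For comparison, the piped form that the post says the manual uses can be sketched roughly as below. This assumes the samtools/bcftools 0.1.x tools used in this thread; the wrapper name `call_variants` and its argument order are hypothetical and only there to make the sketch reusable. With `-u`, mpileup emits uncompressed BCF for every covered site, so a file written from that stream (like `input.variants.raw` above) can be far larger than the variant-only output that `bcftools view -v` produces.

```shell
# Hypothetical helper: name and argument order are illustrative, not from the thread.
# Assumes samtools/bcftools 0.1.x, where `bcftools view -bvcg` does the calling.
# usage: call_variants <reference.fasta> <input.bam> <output.bcf>
call_variants() {
    # mpileup -u streams uncompressed BCF for all covered sites; piping it
    # into bcftools means that all-sites stream is never written to disk.
    # bcftools: -v keeps variant sites only, -c and -g call variants and
    # genotypes, -b writes BCF.
    samtools mpileup -uf "$1" "$2" | bcftools view -bvcg - > "$3"
}
```

A call would look like `call_variants /hg19.fasta input.sorted.rmdup.reordered.realigned.recalibrated.bam input.variants.raw.bcf`, after which the `vcfutils.pl varFilter` step above applies unchanged.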

      • oiiio
        Senior Member
        • Jan 2011
        • 105

        #4
        Have you compared the lines of your smaller and larger files? Are there differences?

        It may be worth mentioning the Unix 'comm' command, which will compare the two files for you.
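One way to make the `comm` comparison concrete is sketched below. It assumes both files are uncompressed VCF text (a BCF would first need converting with `bcftools view`), and the helper names `vcf_sites` and `sites_only_in_second` are hypothetical. Because `comm` expects sorted input, each file is first reduced to a sorted list of `CHROM:POS` keys.

```shell
# Hypothetical helpers; the names are illustrative only.
# Print a sorted, de-duplicated "CHROM:POS" key for every record in a VCF.
vcf_sites() {
    awk -F '\t' '!/^#/ && NF { print $1 ":" $2 }' "$1" | LC_ALL=C sort -u
}

# usage: sites_only_in_second <small.vcf> <large.vcf>
# Lists sites present in the second file but absent from the first.
sites_only_in_second() {
    a=$(mktemp) && b=$(mktemp) || return 1
    vcf_sites "$1" > "$a"
    vcf_sites "$2" > "$b"
    # comm needs sorted input; -13 suppresses columns 1 and 3,
    # leaving only the lines unique to the second file.
    LC_ALL=C comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

Piping the result through `wc -l` gives a quick count of how many extra sites the larger file contains.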

        • swbarnes2
          Senior Member
          • May 2008
          • 910

          #5
          I do exome capture, and I use bedtools to filter my .bams against the capture probe .bed file.

          That might help. It should winnow out some false alignments.
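The post does not give the exact command, so the following is only a sketch of one way to do this kind of filtering with `bedtools intersect` (called `intersectBed` in older releases). The wrapper name `filter_bam_to_targets` and the file names are hypothetical.

```shell
# Hypothetical helper; name and argument order are illustrative only.
# usage: filter_bam_to_targets <input.bam> <targets.bed> <output.bam>
filter_bam_to_targets() {
    # -abam takes the BAM as the "A" input and writes BAM output,
    # keeping alignments that overlap an interval in the BED file.
    bedtools intersect -abam "$1" -b "$2" > "$3"
}
```

The filtered BAM can then be given to `samtools mpileup` in place of the original file.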

          • Michael.James.Clark
            Senior Member
            • Apr 2009
            • 207

            #6
            How many lines are there in input.variants.raw.bcf? In input.variants.flt.vcf?

            Using the GATK pipeline, my exome-seq VCF files are on the order of 50,000-80,000 variants (lines) and around 10-20 MB in size (depending on the platform used for enrichment).
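A quick way to get those line counts is sketched below. The helper names are hypothetical, and the BCF variant assumes bcftools 0.1.x as used in this thread, where `bcftools view` prints a BCF as VCF text.

```shell
# Hypothetical helpers; the names are illustrative only.
# usage: count_vcf_records <file.vcf>
count_vcf_records() {
    # Header lines start with '#'; every other non-empty line is one record.
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# usage: count_bcf_records <file.bcf>
# Assumes bcftools 0.1.x, where `bcftools view` converts BCF to VCF text.
count_bcf_records() {
    bcftools view "$1" | awk '!/^#/ && NF { n++ } END { print n + 0 }'
}
```

For the files in this thread that would be `count_bcf_records input.variants.raw.bcf` and `count_vcf_records input.variants.flt.vcf`.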

            Originally posted by swbarnes2:
            I do exome capture, and I use bedtools to filter my .bams against the capture probe .bed file.

            That might help. It should winnow out some false alignments.
            If you do this, I advise using a modified capture probe .bed file in which you've added 50 or 100 bases (or however many) to each end of every target region. You'll drop a lot of good data if you cut off right at the boundaries of the target intervals.
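Padding a BED file this way can be sketched with awk, as below. The helper name `pad_bed` and the default of 50 bases are illustrative assumptions. The sketch clamps padded starts at 0 but does not clamp ends to chromosome length, and it does not merge targets that overlap after padding.

```shell
# Hypothetical helper; name and default padding are illustrative only.
# usage: pad_bed <targets.bed> [pad_bases]   (pad defaults to 50)
pad_bed() {
    awk -v pad="${2:-50}" 'BEGIN { FS = OFS = "\t" }
        /^(#|track|browser)/ { print; next }     # pass header lines through
        {
            $2 = ($2 - pad < 0) ? 0 : $2 - pad   # clamp the padded start at 0
            $3 = $3 + pad                        # end is not clamped to chromosome length
            print
        }' "$1"
}
```

For example, `pad_bed targets.bed 100 > targets.padded.bed` writes a copy with 100 bases added to both ends of every interval. If merged intervals are wanted, the padded file can be sorted and passed through `bedtools merge`.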
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]

            • swbarnes2
              Senior Member
              • May 2008
              • 910

              #7
              Originally posted by Michael.James.Clark:
              If you do this, I advise using a modified capture probe .bed file in which you've added 50 or 100 bases (or however many) to each end of every target region. You'll drop a lot of good data if you cut off right at the boundaries of the target intervals.
              I'm pretty sure BEDTools includes reads that hang off the edge of your target, so you can still call SNPs that are just off target. But yes, I usually align to padded targets to be sure, though I generally count coverage against the unpadded targets.

              We use Agilent capture, and from the non-random sizes of the targets, I think the targets must be padded as well, with respect to the exons.

              • Michael.James.Clark
                Senior Member
                • Apr 2009
                • 207

                #8
                Originally posted by swbarnes2:
                I'm pretty sure BEDTools includes reads that hang off the edge of your target, so you can still call SNPs that are just off target. But yes, I usually align to padded targets to be sure, though I generally count coverage against the unpadded targets.
                Yeah, I think that's fair for coverage, but for variant calling I typically run GATK with an intervalsList containing the targets plus 50 bp on each end.

                Originally posted by swbarnes2:
                We use Agilent capture, and from the non-random sizes of the targets, I think the targets must be padded as well, with respect to the exons.
                With respect to the exons, that's typically true (with regard to RefSeq at least). The Agilent baits tend to extend a bit outside the exons. Still, as expected, you end up pulling down significant excess in the flanking regions, so you gain even more. This is why I typically see a lot more variants in one of these experiments than would be expected for the exome alone.
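An interval list of targets plus 50 bp on each end could be built from the capture BED file roughly as follows. This is a sketch under assumptions: it takes the `chr:start-end` interval strings that GATK accepts through `-L` as the output format, the helper name `bed_to_gatk_intervals` is hypothetical, and the poster's actual file may have been made differently. BED starts are 0-based while these interval strings are 1-based, hence the `+ 1`.

```shell
# Hypothetical helper; name and default padding are illustrative only.
# usage: bed_to_gatk_intervals <targets.bed> [pad_bases]   (pad defaults to 50)
bed_to_gatk_intervals() {
    awk -v pad="${2:-50}" 'BEGIN { FS = "\t" }
        /^(#|track|browser)/ { next }       # skip BED header lines
        {
            start = $2 + 1 - pad            # BED start is 0-based; intervals are 1-based
            if (start < 1) start = 1        # clamp at the first base
            print $1 ":" start "-" ($3 + pad)
        }' "$1"
}
```

Something like `bed_to_gatk_intervals targets.bed 50 > targets.padded.intervals` would then produce a file to pass to GATK with `-L`.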