
  • Samtools mpileup creates extra large file after local realignment

    Hi All,
    Typically, when I run mpileup, my variants.raw.vcf files are around 4-6 GB (whole-exome resequencing). I recently began using local realignment and base quality score recalibration with GATK, and now the .vcf files from mpileup are 25-45 GB. What would be causing this?

  • #2
    Show the actual command you use.

    Check your bcftools flags. Do you want all locations? Or just variants?
    Last edited by Richard Finney; 07-21-2011, 10:39 AM. Reason: typo fixed


    • #3
      $ samtools mpileup -uf /hg19.fasta input.sorted.rmdup.reordered.realigned.recalibrated.bam > input.variants.raw

      I would like to call just variants, not all loci.

      I do not pipe directly to bcftools as the manual does; instead, I later run this command on the file from above:

      $ bcftools view -bvcg input.variants.raw > input.variants.raw.bcf

      $ bcftools view input.variants.raw.bcf | vcfutils.pl varFilter -d 3 -D 1000 -G 20 > input.variants.flt.vcf
      Last edited by Hkins552; 07-21-2011, 10:43 AM. Reason: added info
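For reference, the piped form from the manual that post #3 mentions can be sketched as below. With -u, mpileup writes uncompressed BCF for every covered position, and it is the -v flag of bcftools view that restricts the output to variant sites, so piping avoids storing the all-sites intermediate on disk. The function name call_variants and its three arguments are hypothetical; the flags are the ones already used in this thread (samtools 0.1.x-era syntax), and this is only a sketch, not a diagnosis of the size increase.

```shell
# Hypothetical wrapper: pile up and call variants in one pass, so the
# all-sites intermediate is streamed rather than written to disk.
# Usage: call_variants REF.fasta IN.bam OUT.bcf
call_variants() {
    ref=$1
    bam=$2
    out=$3
    # mpileup -u: uncompressed BCF on stdout; -f: reference FASTA.
    # The "-" argument tells bcftools view to read from stdin.
    # bcftools view -b: BCF output, -v: variant sites only,
    # -c: call variants, -g: call genotypes.
    samtools mpileup -uf "$ref" "$bam" | bcftools view -bvcg - > "$out"
}
```

Called with the files from this thread, that would be: call_variants /hg19.fasta input.sorted.rmdup.reordered.realigned.recalibrated.bam input.variants.raw.bcf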


      • #4
        Have you compared the lines of your smaller and larger files? Are there differences?

        It may also be worth trying the Unix 'comm' command, which compares two sorted files line by line.
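A minimal sketch of that comparison, assuming both files are uncompressed VCF text. Because comm expects sorted input, each file is first reduced to sorted chromosome/position pairs. The function name sites_only_in_second is made up for illustration.

```shell
# Print CHROM<TAB>POS pairs that appear only in the second VCF.
# Usage: sites_only_in_second OLD.vcf NEW.vcf
sites_only_in_second() {
    first_sites=$(mktemp)
    second_sites=$(mktemp)
    # Drop header lines, keep columns 1-2 (CHROM, POS), sort for comm.
    grep -v '^#' "$1" | cut -f1,2 | LC_ALL=C sort -u > "$first_sites"
    grep -v '^#' "$2" | cut -f1,2 | LC_ALL=C sort -u > "$second_sites"
    # -1 hides lines unique to the first file, -3 hides lines common to both.
    LC_ALL=C comm -13 "$first_sites" "$second_sites"
    rm -f "$first_sites" "$second_sites"
}
```

On files of tens of gigabytes the sort step will take a while, so trying it on a single chromosome first may be quicker.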


        • #5
          I do exome capture, and I use bedtools to filter my .bams against the capture probe .bed file.

          That might help. It should winnow out some false aligning.


          • #6
            How many lines are there in input.variants.raw.bcf? In input.variants.flt.vcf?

            Using the GATK pipeline, my exome-seq VCF files are on the order of 50,000-80,000 variants (lines), with a size of around 10-20 MB (depending on the platform used for enrichment).
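One way to get those line counts, sketched under the assumption that the data is available as VCF text on stdin (header lines start with '#'). The function name count_vcf_records is hypothetical.

```shell
# Count data records in VCF text read from stdin, skipping '#' header lines.
count_vcf_records() {
    grep -vc '^#'
}
```

For the binary BCF, convert to text first, as in the bcftools command earlier in the thread: bcftools view input.variants.raw.bcf | count_vcf_records. For the filtered file: count_vcf_records < input.variants.flt.vcf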

            Originally posted by swbarnes2
            I do exome capture, and I use bedtools to filter my .bams against the capture probe .bed file.

            That might help. It should winnow out some false aligning.
            If you do this, I advise you to use a modified capture probe .bed file where you've added 50 or 100 bases (or however much) to the end of each target region. You'll drop a lot of good data if you cut off right at the boundaries of the target intervals.
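A rough sketch of that padding step, assuming a tab-separated BED file with 0-based starts. The function name pad_bed is hypothetical, and the pad size is whatever you choose (50 here). It does not clamp ends to the chromosome length or merge intervals that overlap after padding.

```shell
# Extend every interval in a BED file by PAD bases on both sides.
# Usage: pad_bed TARGETS.bed PAD > padded.bed
pad_bed() {
    awk -v pad="$2" '
        BEGIN { FS = OFS = "\t" }
        # Pass through comment, track and browser lines unchanged.
        /^(#|track|browser)/ { print; next }
        {
            start = $2 - pad
            if (start < 0) start = 0   # BED starts cannot be negative
            $2 = start
            $3 = $3 + pad              # not clamped to the chromosome length
            print
        }
    ' "$1"
}
```

For example, pad_bed targets.bed 50 > targets.padded.bed (both file names are placeholders).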
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]


            • #7
              Originally posted by Michael.James.Clark
              If you do this, I advise you to use a modified capture probe .bed file where you've added 50 or 100 bases (or however much) to the end of each target region. You'll drop a lot of good data if you cut off right at the boundaries of the target intervals.
              I'm pretty sure BEDTools includes reads that hang off the edge of your target, so you can still call SNPs that are just off target. But yes, I usually align to padded targets, to be sure, though I generally count coverage against unpadded target.

              We use Agilent capture, and from the non-random sizes of the targets, I think the targets must be padded as well, with respect to the exons.


              • #8
                Originally posted by swbarnes2
                I'm pretty sure BEDTools includes reads that hang off the edge of your target, so you can still call SNPs that are just off target. But yes, I usually align to padded targets, to be sure, though I generally count coverage against unpadded target.
                Yeah, I think that's fair for coverage to be sure, but for variant calling I typically run GATK with an intervalsList containing the targets +50bp on each end.

                We use Agilent capture, and from the non-random sizes of the targets, I think the targets must be padded as well, with respect to the exons.
                With respect to the exons, that's typically true (with regard to RefSeq at least). The Agilent baits tend to extend a bit outside the exons. Still, as expected, you end up pulling down a significant excess of reads in the flanking regions, so you gain even more. This is why I typically see a lot more variants than would be expected for the exome alone in one of these experiments.