  • Samtools mpileup creates extra large file after local realignment

    Hi All,
    Typically when running mpileup, my variants.raw.vcf files are around 4-6 GB (whole-exome resequencing). I recently began using local realignment and quality score recalibration with GATK, and now the .vcf files from mpileup are 25-45 GB. What would be causing this?

  • #2
    Show the actual command you use.

    Check your bcftools flags. Do you want all locations? Or just variants?
    Last edited by Richard Finney; 07-21-2011, 10:39 AM. Reason: typo fixed


    • #3
      $ samtools mpileup -uf /hg19.fasta input.sorted.rmdup.reordered.realigned.recalibrated.bam > input.variants.raw

      I would like to call just variants, not all loci.

      I do not pipe directly to bcftools like the manual, but later use this command on the file from above:

      $ bcftools view -bvcg input.variants.raw > input.variants.raw.bcf

      $ bcftools view input.variants.raw.bcf | vcfutils.pl varFilter -d 3 -D 1000 -G 20 > input.variants.flt.vcf
      Last edited by Hkins552; 07-21-2011, 10:43 AM. Reason: added info
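For comparison, the piped form from the manual mentioned above can be sketched as a small shell function. This uses the same samtools/bcftools 0.1.x-era flags already shown in this post; `call_variants` is a hypothetical helper name and the paths in the usage line are placeholders. With `-u`, mpileup writes uncompressed BCF, and it is the `-v` flag of `bcftools view` that restricts the output to variant sites, so piping avoids keeping the unfiltered intermediate on disk.

```shell
# call_variants REF BAM OUT_BCF  (hypothetical helper, not part of samtools)
# Pipes mpileup straight into bcftools so only the called variants are written.
call_variants() {
    ref="$1"
    bam="$2"
    out_bcf="$3"
    # mpileup  -u: uncompressed BCF on stdout, -f: reference FASTA
    # bcftools -b: BCF output, -v: variant sites only, -c: call, -g: genotypes
    samtools mpileup -uf "$ref" "$bam" | bcftools view -bvcg - > "$out_bcf"
}
```

Usage would look like `call_variants /hg19.fasta input.sorted.rmdup.reordered.realigned.recalibrated.bam input.variants.raw.bcf`, after which the `vcfutils.pl varFilter` step can run unchanged.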


      • #4
        Have you already taken a look at the lines of your smaller and larger file? Are there differences?

        It may also be worth trying the Unix 'comm' command, which compares two sorted files line by line for you.
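A minimal sketch of that comparison, assuming both files are plain-text VCF. `comm` needs sorted input, so each file is first reduced to sorted CHROM:POS keys. The helper names `variant_keys` and `compare_vcfs` are made up for illustration.

```shell
# variant_keys FILE  (hypothetical helper)
# Prints one CHROM:POS key per non-header VCF line, sorted and de-duplicated.
variant_keys() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 }' "$1" | LC_ALL=C sort -u
}

# compare_vcfs A B  (hypothetical helper)
# Reports how many sites are unique to each file and how many are shared.
compare_vcfs() {
    keys_a=$(mktemp)
    keys_b=$(mktemp)
    variant_keys "$1" > "$keys_a"
    variant_keys "$2" > "$keys_b"
    # comm columns: 1 = only in first, 2 = only in second, 3 = in both
    echo "only_in_first=$(LC_ALL=C comm -23 "$keys_a" "$keys_b" | wc -l | tr -d ' ')"
    echo "only_in_second=$(LC_ALL=C comm -13 "$keys_a" "$keys_b" | wc -l | tr -d ' ')"
    echo "shared=$(LC_ALL=C comm -12 "$keys_a" "$keys_b" | wc -l | tr -d ' ')"
    rm -f "$keys_a" "$keys_b"
}
```

If nearly every site in the large file is absent from the small one, that points at the large file holding far more positions rather than longer lines.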


        • #5
          I do exome capture, and I use bedtools to filter my .bams against the capture probe .bed file.

          That might help. It should winnow out some false aligning.


          • #6
            How many lines are there in input.variants.raw.bcf? In input.variants.flt.vcf?

            Using the GATK pipeline, my exome-seq VCF files are on the order of 50,000-80,000 variants (lines), with a size of around 10-20 MB (depending on the platform used for enrichment).
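One way to get those counts, as a rough sketch: count the non-header lines of the VCF text. `count_records` is a hypothetical helper name. It reads a file if one is given and standard input otherwise, so a BCF can be converted with `bcftools view` first, as in the commands earlier in the thread.

```shell
# count_records [FILE]  (hypothetical helper)
# Counts VCF records, i.e. lines that do not start with '#'.
# With no FILE argument it reads standard input.
count_records() {
    awk '!/^#/ { n++ } END { print n + 0 }' "$@"
}
```

For example, `count_records input.variants.flt.vcf`, or `bcftools view input.variants.raw.bcf | count_records` for the BCF.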

            Originally posted by swbarnes2:
            I do exome capture, and I use bedtools to filter my .bams against the capture probe .bed file.

            That might help. It should winnow out some false aligning.
            If you do this, I advise you to use a modified capture probe .bed file where you've added 50 or 100 bases (or however much) to the end of each target region. You'll drop a lot of good data if you cut off right at the boundaries of the target intervals.
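A minimal sketch of that padding step, assuming a tab-separated BED file with 0-based starts. `pad_bed` is a hypothetical helper name. It clamps new starts at 0 but does not clamp ends to chromosome length, since that would need a chromosome-sizes file, and it does not merge intervals that overlap after padding.

```shell
# pad_bed PAD FILE  (hypothetical helper)
# Widens every BED interval by PAD bases on each side and prints the result.
pad_bed() {
    pad="$1"
    bed="$2"
    awk -v pad="$pad" 'BEGIN { FS = OFS = "\t" }
        # pass comment, track and browser lines through untouched
        /^(#|track|browser)/ { print; next }
        {
            $2 = ($2 - pad < 0) ? 0 : $2 - pad   # start cannot go below 0
            $3 = $3 + pad                        # end is not clamped here
            print
        }' "$bed"
}
```

Something like `pad_bed 50 targets.bed > targets.pad50.bed` would then give the padded file, where `targets.bed` stands in for your capture probe file.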
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]


            • #7
              Originally posted by Michael.James.Clark:
              If you do this, I advise you to use a modified capture probe .bed file where you've added 50 or 100 bases (or however much) to the end of each target region. You'll drop a lot of good data if you cut off right at the boundaries of the target intervals.
              I'm pretty sure BEDTools includes reads that hang off the edge of your target, so you can still call SNPs that are just off target. But yes, I usually align to padded targets, to be sure, though I generally count coverage against unpadded target.

              We use Agilent capture, and from the non-random sizes of the targets, I think the targets must be padded as well, with respect to the exons.


              • #8
                Originally posted by swbarnes2:
                I'm pretty sure BEDTools includes reads that hang off the edge of your target, so you can still call SNPs that are just off target. But yes, I usually align to padded targets, to be sure, though I generally count coverage against unpadded target.
                Yeah, I think that's fair for coverage to be sure, but for variant calling I typically run GATK with an intervalsList containing the targets +50bp on each end.

                We use agilent capturing, and from the non-random sizes of the targets, I think the targets must be padded as well, with respect to the exons.
                With respect to the exons, that's typically true (with regard to RefSeq at least). The Agilent baits tend to extend a bit outside the exons. Still, you do end up pulling down a significant excess of reads in the flanking regions, as expected, so you gain even more. This is why I typically see a lot more variants than expected for the exome alone in one of these experiments.
