Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    We have a lot of comparisons of DINDEL, complete genomes, samtools, and other advanced indels caller, for deep whole genomes, 1000x sample low-pass whole genomes (1000g) and multi-sample exomes. Guillermo del Angel, who is writing the indel caller [plus a recalibrator and filtering program] as well as developing evaluation models, just submitted our 1000G calls (like 5 minutes ago). Over the next few weeks we'll be pushing a lot of our slide decks up to our public dropbox (have a look at the GSA wiki for the link) and there'll be a treasure-trove of analyses, evaluation material, etc. soon. All of these tools -- including our own -- are doing an ok job at indel calling, but there's still a long way to do before indels are as well handled as SNPs.

    Glad everyone is enjoying the GATK! If you want to see some crazy fun software engineering -- the slide archive has a presentation on automated distributed parallelism in the GATK, which is live in the codebase. I'd be very interested to hear if people start using this. Also, the engine as of yesterday is doing AWS S3 distributed logging so we should soon be aware of bugs as soon as they occur, anywhere in the world.

    Comment


    • #17
      Mark, thanks a lot for your updates here (and for your previous help on your GetSatisfaction page). Your project is of course quite valuable to the rest of us all. I'm looking forward to those slides.

      For everyone's information, I did just run the DINDEL function through Unified Genotyper on an exome dataset.
      It came out with about 10% the number of indels as it previously found SNPs. This is many fewer than the Dindel program using default settings. However, I generally expect to find 10% the number of indels as SNPs based on past research (see 1001 publications about the topic!), so I'm actually pleased with the number (because I found about 36k SNVs, so 3-4k indels seems reasonable).

      Since we've got you here, I wanted to ask about multi-threading in GATK. I've tried using the "nt" parameter unsuccessfully at many of the steps (set to 8 threads) even though the --help says it can use nt. Do any of the tools actually use that function?

      Also, at the end I'm trying to annotate overlap with Refseq.
      There's a page on the wiki about using the VariantAnnotator tool with RefSeq (here: http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq). I tried this and failed.
      Then I found the pages about using Refseq with the GenomicAnnotator tool (http://www.broadinstitute.org/gsa/wi...nomicAnnotator). I guess all I'm asking is whether that VariantAnnotator page is basically inaccurate and if GenomicAnnotator is the default tool for annotating overlaps that way now?

      Thanks again.
      Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
      Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
      Projects: U87MG whole genome sequence [Website] [Paper]

      Comment


      • #18
        Good to hear that UG indel calling looks like it's working! In exomes I'd expect to find actually slightly less indels than 10% so some additional filtering is going to be necessary.

        Most LocusWalker-based tools in the GATK support the -nt option. This includes CountCovariates, UG, and VariantEval. Unfortunately, indel realigner and table recalibrator don't, as they are read walkers. We don't have a particularly good parallelization approach for read walkers yet, but it's something we'd like to do.

        Comment


        • #19
          I justed wanted to know of standard values of indel calling.
          I recently tried to analyze 2 solid exome sequencing datasets. While getting ~50K Snps I only got 11 indels (for both of them) using standard values:

          Code:
          java -jar GenomeAnalysisTK.jar \
           -glm DINDEL \
           -R resources/Homo_sapiens_assembly18.fasta \
           -T UnifiedGenotyper \
           -I sample1.bam \
           -D resources/dbsnp_129_hg18.rod \
           -o snps.raw.vcf \
           -stand_call_conf 50.0 \
           -stand_emit_conf 10.0 \
           -dcov 50
          Am I doing something wrong (I forgot to mention I am using GATK-1.0.5083)

          Comment


          • #20
            Originally posted by mdepristo View Post
            Good to hear that UG indel calling looks like it's working! In exomes I'd expect to find actually slightly less indels than 10% so some additional filtering is going to be necessary.

            Most LocusWalker-based tools in the GATK support the -nt option. This includes CountCovariates, UG, and VariantEval. Unfortunately, indel realigner and table recalibrator don't, as they are read walkers. We don't have a particularly good parallelization approach for read walkers yet, but it's something we'd like to do.
            That's good, actually. I ended up with >3200 indels, but <37,000 SNVs, so that does seem to match reasonably well.

            I'll try throwing the -nt option in for subsequent runs.

            Originally posted by ulz_peter View Post
            I justed wanted to know of standard values of indel calling.
            I recently tried to analyze 2 solid exome sequencing datasets. While getting ~50K Snps I only got 11 indels (for both of them) using standard values:

            Code:
            java -jar GenomeAnalysisTK.jar \
             -glm DINDEL \
             -R resources/Homo_sapiens_assembly18.fasta \
             -T UnifiedGenotyper \
             -I sample1.bam \
             -D resources/dbsnp_129_hg18.rod \
             -o snps.raw.vcf \
             -stand_call_conf 50.0 \
             -stand_emit_conf 10.0 \
             -dcov 50
            Am I doing something wrong (I forgot to mention I am using GATK-1.0.5083)
            For one thing, I think you can skip using -dcov 50.

            What did you use for alignment?
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]

            Comment


            • #21
              Originally posted by Michael.James.Clark View Post



              For one thing, I think you can skip using -dcov 50.

              What did you use for alignment?
              Actually we got BAM files from the sequencing center, as this is solid data I guess it was the Bioscope pipeline.

              Comment


              • #22
                Originally posted by ulz_peter View Post
                Actually we got BAM files from the sequencing center, as this is solid data I guess it was the Bioscope pipeline.
                I don't know Bioscope, I'm afraid.

                I asked because seeing so few indels sounds like maybe it was aligned with an ungapped aligner.

                Did you do all the other standard GATK steps? IndelRealignment/Recalibration/etc?
                Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                Projects: U87MG whole genome sequence [Website] [Paper]

                Comment


                • #23
                  DINDEL in gatk does not work

                  Hello all,

                  Using -glm DINDEL in gatk UnifiedGenotyper does not work for me, I get error message like
                  ERROR MESSAGE: SAM/BAM file SAMFileReader{/bamfolder/demo1.bam} is malformed: Adjacent I/D events in read 2:71:17399:4521

                  Actually UnifiedGenotyper works well on the bam files without -glm option.

                  Any helps?

                  Thanks

                  Comment


                  • #24
                    Originally posted by bair View Post
                    Hello all,

                    Using -glm DINDEL in gatk UnifiedGenotyper does not work for me, I get error message like
                    ERROR MESSAGE: SAM/BAM file SAMFileReader{/bamfolder/demo1.bam} is malformed: Adjacent I/D events in read 2:71:17399:4521

                    Actually UnifiedGenotyper works well on the bam files without -glm option.

                    Any helps?

                    Thanks
                    Hi bair,

                    Don't know what to say about that error. If you see a "malformed" error, running Picard FixMateInformation.jar might help.

                    Also, always include which version of the programs you're using etc. With GATK it's pretty much always best to use the latest version through the SVN.
                    Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                    Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                    Projects: U87MG whole genome sequence [Website] [Paper]

                    Comment


                    • #25
                      Originally posted by Michael.James.Clark View Post
                      Hi bair,

                      Don't know what to say about that error. If you see a "malformed" error, running Picard FixMateInformation.jar might help.

                      Also, always include which version of the programs you're using etc. With GATK it's pretty much always best to use the latest version through the SVN.
                      Thanks your Michael, I'm using gatk 2011-01-26 buildup version, since gatk works well on the bam files for snp calling, it might be the -glm DINDEL problem. will take you suggestion to try latest svn version.

                      Comment


                      • #26
                        I am trying to run -glm DINDEL with
                        -A DepthOfCoverage
                        -A AlleleBalance
                        and try to get the allele balance info to the output, but they don't seem to be there.

                        So I am interested in finding out how many reads are supporting each indel and how many are not supporting. Is there a way to output this in GATK -glm DINDEL?

                        Comment


                        • #27
                          Originally posted by bair View Post
                          Using -glm DINDEL in gatk UnifiedGenotyper does not work for me, I get error message like
                          ERROR MESSAGE: SAM/BAM file SAMFileReader{/bamfolder/demo1.bam} is malformed: Adjacent I/D events in read 2:71:17399:4521
                          I'm seeing the same error for a BAM file I am trying to call indels on using UnifiedGenotyper from GATK v1.0.5336. When I look at the problem read, there is indeed an insertion immediately adjacent to a deletion: 21I1D33M. The BAM file validates perfectly using Picard's ValidateSamFile (version 1.35).

                          I used BWA to align reads and GATK to do local re-alignment and recalibration. This read has the same alignment in both the original BWA BAM file and the re-calibrated BAM file. I did not run Picard's FixMateInformation, as I thought the mate pair information was now fixed on the fly (see http://www.broadinstitute.org/gsa/wi..._around_indels). Running FixMateInformation does not change the cigar string for the problem read in any case.

                          My command line was:

                          Code:
                          java -Xmx4g -jar ~/bin/GenomeAnalysisTK-1.0.5336/GenomeAnalysisTK.jar \
                          -l INFO -R ../hg19_index/hg19.fasta -T UnifiedGenotyper \
                          -D ../gatk_resources/dbsnp_131_hg19.rod -I bam/test.recal.bam \
                          -glm DINDEL -o test.indels.raw.vcf -stand_call_conf 30 -stand_emit_conf 10
                          Here are the SAM records for the problem read and its mate pair:

                          Code:
                          IL30_4321:7:54:11975:13960	165	chrM	1	0	*	=	1	0	AAGAGTGCTACTCTCCTCGCTCCGGGCCCATAACACTTGGGGGTAGCTAAAGTG	:>A?B=BBB=@BBBBBBB;BBBB;CCBBB@A=@AA@CBCDDCC=>CBB=??C?B	RG:Z:IL30_4321_7	MQ:i:0	OQ:Z:FFFFFCFFFFFFFFFFFFFFFFFFDFFFFFFFFFDCFFFDCFA>DFDCBBACDD	XQ:i:257
                          IL30_4321:7:54:11975:13960	89	chrM	1	0	21I1D33M	=	1	0	TTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG	<>?@>@?BB?C>@C?<B>BBB>BC=CABABB@>?C=CCB@>C@?=CC<BBC=;<	RG:Z:IL30_4321_7	NM:i:0	OQ:Z:=DGIIEGIEDIDAIDGIIGGIIEIGIIFIEIIIFIAIIIIIIIIIIIEGIIIII	UQ:i:0
                          I did not have any trouble running the UnifiedGenotyper on this file for SNP calling.

                          Comment


                          • #28
                            is malformed: Adjacent I/D events in read

                            Originally posted by ameynert View Post
                            I'm seeing the same error for a BAM file I am trying to call indels on using UnifiedGenotyper from GATK v1.0.5336. When I look at the problem read, there is indeed an insertion immediately adjacent to a deletion: 21I1D33M. The BAM file validates perfectly using Picard's ValidateSamFile (version 1.35).

                            I used BWA to align reads and GATK to do local re-alignment and recalibration. This read has the same alignment in both the original BWA BAM file and the re-calibrated BAM file. I did not run Picard's FixMateInformation, as I thought the mate pair information was now fixed on the fly (see http://www.broadinstitute.org/gsa/wi..._around_indels). Running FixMateInformation does not change the cigar string for the problem read in any case.

                            My command line was:

                            Code:
                            java -Xmx4g -jar ~/bin/GenomeAnalysisTK-1.0.5336/GenomeAnalysisTK.jar \
                            -l INFO -R ../hg19_index/hg19.fasta -T UnifiedGenotyper \
                            -D ../gatk_resources/dbsnp_131_hg19.rod -I bam/test.recal.bam \
                            -glm DINDEL -o test.indels.raw.vcf -stand_call_conf 30 -stand_emit_conf 10
                            Here are the SAM records for the problem read and its mate pair:

                            Code:
                            IL30_4321:7:54:11975:13960	165	chrM	1	0	*	=	1	0	AAGAGTGCTACTCTCCTCGCTCCGGGCCCATAACACTTGGGGGTAGCTAAAGTG	:>A?B=BBB=@BBBBBBB;BBBB;CCBBB@A=@AA@CBCDDCC=>CBB=??C?B	RG:Z:IL30_4321_7	MQ:i:0	OQ:Z:FFFFFCFFFFFFFFFFFFFFFFFFDFFFFFFFFFDCFFFDCFA>DFDCBBACDD	XQ:i:257
                            IL30_4321:7:54:11975:13960	89	chrM	1	0	21I1D33M	=	1	0	TTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG	<>?@>@?BB?C>@C?<B>BBB>BC=CABABB@>?C=CCB@>C@?=CC<BBC=;<	RG:Z:IL30_4321_7	NM:i:0	OQ:Z:=DGIIEGIEDIDAIDGIIGGIIEIGIIFIEIIIFIAIIIIIIIIIIIEGIIIII	UQ:i:0
                            I did not have any trouble running the UnifiedGenotyper on this file for SNP calling.
                            I have the exact same problem using stampy for alignments. Are you using stampy to generate alignments? Perhaps this is unexpected behaviour coming from an atypical aligner?

                            Comment


                            • #29
                              We have done some alignments with Stampy and seen the exact same problem of adjacent insertions and deletions. Also, sometimes Stampy seems to align the reads with weird indels, and they do not seem to be fixed even with GATK IndelRealigner (whereas alignments from BWA seem to be aligned fine with GATK IndelRealigner).

                              Comment


                              • #30
                                Originally posted by siwright View Post
                                I have the exact same problem using stampy for alignments. Are you using stampy to generate alignments? Perhaps this is unexpected behaviour coming from an atypical aligner?
                                Yes, I was using Stampy 0.1.12 to generate the alignments. The problem is fixed in Stampy 0.1.13 as far as I can tell. I've generally been pretty happy with the indels generated from a Stampy > realign > score recal > unifiedgenotyper pipeline.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin


                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                                  Yesterday, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                39 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                41 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                35 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                55 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X